After finishing my master's degree, I figured I should organize everything I did during the program: projects, industry assignments, papers, lab seminars, and personal study, so I'm going through them one by one.
As the first step, I'm revisiting the material I presented at our lab seminars and reviewing it as I go; memories of staying up all night researching and preparing it keep coming back.
There is a reason for the order of these lab seminar posts. My research area was LiDAR calibration and deep-learning-based 3D object detection using LiDAR point cloud data, so I started by surveying clustering techniques, which are the foundation for understanding machine learning and the characteristics of point cloud data. After that I studied k-d trees and various 3D object detection deep learning models, and presented them at the lab seminar.
Looking back, this was probably the material I prepared most carefully, even hunting down gif files.. maybe because the concepts were easy enough that I had time to spare?
This post introduces the basic clustering techniques k-means, hierarchical clustering, and DBSCAN, and also looks at the detailed elements of each.
(The slides are reproduced as-is, so they are in English. Where the explanation was lacking, I added extra notes below.)
Clustering
Clustering is an unsupervised learning method: it learns without labels.
Samples are grouped into clusters based on similarity, without any prior knowledge about them.
To evaluate the similarity between samples and clusters, various distance measures are used, for example Euclidean distance, Mahalanobis distance, Lance-Williams distance, Hamming distance, etc.
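As a quick illustration of how these measures differ, here is a minimal Python sketch (my own addition, assuming NumPy and SciPy rather than anything from the original slides) that computes a few of them for two sample vectors.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Euclidean distance: straight-line distance in feature space.
print(distance.euclidean(x, y))

# Mahalanobis distance: Euclidean distance weighted by the inverse covariance
# of the data, so correlated / high-variance directions count differently.
data = np.random.default_rng(0).normal(size=(100, 3))
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(x, y, inv_cov))

# Hamming distance: fraction of positions at which two discrete vectors differ.
print(distance.hamming([1, 0, 1, 1], [1, 1, 1, 0]))
```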
k-means
k-means clustering (K=4)
The figure above shows how k-means clustering proceeds; the details are explained in the algorithm part right below.
Random partition
The random partitioning algorithm is the most commonly used initialization technique: each data point is first assigned to a random cluster, and the mean of the points assigned to each cluster is taken as its initial centroid.
Random partitioning is independent of the data order.
Forgy
Forgy selects k random data points from the data set and uses them as the initial centroids of the clusters.
Forgy is likewise independent of the data order.
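To make the difference between these two initializations concrete, here is a small NumPy sketch (my own illustration, not from the slides) of Forgy and random-partition initialization; Forgy spreads the initial centroids like the data, while random partitioning places them all near the global mean.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))  # toy 2D data
k = 4

# Forgy: pick k existing data points and use them as the initial centroids.
forgy_centroids = X[rng.choice(len(X), size=k, replace=False)]

# Random partition: assign every point to a random cluster first,
# then use each cluster's mean as its initial centroid.
labels = rng.integers(0, k, size=len(X))
partition_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print("Forgy:\n", forgy_centroids)
print("Random partition:\n", partition_centroids)
```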
k-means++
Rather than generating all k centroids at once, k-means++ selects one centroid at a time, repeating the selection a total of k times and placing each new centroid as far as possible from the centroids already chosen (the detailed algorithm is described below).
Limitation
* Cluster number K: with too small a K, distinct groups get merged; with too large a K, natural groups get split. Solution: find the optimal K.
* Local minimum: the final result depends on the initial points. Solution: k-means++.
* Sensitive to outliers: noisy data drags the centroids away. Solution: filtering.
* Non-spherical data: k-means implicitly assumes roughly spherical clusters. Solution: DBSCAN.
Cluster number - K decision
Rule of thumb
$K \approx \sqrt{n/2}$ (where $n$ is the number of data points)
Elbow method
Monitor the results while increasing the number of clusters one at a time.
If adding one more cluster does not improve the result much compared to before, take that as the number of clusters to use.
The quantity monitored is the total within-cluster sum of squared distances between each object and its cluster centroid (tot.withinss: total within-cluster sum of squares in R's kmeans() output).
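Below is a minimal sketch of the elbow method, assuming scikit-learn (which the original post does not use): fit k-means for increasing k and monitor the total within-cluster sum of squares, exposed as the inertia_ attribute.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Total within-cluster sum of squares for k = 1..9.
wss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)

# Look for the "elbow": the k after which adding a cluster barely reduces WSS.
for k, w in enumerate(wss, start=1):
    print(k, round(w, 1))
```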
Silhouette Method [3]
It can be used to evaluate the validity of a clustering result and to select an appropriate number of clusters.
If the silhouette value of object i is close to 1, the distances to objects in its own cluster are smaller than the distances to other clusters, which means object i is well clustered.
$s(i) = \dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$
where $a(i)$ is the average distance from object $i$ to the other objects in its own cluster (cohesion) and $b(i)$ is the smallest average distance from $i$ to the objects of any other cluster (separation).
If many objects have low or negative silhouette values, the clustering configuration probably has too many or too few clusters.
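A corresponding sketch for the silhouette method, again assuming scikit-learn: compare the average silhouette value over all objects across candidate values of k.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Average s(i) over all points; closer to 1 means tighter, better-separated clusters.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```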
K-means++[4]
Select the first centroid uniformly at random from the data.
Each subsequent centroid is sampled from a distribution that gives high probability to points far from the centroids already selected: pick the next centroid according to
$P(x) = \dfrac{D(x)^2}{\sum_{x' \in X} D(x')^2}$
where $D(x)$ is the distance from point $x$ to the nearest centroid chosen so far.
Repeat the previous step K−1 times.
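The seeding step can be written in a few lines of NumPy; this is my own sketch of the sampling rule above, not the reference implementation from [4].

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centroids with the k-means++ sampling rule."""
    # 1. First centroid: a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # 2. D(x)^2: squared distance from each point to its nearest chosen centroid.
        diff = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min((diff ** 2).sum(axis=-1), axis=1)
        # 3. Sample the next centroid with probability proportional to D(x)^2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
print(kmeans_pp_init(X, 4, rng))
```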
Hierarchical clustering
Clustering starts from the smallest units and proceeds bottom-up, repeatedly merging clusters until all the data is grouped together.
Bottom-up (Agglomerative)
Single link
$\mathrm{dist}(A,B) = \min_{x \in A,\, x' \in B} d(x, x')$
Complete link
$\mathrm{dist}(A,B) = \max_{x \in A,\, x' \in B} d(x, x')$
Average link
$\mathrm{dist}(A,B) = \dfrac{1}{|A|\,|B|} \sum_{x \in A} \sum_{x' \in B} d(x, x')$
Centroid link
$\mathrm{dist}(A,B) = d(\mu_A, \mu_B)$
Ward's method
$\mathrm{dist}(A,B) = \dfrac{|A|\,|B|}{|A|+|B|}\,\lVert \mathrm{mean}(A) - \mathrm{mean}(B) \rVert^2$
Top-down (Divisive)
DIANA (DIvisive ANAlysis) method
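As a sketch of the bottom-up (agglomerative) case, assuming SciPy: the same data can be merged with any of the linkage criteria above, and the resulting tree is cut at a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(3, 0.5, size=(50, 2))])

# Build the agglomerative merge tree with different linkage criteria.
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                     # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes
```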
DBSCAN[1]
Compared to the previous clustering techniques, DBSCAN incorporates the concept of density.
There is no formula or objective statistic that gives theoretically proven values for DBSCAN's input parameters MinPts and Eps; there are only subjective, heuristic methods that can help you determine them.
MinPts
DBSCAN's original paper (ref [1]) recommends MinPts = 4 for 2D data (to limit the computational load).
There is also a paper (ref [2]) that recommends MinPts = 2 * dim for datasets with more than two dimensions.
Another heuristic is MinPts = ln(n), where n is the number of points in the database.
Eps
For each point, find the distance k-dist to its k nearest neighbours, i.e. the k-NN (k-Nearest Neighbor) distance.
Sort the k-dist values in descending order and plot them, with the sorted points on the X-axis and k-dist on the Y-axis.
Using the elbow method, the k-dist at the first "valley" point is taken as Eps.
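Here is a sketch of that heuristic, assuming scikit-learn: compute the distance to the MinPts-th nearest neighbour for every point, sort it, and read Eps off the curve before running DBSCAN. The percentile cut below is just a stand-in for reading the elbow by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
min_pts = 4  # MinPts recommended for 2D data in [1]

# k-dist: distance from every point to its MinPts-th nearest neighbour
# (n_neighbors includes the point itself, hence the +1).
dists, _ = NearestNeighbors(n_neighbors=min_pts + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])[::-1]

# Normally k_dist is plotted and Eps is read off the first "valley" (elbow);
# here a value near the flat tail of the curve is taken just as an example.
eps = k_dist[int(0.9 * len(k_dist))]
labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
print("eps =", round(eps, 3), "clusters found:", len(set(labels) - {-1}))
```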
Limitation
DBSCAN performs many more calculations and is slower than k-means.
It is strongly influenced by the radius (Eps) and threshold (MinPts) settings.
The 'curse of dimensionality', a common drawback of all models based on squared Euclidean distance, also applies.
As the dimensionality of the dataset grows, the amount of required training data increases rapidly, which in turn demands a lot of computation.
DBSCAN is not entirely deterministic, because the algorithm starts from a random point; border points that are reachable from more than one cluster can end up in either of them.
Comparison
k-means
Hierarchical clustering
DBSCAN
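To make the comparison concrete, here is a small scikit-learn sketch (my addition, not from the slides) that runs all three techniques on the same non-spherical toy dataset so their behaviour can be compared side by side.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

models = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=2, linkage="average"),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=4),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    print(name, "-> labels:", sorted(set(labels)))
```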
For more: case studies with LiDAR (PCL) // MATLAB examples, focused on applying the examples // for plane fitting, RANSAC and plane clustering.
[Reference]
* [1] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu, 1996, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", KDD-96
* [2] Erich Schubert, Jorg Sander, Martin Ester, Hans Peter Kriegel, Xiaowei Xu, 2017, "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN", ACM Transactions on Database Systems (TODS)
* [4] Arthur, D. and Vassilvitskii, S., 2007, "k-means++: The Advantages of Careful Seeding", Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), Society for Industrial and Applied Mathematics, pp. 1027-1035