Clustering analysis : Key concepts and Implementation in Python

hyangki0119·2022년 3월 28일

Clustering : Introduction

What is clustering?

To partition the observations into groups(clustersclusters) so that the pairwise dissimilarities between those assigned to the same cluster tend to be smaller than those in differenct clusters

Types of clustering algorithm

Work directly on the observed data with no direct reference to an underlyning probability model
Iterative descent
K-means clustering
Supposes that the data is an i.i.d sample from some population described by a probability distribution functionMixture model
Take a nonparametric perspective, attempting to directly estimate distinct mode of the probability distribution functionPatient Rule Induction Method(PRIM),
Generalized Association Rules

Iterative greedy descent clustering

  • An initial partition is specified
  • At each iterative step, the cluster assignment are changed in such a way that the value of the criterion is improved from its previous value
  • When the prescription is unable to provide an improvement, the algorithm terminates with the current assignmnet as its solution

Example data generation

  • Make a toy data using make_blobs in sklearn!
from sklearn.datasets import make_blobs

x, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=7)
points = pd.DataFrame(x, y).reset_index(drop=True)
points.columns = ["x", "y"]

import seaborn as sns
sns.scatterplot(x="x", y="y", data=points, palette="Set2");

1. K-means

  • The K-means algorithms is one of the most popular iterative descent clustering methods


Step1Cluster assignmentChange the means of cluster
Step2A current set of meansAssigning of each observations
  1. For a given cluster assignment, the total variance is minimized with respect to {m1,,mK}\{m_{1},\dots,m_{K}\} yielding the means of the currently assigned cluster
xˉS=arg minmiSxim2\bar{x}_{S} = \argmin_{m} \sum_{i\in S} \lvert\lvert x_i - m \rvert\rvert^{2}
  1. Given a current set of means {m1,,mK}\{m_{1},\dots,m_{K}\}, the total variance is minimized by assigning each observation to the closest (current) cluster mean
C(i)=arg min1kKximk2C(i) = \argmin_{1\le k\le K} \lvert\lvert x_i - m_k \rvert\rvert^{2}
  1. Step 1 and 2 are iterated until the assignmnets do not change

Notice on intial centroids

  • One should start the algorithm with many different random choices for the starting means, abd choose the solution having smallest value of the objective fucntion

2. Gaussian mixture

EM algorithm

E stepMixture component parametersResponsibilities
M stepResponsibilitiesMixture component parameters
  • E step
    • Assigns responsibilities for each data point based in its relative density under each mixture component
  • M step
    • Recomputes the component density parameters based on the current responsibilities

  • Responsibilities
    • For given two density g0g_0 and g1g_1 and a data point xx, the relative densities
      g0(x)g0(x)+g1(x),g1(x)g0(x)+g1(x)\frac{g_0(x)}{g_0(x) + g_1(x)}, \quad\frac{g_1(x)}{g_0(x) + g_1(x)}
      are called the responsibilityresponsibility of each cluster, for this data point xx

Gaussian mixture as soft K-means clustering

  • EM algorithm is a soft version of K-means clustering, making probabilistic (rather than deterministic) assignments of points to cluster centers
  • As σ20\sigma^2\to 0, these probabilities become 0 and 1, and the two methods K-means and Gaussian mixture coincide

3. Hierachical clustering

Gap statistics

Tibishirani proposed a statistic for choosing optimal number of clusters in 2001, called gap statisticsgap\text{ }statistics:


A dendrogramdendrogram provides a highly interpretable complete description of the hierachical clustering in a graphical format.

Agglomerative vs. Divisive

Bottom upTop down
Recursively mergeRecursively split
Begin with every observation representing a singleton clusterBegin with the entire data set as a single cluster
  • Agglomerative clustering
    • Begin with every observation representing a singleton cluster
    • At each of the N1N-1 steps, the closest two cluseters are merged into a single cluster, producing one less cluster at the next higher level
    • Therefore, a measure of dissimilarity between two clusters must be defined
  • Divisive clustering
    • This approach has not been studied nearly as extensively as agglomerative methods in the clustering literature

Dissimilarity measures

Single linkagedSL(G,H)=miniGiHdiid_{SL}(G,H)=\min\limits_{\substack{i\in G \\ i^{\prime}\in H}} d_{ii^{\prime}}- Violate the compactness property
- i.e., produce clusters with very large diameters
Complete linkagedCL(G,H)=maxiGiHdiid_{CL}(G,H)=\max\limits_{\substack{i\in G \\ i^{\prime}\in H}} d_{ii^{\prime}}- Violate the closeness property
- i.e., observations assigned to a cluster can be much closer
to members of the other clusters
Group averagedGA(G,H)=1NGNHiGiHdiid_{GA}(G,H)=\frac{1}{N_G N_H}\sum\limits_{i\in G}\sum\limits_{i^{\prime}\in H}d_{ii^{\prime}}- Compromise SL and CL
- But have invariance property
Ward linkagedW(G,H)=ESS(GH)[ESS(G)ESS(H)]d_{W}(G,H)=ESS(GH)-[ESS(G)ESS(H)] where
ESS(X)=i=1NXxi1NXi=jNXxj2ESS(X) = \sum\limits_{i=1}^{N_X} \lvert x_i - \frac{1}{N_X}\sum\limits_{i=j}^{N_X}x_j\rvert^2 with

4. Self-Organizing Maps

What is Self-Organizing Maps(SOM)?

  • The SOM procedure tries to bend the plane so that the buttons(green dots) approximate the data points as well as possible
    • The plane has the prototype(usually, centroid)
  • Once the model is fit, the observations can be mapped down(projected) onto the two-dimensional grid

How to do SOM?

  • By updating the prototypes! (green dots!)
  • The observations xix_i are processed one at a time(these algorithms are called onlineonline). For each xix_i,
    1. Fine the closest prototype mjm_j to xix_i in Euclidean distance in Rp\mathbb{R}^{p}
    2. For all neighbors mkm_k of mjm_j, move mkm_k toward xix_i via the update
      mkmk+α(ximk)m_{k} \gets m_{k} + \alpha(x_i - m_k)
    • The neighbors of mjm_j are defined to be all mkm_k such that the distance between j\ell_{j} and k\ell_{k} is small
      • (NOTICE) That "distance" is defined in the integer topological coordinates of the prototypes, rather than in the feature space Rp\mathbb{R}^p
    • The simplest approach uses Euclidean distance, and "small" is determined by a threshold rr
    • The neighborhood of mjm_j always includes mjm_j itself
    • α\alpha is the learning rate
  • In summary, SOM is
    • Input
      • data
      • Dimension of rectangular grid
    • Parameters
      • α\alpha : the learning rate
      • rr : the distance threshold
    • Output
      • Fitted prototypes
      • Projected retangular grid

SOM vs. K-means

5. Spectral clustering

Traditional clustering methods like KK-means will not work well when the clusters are non-convex like below: (See Figure 14.29 in ESL p.546)

Spectral clustering is a generalization of standard clustering methods, and is designed for these situations

Clustering = graph-partition problem

Clustering can be rephrased as a graph-partition problem, with setup as below:

  • a N×NN\times N matrix of pairwise similarities sii0s_{ii^{\prime}}\ge 0 between all observation pairs
  • represent the obeservations in an undirected similarity graphundirected\text{ } similarity\text{ }graph
    • VV : the NN vertices viv_i representing the observations
    • EE : the edges connecting pairs of vertices if their similarity is positive, the edges are weighted by the siis_{ii^{\prime}}

We wish to partition the graph, such that edges between different groups have low weight, and within a group have high weight. The idea of spectral clustering is to construct similarity graphs that represent the local neighborhood relationships between observations.

Graph Laplacian

The graph Laplaciangraph\text{ }Laplacian is defined by



  • WW : the matrix of edge weights W={wii}W=\{w_{ii^{\prime}}\} from a similarity graph is called the adjacency matrixadjacency\text{ }matrix
  • GG : a diagonal matrix with diagonal elements gi=iwiig_i=\sum\limits_{i^{\prime}}w_{ii^{\prime}}, the sum of the weights of the edges connected to it. This is called degree matrixdegree\text{ }matrix

Note that this is unnormalized graph Laplacianunnormalized\text{ }graph\text{ }Laplacian. There are a number of normalized versions have been proposed - for example, L~=IG1W\tilde{L}=I-G^{-1}W.

Steps for spectral clustering

  1. Define appropriate similarity graphsimilarity\text{ }graph (i.e., vertices and edges) that reflects local behavior of observations.

  2. Get the graph Laplaciangraph\text{ }Laplacian LL correspoding to the similarity graph GG

  3. Find the mm eigenvectors ZN×mZ_{N\times m} corresponding to the mm smallest eigenvalues of LL (graph Laplaciangraph\text{ }Laplacian, defined above). (For understanding, see this link and this link). This changes the data points from left to right. (See Figure 14.29 in ESL p.546)

  4. Using a standard method like KK-means, cluster the rows of ZZ to yield a clustering of the original data points

Performance evaluation

Silhouette CoefficientCalinski-Harabasz IndexDavies-Bouldin Index
Mean of scaled differences between within-cluster distance and nearest-cluster distanceThe ratio of the between-cluster dispersion and within-cluster dispersionMean of the average similarity between each cluster and its most similar(nearest) one
For a single sample, s=bamax(a,b)s=\dfrac{b-a}{\max(a, b)}
- aa : the mean distance between a sample and all other points in the same cluster
- bb : the mean distance between a sample and all other points in the next nearest cluster

For a set of samples, the mean of the Silhouette Coefficient for each sample
For a set of data EE of size nEn_{E} which has been clustered into kk clusters, the Calinski-Harabasz score is defined as
- tr(Bk)tr(B_{k}) : the trace of the between group dispersion matrix
- tr(Wk)tr(W_{k}) : the trace of the within-cluster dispersion matrix
Wk=q=1Kxcq(xcq)(xcq)W_{k}=\sum\limits_{q=1}^{K}\sum\limits_{x\in c_{q}}(x-c_{q})(x-c_{q})^{\top}
- CqC_{q} : the set of points in cluster qq
- cqc_q : the center of cluster qq
- cEc_{E} : the center of EE
The Davies-Bouldin index is defined as
DB=1Ki=1KmaxijRijDB=\dfrac{1}{K}\sum\limits_{i=1}^{K}\max\limits_{i\ne j} R_{ij}
Rij=si+sjdijR_{ij}=\dfrac{s_i + s_j}{d_{ij}}
- sis_i : the average distance between each point of cluster ii and the centroid of that cluster
- dijd_{ij} : the distance between cluster centroid ii and jj
  • Silhouette Coefficient
    • diff(within cluster distance, nearest cluster distance)
      • nearest cluster distance = min(between cluster distance)
    • For a given single sample point xx,
      • within cluster distance a=mean(a1,a2,...)a = mean(a1,a2,...)
      • nearest cluster distance b=mean(b1,b2,...)b = mean(b1,b2,...)
    • (The Silhouette Coefficient for a set of samples)= mean(the Silhouette Coefficient for each sample)

Choose optimal K


Data science & Machine learning, baking and reading(≪,≫)

0개의 댓글