Clustering : Introduction
What is clustering?
To partition the observations into groups (clusters) so that the pairwise dissimilarities between those assigned to the same cluster tend to be smaller than those in different clusters.
Types of clustering algorithm
Type | Description | Example |
---|---|---|
Combinatorial algorithms | Work directly on the observed data with no direct reference to an underlying probability model | |
Iterative descent clustering | | K-means clustering |
Mixture modeling | Supposes that the data is an i.i.d. sample from some population described by a probability distribution function | Mixture model |
Mode seekers | Take a nonparametric perspective, attempting to directly estimate distinct modes of the probability distribution function | Patient Rule Induction Method (PRIM), Generalized Association Rules |
Hierarchical clustering | | |
Iterative greedy descent clustering
- An initial partition is specified
- At each iterative step, the cluster assignments are changed in such a way that the value of the criterion is improved from its previous value
- When the prescription is unable to provide an improvement, the algorithm terminates with the current assignments as its solution
Example data generation
- Make toy data using `make_blobs` from sklearn

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_blobs

# Toy data: 100 two-dimensional points drawn around 3 centers
x, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=7)
points = pd.DataFrame(x, columns=["x", "y"])
points.head()

# Scatter plot of the generated points
sns.scatterplot(x="x", y="y", data=points);
```
1. K-means
- The K-means algorithm is one of the most popular iterative descent clustering methods
Algorithm
| Step | Given | Change |
|---|---|---|
| Step 1 | A cluster assignment | Update the cluster means |
| Step 2 | A current set of means | Reassign each observation to its closest mean |
- For a given cluster assignment, the total variance is minimized with respect to $\{m_1, \ldots, m_K\}$, yielding the means of the currently assigned clusters:

$$\bar{x}_S = \underset{m}{\operatorname{arg\,min}} \sum_{i \in S} \lVert x_i - m \rVert^2$$

- Given a current set of means $\{m_1, \ldots, m_K\}$, the total variance is minimized by assigning each observation to the closest (current) cluster mean:

$$C(i) = \underset{1 \le k \le K}{\operatorname{arg\,min}} \lVert x_i - m_k \rVert^2$$
- Steps 1 and 2 are iterated until the assignments do not change
Notice on initial centroids
- One should start the algorithm with many different random choices for the starting means, and choose the solution having the smallest value of the objective function
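A minimal sketch of this procedure, using scikit-learn's `KMeans` on the `points` data generated above (variable names here are illustrative; `n_init` implements the multiple-random-starts advice from the notice):

```python
from sklearn.cluster import KMeans

# Fit K-means with several random choices of starting means and keep the
# solution with the smallest objective value (stored as inertia_).
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(points[["x", "y"]])

print(km.cluster_centers_)  # the cluster means m_1, ..., m_K
print(km.inertia_)          # total within-cluster sum of squares
```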
2. Gaussian mixture
EM algorithm
Step | Given | Update |
---|---|---|
E step | Mixture component parameters | Responsibilities |
M step | Responsibilities | Mixture component parameters |
- E step
- Assigns responsibilities for each data point based on its relative density under each mixture component
- M step
- Recomputes the component density parameters based on the current responsibilities
- Responsibilities
- For two given densities $g_0$ and $g_1$ and a data point $x$, the relative densities $\frac{g_0(x)}{g_0(x)+g_1(x)}$ and $\frac{g_1(x)}{g_0(x)+g_1(x)}$ are called the responsibilities of each cluster for this data point $x$
Gaussian mixture as soft K-means clustering
- The EM algorithm is a soft version of K-means clustering, making probabilistic (rather than deterministic) assignments of points to cluster centers
- As $\sigma^2 \to 0$, these probabilities become 0 and 1, and the two methods, K-means and the Gaussian mixture, coincide
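As a rough sketch (reusing the same `points` data as above), scikit-learn's `GaussianMixture` runs this EM iteration, and `predict_proba` returns the responsibilities, i.e. the soft assignments:

```python
from sklearn.mixture import GaussianMixture

# Fit a 3-component Gaussian mixture by the EM algorithm
gmm = GaussianMixture(n_components=3, random_state=0).fit(points[["x", "y"]])

resp = gmm.predict_proba(points[["x", "y"]])  # E step output: responsibilities
hard = resp.argmax(axis=1)                    # hard (K-means-like) assignments
print(resp[:5].round(3))
```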
3. Hierarchical clustering
Gap statistics
Tibshirani et al. proposed a statistic for choosing the optimal number of clusters in 2001, called the gap statistic.
Dendrogram
A dendrogram provides a highly interpretable, complete description of the hierarchical clustering in a graphical format.
Agglomerative vs. Divisive
Agglomerative | Divisive |
---|---|
Bottom up | Top down |
Recursively merge | Recursively split |
Begin with every observation representing a singleton cluster | Begin with the entire data set as a single cluster |
- Agglomerative clustering
- Begin with every observation representing a singleton cluster
- At each of the $N-1$ steps, the closest two clusters are merged into a single cluster, producing one less cluster at the next higher level
- Therefore, a measure of dissimilarity between two clusters must be defined
- Divisive clustering
- This approach has not been studied nearly as extensively as agglomerative methods in the clustering literature
Dissimilarity measures
Type | Formulation | Description |
---|---|---|
Single linkage | $d_{SL}(G,H) = \min_{i \in G,\, i' \in H} d_{ii'}$ | Violates the compactness property, i.e., can produce clusters with very large diameters |
Complete linkage | $d_{CL}(G,H) = \max_{i \in G,\, i' \in H} d_{ii'}$ | Violates the closeness property, i.e., observations assigned to a cluster can be much closer to members of other clusters |
Group average | $d_{GA}(G,H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$ | A compromise between SL and CL, but, unlike them, not invariant under monotone transformations of the dissimilarities |
Ward linkage | $d_{W}(G,H) = \mathrm{ESS}(GH) - [\mathrm{ESS}(G) + \mathrm{ESS}(H)]$, where $\mathrm{ESS}(X) = \sum_{i=1}^{N_X} \bigl\lVert x_i - \frac{1}{N_X} \sum_{j=1}^{N_X} x_j \bigr\rVert^2$ | |
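A short sketch of agglomerative clustering with SciPy, where the `method` argument selects one of the dissimilarity measures in the table above (the loop over all four methods is just for illustration):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = points[["x", "y"]].values

# Each linkage method corresponds to a cluster dissimilarity from the table
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)     # the N-1 merge steps
    plt.figure()
    plt.title(f"{method} linkage")
    dendrogram(Z, no_labels=True)     # dendrogram of the resulting hierarchy
plt.show()
```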
4. Self-Organizing Maps
What is a Self-Organizing Map (SOM)?
- The SOM procedure tries to bend the plane so that the buttons (green dots) approximate the data points as well as possible
- The plane holds the prototypes (usually, centroids)
- Once the model is fit, the observations can be mapped down (projected) onto the two-dimensional grid
How to do SOM?
- By updating the prototypes! (green dots!)
- The observations $x_i$ are processed one at a time (these algorithms are called online). For each $x_i$,
- Find the closest prototype $m_j$ to $x_i$ in Euclidean distance in $\mathbb{R}^p$
- For all neighbors $m_k$ of $m_j$, move $m_k$ toward $x_i$ via the update

$$m_k \leftarrow m_k + \alpha (x_i - m_k)$$

- The neighbors of $m_j$ are defined to be all $m_k$ such that the distance between $\ell_j$ and $\ell_k$ is small
- (NOTICE) That "distance" is defined in the integer topological coordinates of the prototypes, rather than in the feature space $\mathbb{R}^p$
- The simplest approach uses Euclidean distance, and "small" is determined by a threshold $r$
- The neighborhood of $m_j$ always includes $m_j$ itself
- $\alpha$ is the learning rate (see the NumPy sketch after the summary below)
- In summary, SOM is
- Input
- data
- Dimension of rectangular grid
- Parameters
- α : the learning rate
- r : the distance threshold
- Output
- Fitted prototypes
- Projected rectangular grid
- https://www.kaggle.com/code/izzettunc/introduction-to-time-series-clustering/notebook
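The online update above can be written directly in NumPy. The following is a simplified sketch under stated assumptions (square grid, fixed α and r, no learning-rate or neighborhood decay, which practical SOM implementations usually add); the function name `som_fit` is just for illustration:

```python
import numpy as np

def som_fit(X, grid=(5, 5), alpha=0.05, r=1.0, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Integer topological coordinates l_j of the prototypes on the grid
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    # Initialize the prototypes m_j with randomly chosen observations
    protos = X[rng.choice(len(X), size=len(coords), replace=False)].astype(float)

    for _ in range(n_iter):
        for x in X[rng.permutation(len(X))]:   # process observations one at a time
            j = np.argmin(((protos - x) ** 2).sum(axis=1))          # closest prototype in R^p
            near = np.linalg.norm(coords - coords[j], axis=1) <= r  # neighbors in grid coordinates
            protos[near] += alpha * (x - protos[near])              # m_k <- m_k + alpha * (x_i - m_k)
    return protos, coords

protos, coords = som_fit(points[["x", "y"]].values)
```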
SOM vs. K-means
5. Spectral clustering
Traditional clustering methods like K-means will not work well when the clusters are non-convex (see Figure 14.29 in ESL, p. 546).
Spectral clustering is a generalization of standard clustering methods, and is designed for these situations
Clustering = graph-partition problem
Clustering can be rephrased as a graph-partition problem, with the following setup:
- an $N \times N$ matrix of pairwise similarities $s_{ii'} \ge 0$ between all observation pairs
- represent the observations in an undirected similarity graph $G = \langle V, E \rangle$
- $V$: the $N$ vertices $v_i$ representing the observations
- $E$: the edges connecting pairs of vertices whose similarity is positive; the edges are weighted by $s_{ii'}$
We wish to partition the graph such that edges between different groups have low weight and edges within a group have high weight. The idea of spectral clustering is to construct similarity graphs that represent the local neighborhood relationships between observations.
Graph Laplacian
The graph Laplacian is defined by

$$L = G - W$$

where
- $W$: the matrix of edge weights $W = \{w_{ii'}\}$ from a similarity graph, called the adjacency matrix
- $G$: a diagonal matrix with diagonal elements $g_i = \sum_{i'} w_{ii'}$, the sum of the weights of the edges connected to vertex $i$, called the degree matrix

Note that this is the unnormalized graph Laplacian. A number of normalized versions have been proposed, for example $\tilde{L} = I - G^{-1} W$.
Steps for spectral clustering
1. Define an appropriate similarity graph (i.e., vertices and edges) that reflects the local behavior of the observations.
2. Compute the graph Laplacian $L$ corresponding to the similarity graph $G$.
3. Find the $m$ eigenvectors $Z_{N \times m}$ corresponding to the $m$ smallest eigenvalues of $L$ (the graph Laplacian defined above), ignoring the trivial constant eigenvector. In this representation the clusters become well separated (see Figure 14.29 in ESL, p. 546).
4. Using a standard method like K-means, cluster the rows of $Z$ to yield a clustering of the original data points.
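A from-scratch sketch of these four steps (the two-circles toy data, the RBF similarity, and its bandwidth are assumptions made for illustration, not part of the notes above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Non-convex toy data: two concentric circles
X, _ = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

# Step 1: similarity graph, here a fully connected graph with RBF edge weights w_ii'
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dists / (2 * 0.1 ** 2))
np.fill_diagonal(W, 0)

# Step 2: unnormalized graph Laplacian L = G - W
G = np.diag(W.sum(axis=1))               # degree matrix
L = G - W

# Step 3: eigenvectors of the m smallest eigenvalues
# (the very smallest belongs to the trivial constant eigenvector)
eigvals, eigvecs = np.linalg.eigh(L)     # eigh returns eigenvalues in ascending order
m = 2
Z = eigvecs[:, :m]

# Step 4: K-means on the rows of Z (here aiming to separate the two rings)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
```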
| | Silhouette Coefficient | Calinski-Harabasz Index | Davies-Bouldin Index |
|---|---|---|---|
| Idea | Mean of scaled differences between within-cluster distance and nearest-cluster distance | Ratio of between-cluster dispersion to within-cluster dispersion | Mean of the average similarity between each cluster and its most similar (nearest) one |
| Definition | For a single sample, $s = \frac{b - a}{\max(a, b)}$, where $a$ is the mean distance between a sample and all other points in the same cluster, and $b$ is the mean distance between a sample and all other points in the next nearest cluster. For a set of samples, the score is the mean of the Silhouette Coefficient for each sample. | For a set of data $E$ of size $n_E$ which has been clustered into $k$ clusters, $s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{n_E - k}{k - 1}$, where $\mathrm{tr}(W_k)$ is the trace of the within-cluster dispersion matrix $W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^\top$ and $\mathrm{tr}(B_k)$ is the trace of the between-group dispersion matrix $B_k = \sum_{q=1}^{k} n_q (c_q - c_E)(c_q - c_E)^\top$, with $C_q$ the set of points in cluster $q$, $c_q$ the center of cluster $q$, $c_E$ the center of $E$, and $n_q$ the number of points in cluster $q$. | $DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \ne j} R_{ij}$, where $R_{ij} = \frac{s_i + s_j}{d_{ij}}$, with $s_i$ the average distance between each point of cluster $i$ and the centroid of that cluster, and $d_{ij}$ the distance between centroids $i$ and $j$. |
- Silhouette Coefficient
- diff(within-cluster distance, nearest-cluster distance)
- nearest-cluster distance = min(between-cluster distances)
- For a given single sample point $x$,
- within-cluster distance $a$ = mean($a_1, a_2, \ldots$)
- nearest-cluster distance $b$ = mean($b_1, b_2, \ldots$)
- (The Silhouette Coefficient for a set of samples) = mean(the Silhouette Coefficient for each sample)
Choose optimal K
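As a sketch (reusing the toy `points` data and scikit-learn's metric functions), one common recipe is to fit K-means for several candidate K and compare the three scores defined above:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X = points[["x", "y"]].values

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          round(silhouette_score(X, labels), 3),         # higher is better
          round(calinski_harabasz_score(X, labels), 1),  # higher is better
          round(davies_bouldin_score(X, labels), 3))     # lower is better
```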
[References]