Divisive clustering, also known as DIvisive ANAlysis clustering (DIANA), is a top-down hierarchical clustering method used in machine learning and data mining to identify homogeneous groups within a dataset based on similarity measures. Unlike agglomerative clustering, which starts with individual data points and merges them into larger clusters, divisive clustering begins with all data points in a single cluster and iteratively divides it into smaller clusters until each cluster is homogeneous or until a specified number of clusters is reached.
The divisive clustering algorithm is based on the concept of maximizing inter-cluster distance while minimizing intra-cluster distance. The process involves identifying the most dissimilar or distant data points in a cluster and using them as seeds to split the cluster into subclusters. This approach is akin to performing a series of binary splits, where at each step, the cluster with the highest heterogeneity is divided.
The divisive method is considered more straightforward in its objective than agglomerative clustering, as it explicitly aims to split the most heterogeneous cluster at each step. However, it is computationally more intensive: a cluster of $n$ points admits $2^{n-1} - 1$ possible two-way splits, so an exhaustive search is infeasible and practical algorithms such as DIANA rely on heuristics to choose each split.
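To make the splitting step concrete, below is a minimal, illustrative sketch of a single DIANA-style binary split (an assumed, simplified version for illustration, not luma's actual implementation): the point with the largest average dissimilarity to the rest seeds a splinter group, and points that are on average closer to the splinter group than to the remaining points migrate to it.

```python
# Illustrative sketch of one DIANA-style binary split (assumed behavior,
# not the luma implementation).
import numpy as np

def diana_split(X):
    """Split the rows of X into two index groups using the DIANA heuristic."""
    n = len(X)
    # Pairwise Euclidean distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    remaining = list(range(n))
    # Seed the splinter group with the point of largest average dissimilarity
    seed = int(np.argmax(D.sum(axis=1) / (n - 1)))
    splinter = [seed]
    remaining.remove(seed)

    moved = True
    while moved and len(remaining) > 1:
        moved = False
        for i in list(remaining):
            if len(remaining) <= 1:
                break
            d_rem = D[i, [j for j in remaining if j != i]].mean()
            d_spl = D[i, splinter].mean()
            if d_spl < d_rem:  # closer to the splinter group on average
                splinter.append(i)
                remaining.remove(i)
                moved = True
    return splinter, remaining

# Two well-separated pairs end up in different groups
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.1]])
print(diana_split(X))
```

A full divisive procedure would repeatedly apply such a split, typically to the cluster with the largest diameter or heterogeneity, until the requested number of clusters is reached.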
The mathematical formulation of divisive clustering involves measuring distances or dissimilarities between data points. A common distance measure for quantitative data is the Euclidean distance, defined as:

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$

where $d(x_i, x_j)$ is the distance between points $x_i$ and $x_j$, $x_{ik}$ and $x_{jk}$ are the values of the $k$-th attribute for data points $x_i$ and $x_j$, respectively, and $p$ is the number of attributes.
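For example, for the two-dimensional points $x_i = (1, 2)$ and $x_j = (4, 6)$, the distance is $d(x_i, x_j) = \sqrt{(1 - 4)^2 + (2 - 6)^2} = \sqrt{9 + 16} = 5$.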
The algorithm then seeks to maximize the distance between clusters while minimizing the distance within clusters. This objective can be quantified through criteria such as the silhouette score or the Dunn index, which evaluate the quality of the resulting clustering.
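As a quick illustration of such a criterion, the sketch below computes the mean silhouette score of a partition using scikit-learn's `silhouette_score` (used here purely for illustration; the example further down uses luma's own `SilhouetteCoefficient`).

```python
# Quality of a partition via the mean silhouette score (values near 1 are
# better; scikit-learn is used here only for illustration).
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 0.3, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)

print(silhouette_score(X, labels))  # high score: compact, well-separated clusters
```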
n_clusters : int
    The number of clusters to form.
Test on a synthetic dataset of three blobs:
```python
from luma.clustering.hierarchy import DivisiveClustering
from luma.visual.evaluation import ClusterPlot
from luma.metric.clustering import SilhouetteCoefficient

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate three Gaussian blobs to cluster
X, y = make_blobs(n_samples=300,
                  centers=3,
                  cluster_std=1.5,
                  random_state=10)

# Fit divisive clustering with the desired number of clusters
div = DivisiveClustering(n_clusters=3)
div.fit(X)

fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

# Left panel: cluster assignments; right panel: silhouette coefficients
clst = ClusterPlot(div, X, cmap='spring')
clst.plot(ax=ax1)

sil = SilhouetteCoefficient(X, div.labels)
sil.plot(ax=ax2, show=True)
```
Divisive clustering is utilized in a wide array of applications.