Mini-Batch K-Means is a variant of the K-Means clustering algorithm that reduces computational cost by using small, random samples of the data (mini-batches) in each iteration instead of the full dataset. This approach substantially shortens convergence time and is particularly effective on large datasets, making it a popular choice for scalable machine learning applications.

K-Means is a widely used clustering algorithm that partitions a dataset into $K$ clusters by minimizing the variance within each cluster. Mini-Batch K-Means (Sculley, 2010) optimizes this procedure by processing a small subset of the dataset in each iteration, which drastically reduces the computation needed to converge to a solution.
The objective function of the K-Means algorithm is to minimize the sum of squared distances between each data point and the centroid of its assigned cluster. For a dataset with $n$ data points, the objective function is defined as:

$$J = \sum_{i=1}^{K} \sum_{\mathbf{x} \in S_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2$$

where $S_i$ represents the set of points in cluster $i$, $\boldsymbol{\mu}_i$ is the centroid of cluster $i$, and $\lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2$ is the squared Euclidean distance between point $\mathbf{x}$ and centroid $\boldsymbol{\mu}_i$.
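As a quick illustration, the objective can be evaluated directly with NumPy; this is only a sketch, and the array names (`X`, `centroids`, `labels`) are placeholders rather than part of the luma API.

```python
import numpy as np

def kmeans_objective(X, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid.

    X         : (n_samples, n_features) data matrix
    centroids : (K, n_features) cluster centers
    labels    : (n_samples,) index of the assigned cluster for each point
    """
    # centroids[labels] aligns each point with its own cluster center,
    # so the squared differences sum directly to the objective J
    return np.sum((X - centroids[labels]) ** 2)
```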
The centroid update in Mini-Batch K-Means is performed using an incremental formula that accounts for the mini-batch processing. The new centroid $\boldsymbol{\mu}_i^{\text{new}}$ after processing a mini-batch is calculated as:

$$\boldsymbol{\mu}_i^{\text{new}} = (1 - \eta)\,\boldsymbol{\mu}_i + \eta\,\bar{\mathbf{x}}_i$$

where $\bar{\mathbf{x}}_i$ is the mean of the points in the mini-batch assigned to cluster $i$, and $\eta$ is the learning rate, which controls the extent to which the centroids are updated. The learning rate may decrease over time to ensure convergence.
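The following is a minimal NumPy sketch of a single mini-batch step under the update rule above, using a fixed learning rate for simplicity; the function and variable names (`minibatch_step`, `eta`, `batch_size`) are illustrative assumptions, not the luma implementation.

```python
import numpy as np

def minibatch_step(X, centroids, batch_size=100, eta=0.1, rng=None):
    """One Mini-Batch K-Means iteration: sample, assign, then blend centroids."""
    rng = np.random.default_rng() if rng is None else rng

    # Draw a random mini-batch from the full dataset
    batch = X[rng.choice(len(X), size=batch_size, replace=False)]

    # Assign each mini-batch point to its nearest centroid
    dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)

    # Incremental update: mu_new = (1 - eta) * mu + eta * mini-batch mean
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = batch[assign == k]
        if len(members) > 0:
            new_centroids[k] = (1 - eta) * centroids[k] + eta * members.mean(axis=0)
    return new_centroids
```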
In luma, the algorithm is provided by the `MiniBatchKMeansClustering` class, which accepts the following parameters:

- `n_clusters` : int
- `batch_size` : int, default = 100
- `max_iter` : int, default = 100

Test on the reduced wine dataset and find the optimal number of clusters via an inertia elbow plot:
```python
from luma.preprocessing.scaler import StandardScaler
from luma.reduction.linear import PCA
from luma.clustering.kmeans import MiniBatchKMeansClustering
from luma.metric.clustering import Inertia
from luma.visual.evaluation import ClusterPlot

from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
import numpy as np

X, y = load_wine(return_X_y=True)

# Standardize the features, then project onto two principal components
sc = StandardScaler()
X_std = sc.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# Fit Mini-Batch K-Means for each candidate number of clusters
scores, labels, models = [], [], []
for i in range(2, 10):
    km = MiniBatchKMeansClustering(n_clusters=i,
                                   batch_size=100,
                                   max_iter=100)
    km.fit(X_pca)

    scores.append(Inertia.score(X_pca, km.centroids))
    labels.append(km.labels)
    models.append(km)

def derivate_inertia(scores: list) -> list:
    # Central-difference approximation of the inertia curve's slope
    diff = [0.0]
    for i in range(1, len(scores) - 1):
        diff.append((scores[i - 1] - scores[i + 1]) / 2)
    diff.append(0.0)
    return diff

d_scores = derivate_inertia(scores)
best_n = np.argmax(d_scores)  # index of the sharpest drop in inertia (the elbow)

fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

clst = ClusterPlot(models[best_n], X_pca, cmap='viridis')
clst.plot(ax=ax1)

ax2.plot(range(2, 10), scores, marker='o', label='Inertia')
ax2.plot(range(2, 10), d_scores, marker='o', label='Absolute Derivative')
ax2.set_title('Inertia Plot')
ax2.set_xlabel('Number of Clusters')
ax2.set_ylabel('Inertia')
ax2.legend()
ax2.grid()

plt.tight_layout()
plt.show()
```
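Here `derivate_inertia` approximates the slope of the inertia curve with a central difference, so `np.argmax(d_scores)` picks the candidate where inertia drops most sharply, i.e. the elbow; since the loop starts at two clusters, `models[best_n]` corresponds to `n_clusters = best_n + 2`. The left panel shows the clustering selected this way, and the right panel overlays the inertia curve and its derivative.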
- Sculley, David. "Web-scale k-means clustering." Proceedings of the 19th International Conference on World Wide Web, 2010.
- Bradley, Paul S., Usama M. Fayyad, and Cory A. Reina. "Scaling clustering algorithms to large databases." Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1998.