Clustering is the task of grouping a set of objects such that those within each cluster are more closely related to one another than to objects in other clusters. Note that clustering is unsupervised: we learn the grouping from the data alone, without ground-truth labels.
In this post, we will explore the following: K-means clustering, Gaussian mixture models, hierarchical clustering, self-organizing maps (SOM), spectral clustering, and how to evaluate clustering results.
First, import the packages we will use for cluster analysis and plotting in Python:
import pandas as pd
import numpy as np
import math
import scipy as sp
import matplotlib.pyplot as plt
To make a toy example data set, make_blobs in sklearn.datasets is used. make_blobs generates Gaussian blobs for clustering; choose an appropriate random_state so the result is reproducible.
from sklearn.datasets import make_blobs
x, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=7)
points = pd.DataFrame(x, columns=["x", "y"])
points.head()
Check the generated toy data set using the plot below. It is so pretty!
import seaborn as sns
sns.scatterplot(x="x", y="y", data=points);
For pretty visualizations, I prepare a color map with three distinct colors in advance:
import matplotlib.pyplot as plt
def get_cmap(n, name='viridis'):
'''Returns a function that maps each index in 0, 1, ..., n-1 to a distinct
RGB color; the keyword argument name must be a standard mpl colormap name.'''
return plt.cm.get_cmap(name, n)
cmap = get_cmap(3)
The K-means clustering algorithm partitions the data into a chosen number of clusters K by alternating two steps: assign each point to its nearest cluster center, then move each center to the mean of its assigned points. This minimizes the within-cluster sum of squares.
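To make the procedure concrete, here is a minimal NumPy sketch of Lloyd's algorithm; the helper name kmeans_lloyd and the fixed iteration count are my own, not part of scikit-learn.
def kmeans_lloyd(X, k, n_iter=10, seed=0):
    # pick k distinct points as the initial centers
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its points
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans_lloyd(points[["x", "y"]].to_numpy(), k=3)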
A practical issue of K-means clustering is its dependence on the initial values, so we fix random_state:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=10)
kmeans_labels = model.fit_predict(points[["x", "y"]])  # fit and predict in one step
points['cluster'] = kmeans_labels
import matplotlib.pyplot as plt
fig = plt.figure()
for i in range(len(points)):
plt.plot([points['x'][i]], [points['y'][i]], marker='o', color=cmap(points['cluster'][i]))
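To see that dependence on the initialization directly, here is a small sketch (the seeds are chosen arbitrarily) that runs K-means with a single random initialization per seed and prints the final inertia, the within-cluster sum of squares; different seeds can end in different local optima.
for seed in range(4):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed)
    km.fit(points[["x", "y"]])
    print(seed, km.inertia_)  # varies across seeds when a run gets stuck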
The K-means clustering procedure is closely related to the EM algorithm for estimating a certain Gaussian mixture model. For more on clustering using Gaussian mixtures, see here.
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(points[["x", "y"]])  # fit on the features only, not the earlier cluster column
points['cluster'] = gmm_labels
import matplotlib.pyplot as plt
fig = plt.figure()
for i in range(len(points)):
plt.plot([points['x'][i]], [points['y'][i]], marker='o', color=cmap(points['cluster'][i]))
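One thing K-means cannot give us is a soft assignment, but the fitted mixture can. A quick sketch of the posterior responsibilities:
probs = gmm.predict_proba(points[["x", "y"]])  # one row per point, one column per component
print(probs[:5].round(3))  # rows are near one-hot for well-separated blobs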
The K-means clustering algorithm depends on the choice of the number of clusters to be searched and on a starting configuration. In contrast, hierarchical clustering methods do not require such specifications. Instead, they require a measure of dissimilarity between groups of observations, based on the pairwise dissimilarities among the observations in the two groups. Strategies for hierarchical clustering divide into two basic paradigms: agglomerative (bottom-up), which successively merges clusters, and divisive (top-down), which successively splits them. For a more detailed description, you can see this.
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 6))
plt.title("Customer Dendograms")
dend = shc.dendrogram(shc.linkage(points, method='ward'))
from sklearn.cluster import AgglomerativeClustering
hierarchical_fit = AgglomerativeClustering(n_clusters=3, linkage='ward')  # ward linkage uses Euclidean distance
hac_labels = hierarchical_fit.fit_predict(points[["x", "y"]])
points['cluster'] = hac_labels
import matplotlib.pyplot as plt
fig = plt.figure()
for i in range(len(points)):
plt.plot([points['x'][i]], [points['y'][i]], marker='o', color=cmap(points['cluster'][i]))
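A nice property of the hierarchy is that we can cut it at any level after the fact. Here is a sketch using scipy's fcluster to cut the same ward linkage into different numbers of clusters:
Z = shc.linkage(points[["x", "y"]], method='ward')
for k in [2, 3, 4]:
    labels_k = shc.fcluster(Z, t=k, criterion='maxclust')  # cut the tree into k clusters
    print(k, np.bincount(labels_k)[1:])  # cluster sizes (fcluster labels are 1-based)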
The Self-Organizing Maps (SOM) procedure can be viewed as a constrained version of K-means clustering, in which the prototypes are encouraged to lie on a low-dimensional grid. The original SOM algorithm was online: observations are presented one at a time, and the winning node and its grid neighbors are nudged toward each observation. Let's begin in Python! The package minisom can do this.
MiniSom requires the data in array form, not in a DataFrame:
points_dataset = points[["x", "y"]].to_numpy()  # features only, not the cluster column
from minisom import MiniSom
som = MiniSom(3, 1,                    # a 3x1 map: three nodes, one per expected cluster
              len(points_dataset[0]),  # input length: the number of features per point
              sigma=0.3,
              learning_rate=0.1)
som.random_weights_init(points_dataset)
som.train(points_dataset, 1000)
win_map = som.win_map(points_dataset)
import matplotlib.pyplot as plt
fig = plt.figure()
for i, (px, py) in enumerate(points_dataset):  # px, py so we do not shadow make_blobs' x, y
    fitted_ = som.winner(points_dataset[i])    # grid coordinates of the winning node
    cluster_number = (fitted_[0] * 1) + fitted_[1]  # flatten (row, col) on the 3x1 map to 0..2
    plt.plot([px], [py], marker='o', color=cmap(cluster_number))
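To sanity-check the fit, minisom exposes the learned prototypes and a quantization error, the average distance from each point to its winning prototype:
print(som.get_weights().reshape(-1, 2))        # one learned prototype per map node
print(som.quantization_error(points_dataset))  # lower means a tighter fit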
Traditional clustering methods like K-means are designed for compact, roughly convex clusters. Spectral clustering instead builds a similarity graph over the points and clusters the eigenvectors of its graph Laplacian, which lets it recover non-convex cluster shapes.
from sklearn.cluster import SpectralClustering
sc = SpectralClustering(n_clusters=3).fit(points[["x", "y"]])  # again, fit on the features only
sc_labels = sc.labels_
points['cluster'] = sc_labels
import matplotlib.pyplot as plt
fig = plt.figure()
for i in range(len(points)):
plt.plot([points['x'][i]], [points['y'][i]], marker='o', color=cmap(points['cluster'][i]))
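To see where this matters, here is a sketch on a non-convex toy set, two interleaving half-moons, where K-means tends to cut each moon in half while spectral clustering follows the shapes; the names xm, km_moons, and sp_moons are mine.
from sklearn.datasets import make_moons

xm, _ = make_moons(n_samples=200, noise=0.05, random_state=7)
km_moons = KMeans(n_clusters=2, random_state=7).fit_predict(xm)
sp_moons = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                              random_state=7).fit_predict(xm)  # graph built from nearest neighbors
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(xm[:, 0], xm[:, 1], c=km_moons); axes[0].set_title("K-means")
axes[1].scatter(xm[:, 0], xm[:, 1], c=sp_moons); axes[1].set_title("Spectral")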
Clustering is the most representative unsupervised learning task, so we usually have no ground-truth labels to score against. For clustering, there are therefore two types of evaluation criteria: internal criteria, such as the silhouette score, which use only the data and the fitted assignment, and external criteria, such as the adjusted Rand index, which compare the assignment to known labels when they exist. The silhouette score of a point compares its mean intra-cluster distance a with its mean distance b to the nearest other cluster, via (b - a) / max(a, b), averaged over all points.
Before computing the evaluation scores, we import the metrics module from sklearn:
from sklearn import metrics
Also, our dataset points is a pd.DataFrame with columns x, y, and cluster, as built above, so the silhouette score needs only the features and the fitted labels:
metrics.silhouette_score(points[["x", "y"]], np.ravel(points[["cluster"]]))
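For an external criterion, recall that make_blobs also returned the true labels y. Assuming y is still in scope, the adjusted Rand index compares our fitted assignment to the truth, where 1.0 means a perfect match up to relabeling:
metrics.adjusted_rand_score(y, points["cluster"])  # external: needs ground-truth labels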