💠 AIchemist 9th Session | Clustering

yellowsubmarine372 · November 26, 2023


01. Clustering Concepts

๊ตฐ์ง‘ํ™”๋Š” ๋น„์ง€๋„ ํ•™์Šต์— ์†ํ•œ๋‹ค.

Clustering

  • Groups similar samples together into a single cluster
  • Combined with dimensionality reduction, it can still retain enough information for analysis
  • Can also be used for outlier detection!

Hard clustering vs. soft clustering

Soft clustering expresses the degree to which an object may belong to each cluster as a weight or probability, whereas hard clustering assigns each object to exactly one cluster.
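As a minimal, made-up illustration of the difference (the tiny X_small array below is not from the post), KMeans returns one hard label per sample, while GaussianMixture's predict_proba() returns a membership probability per cluster:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Tiny illustrative data set: two loose groups of 2-D points
X_small = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                    [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

# Hard clustering: every sample gets exactly one cluster label
hard_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X_small)
print(hard_labels)

# Soft clustering: every sample gets a membership probability per cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_small)
print(gmm.predict_proba(X_small).round(3))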

02. Understanding the K-Means Algorithm

  1. The most commonly used clustering algorithm
  2. Picks arbitrary points called cluster centroids and groups the points that are closest to each centroid
  3. Each centroid then moves to the mean position of the points assigned to it, the nearest points are selected again from the moved centroid, and this process of selecting points and moving the centroid to their mean is repeated
  4. The goal is to raise the similarity within each cluster and lower the similarity between clusters, which K-means pursues by keeping each point as close as possible to its own cluster centroid while the clusters stay well separated

(1) Decide how many clusters to form.

(2) Choose that many centroids (any values you like can be used as the initial centroids).

(3) For each point, find the nearest centroid.

(4) Move each centroid to the mean position of the points mapped to it.

(5) From the moved centroids, map each point to its nearest centroid again and move the centroids to the new means.

(6) Repeat this process until the mapping (and therefore the centroids) no longer changes.

KMeans parameters (the examples below use n_clusters, init='k-means++', max_iter and random_state)

kํ‰๊ท ์„ ์ด์šฉํ•œ ๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ ์„ธํŠธ ๊ตฐ์ง‘ํ™”

Check how each sample's cluster is determined according to the sepal and petal lengths, and compare the result with the classification (target) values.

  • The initial centroid placement method is k-means++
  • The target column holds the original class labels, while the cluster column stores kmeans.labels_, the cluster assigned to each sample
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
irisDF = pd.DataFrame(data=iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, random_state=0)
kmeans.fit(irisDF)

irisDF['target'] = iris.target
irisDF['cluster'] = kmeans.labels_
iris_result = irisDF.groupby(['target','cluster'])['sepal_length'].count()
print(iris_result)

๊ตฐ์ง‘ํ™” ์‹œ๊ฐํ™”

  • Use PCA to reduce the 4 attributes to 2 dimensions, then plot each sample with the two components as its x and y coordinates
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
pca_transformed = pca.fit_transform(iris.data)

irisDF['pca_x'] = pca_transformed[:,0]
irisDF['pca_y'] = pca_transformed[:,1]
irisDF.head(3)

# ๊ตฐ์ง‘ ๊ฐ’์ด 0, 1, 2์ธ ๊ฒฝ์šฐ๋งˆ๋‹ค ๋ณ„๋„์˜ ์ธ๋ฑ์Šค๋กœ ์ถ”์ถœ
marker0_ind = irisDF[irisDF['cluster']==0].index
marker1_ind = irisDF[irisDF['cluster']==1].index
marker2_ind = irisDF[irisDF['cluster']==2].index

# Use the indices of cluster values 0, 1 and 2 to extract each cluster's pca_x and pca_y values, and plot them with the markers o, s and ^
plt.scatter(x=irisDF.loc[marker0_ind, 'pca_x'], y=irisDF.loc[marker0_ind, 'pca_y'], marker='o')
plt.scatter(x=irisDF.loc[marker1_ind, 'pca_x'], y=irisDF.loc[marker1_ind, 'pca_y'], marker='s')
plt.scatter(x=irisDF.loc[marker2_ind, 'pca_x'], y=irisDF.loc[marker2_ind, 'pca_y'], marker='^')

plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title('3 Clusters Visualization by 2 PCA Components')
plt.show()

Generating Data to Test Clustering Algorithms

Data generators for clustering
  • make_blobs(): adds control over the center point and standard deviation of each cluster; the feature data set and the target data set are returned as a tuple
  • make_classification(): useful for creating data that contains noise

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
%matplotlib inline

X, y = make_blobs(n_samples=200, n_features=2, centers=3, cluster_std=0.8, random_state=0)
print(X.shape, y.shape)

# Check the distribution of the y target values
unique, counts = np.unique(y, return_counts=True)
print(unique,counts)
  • n_samples: the total number of samples to generate
  • n_features: the number of features per sample
  • centers: the number of clusters to generate

ํ”ผ์ฒ˜ ๋ฐ์ดํ„ฐ ์„ธํŠธ๊ฐ€ ์–ด๋– ํ•œ ๊ตฐ์ง‘ํ™” ๋ถ„ํฌ๋ฅผ ๊ฐ€์ง€๊ณ  ๋งŒ๋“ค์–ด์กŒ๋Š”์ง€ ํ™•์ธ

import pandas as pd

# Put the make_blobs() output into a DataFrame with feature columns ftr1, ftr2 and the target
clusterDF = pd.DataFrame(data=X, columns=['ftr1', 'ftr2'])
clusterDF['target'] = y

target_list = np.unique(y)
# Marker used for each target's scatter plot.
markers=['o', 's', '^', 'P', 'D', 'H', 'x']
# The data set was generated with 3 cluster regions, so target_list is [0, 1, 2]
# Draw a scatter plot per marker for target==0, target==1, target==2.
for target in target_list:
    target_cluster = clusterDF[clusterDF['target']==target]
    plt.scatter(x=target_cluster['ftr1'], y=target_cluster['ftr2'], edgecolor='k',
                marker=markers[target] )

plt.show()

Visualize each cluster after running KMeans clustering

  • cluster_centers_ gives the coordinates of the center of each cluster
# Perform K-Means clustering on the X data with a KMeans object
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=200, random_state=0)
cluster_labels = kmeans.fit_predict(X)
clusterDF['kmeans_label']  = cluster_labels

# cluster_centers_ holds the center coordinates of each cluster; extract them for visualization
centers = kmeans.cluster_centers_
unique_labels = np.unique(cluster_labels)
markers=['o', 's', '^', 'P','D','H','x']

# ๊ตฐ์ง‘๋œ label ์œ ํ˜•๋ณ„๋กœ iteration ํ•˜๋ฉด์„œ marker ๋ณ„๋กœ scatter plot ์ˆ˜ํ–‰. 
for label in unique_labels:
    label_cluster = clusterDF[clusterDF['kmeans_label']==label]
    center_x_y = centers[label]
    plt.scatter(x=label_cluster['ftr1'], y=label_cluster['ftr2'], edgecolor='k', 
                marker=markers[label] )
    
    # Visualize the center coordinates of each cluster
    plt.scatter(x=center_x_y[0], y=center_x_y[1], s=200, color='white',
                alpha=0.9, edgecolor='k', marker=markers[label])
    plt.scatter(x=center_x_y[0], y=center_x_y[1], s=70, color='k', edgecolor='k', 
                marker='$%d$' % label)

plt.show()

The smaller cluster_std is, the more tightly the data gathers around the cluster centers; the larger it is, the more the data spreads out.

03. Cluster Evaluation

Most data sets used for clustering do not carry target labels. Clustering may look similar to classification, but its nature is different: even samples with different classification values can belong to broader, cluster-level regions.

Silhouette analysis is the representative method for evaluating clustering performance.

Silhouette Analysis

์‹ค๋ฃจ์—ฃ ๋ถ„์„์€ ๊ฐ ๊ตฐ์ง‘ ๊ฐ„์˜ ๊ฑฐ๋ฆฌ๊ฐ€ ์–ผ๋งˆ๋‚˜ ํšจ์œจ์ ์œผ๋กœ ๋ถ„๋ฆฌ๋ผ ์žˆ๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋ƒ„
์‹ค๋ฃจ์—ฃ ๋ถ„์„์€ ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์‹œํ–‰
ํ•ด๋‹น ๋ฐ์ดํ„ฐ๊ฐ€ ๊ฐ™์€ ๊ตฐ์ง‘ ๋‚ด์˜ ๋ฐ์ดํ„ฐ์™€ ์–ผ๋งˆ๋‚˜ ๊ฐ€๊น๊ฒŒ ๊ตฐ์ง‘ํ™”๋ผ ์žˆ๊ณ , ๋‹ค๋ฅธ ๊ตฐ์ง‘์— ์žˆ๋Š” ๋ฐ์ดํ„ฐ์™€๋Š” ์–ผ๋งˆ๋‚˜ ๋ฉ€๋ฆฌ ๋ถ„๋ฆฌ๋ผ ์žˆ๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์ง€ํ‘œ

์‹ค๋ฃจ์—ฃ ๋ถ„์„ ๋ฉ”์„œ๋“œ
.silhouette_samples
๊ฐ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ์˜ ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜๋ฅผ ๊ณ„์‚ฐ ํ•ด ๋ฐ˜ํ™˜
.silhouette_score
์ „์ฒด ๋ฐ์ดํ„ฐ ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜ ๊ฐ’์„ ํ‰๊ท ํ•ด ๋ฐ˜ํ™˜

+ Per-cluster mean silhouette coefficient = .groupby('cluster')['silhouette_coeff'].mean()

์ข‹์€ ๊ตฐ์ง‘ํ™” ๋งŒ์กฑ ๊ธฐ์ค€

  • ์ „์ฒด ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜์˜ ํ‰๊ท ๊ฐ’์€ 1์— ๊ฐ€๊นŒ์šธ ์ˆ˜๋ก ์ข‹์Œ
  • ๊ฐœ๋ณ„ ๊ตฐ์ง‘์˜ ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜ ํ‰๊ท ๊ฐ’์ด ์ „์ฒด ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜์˜ ํ‰๊ท ๊ฐ’์—์„œ ํฌ๊ฒŒ ๋ฒ—์–ด๋‚˜์ง€ ์•Š๋Š”๊ฒŒ ์ค‘์š”

๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ด์šฉํ•œ ๊ตฐ์ง‘ ํ‰๊ฐ€

from sklearn.preprocessing import scale
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
# APIs for computing the silhouette analysis metrics
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

iris = load_iris()
feature_names = ['sepal_length','sepal_width','petal_length','petal_width']
irisDF = pd.DataFrame(data=iris.data, columns=feature_names)
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300,random_state=0).fit(irisDF)

irisDF['cluster'] = kmeans.labels_

# iris ์˜ ๋ชจ๋“  ๊ฐœ๋ณ„ ๋ฐ์ดํ„ฐ์— ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜๊ฐ’์„ ๊ตฌํ•จ. 
score_samples = silhouette_samples(iris.data, irisDF['cluster'])
print('silhouette_samples( ) return ๊ฐ’์˜ shape' , score_samples.shape)

# Add the silhouette coefficients as a column of irisDF
irisDF['silhouette_coeff'] = score_samples

# ๋ชจ๋“  ๋ฐ์ดํ„ฐ์˜ ํ‰๊ท  ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜๊ฐ’์„ ๊ตฌํ•จ. 
average_score = silhouette_score(iris.data, irisDF['cluster'])
print('๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ์…‹ Silhouette Analysis Score:{0:.3f}'.format(average_score))

irisDF.head(3)

The overall mean silhouette coefficient is rather low, whereas the silhouette coefficients of cluster 1 are high.

๊ตฐ์ง‘๋ณ„ ํ‰๊ท  ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜์˜ ์‹œ๊ฐํ™”๋ฅผ ํ†ตํ•œ ๊ตฐ์ง‘ ๊ฐœ์ˆ˜ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•

The average can still come out high when the silhouette coefficients are very high only inside one particular cluster, even though the data inside the other clusters is so far apart that their silhouette coefficients are low.

  • ๊ตฐ์ง‘์˜ ๊ฐœ์ˆ˜ = 2

1๋ฒˆ ๊ตฐ์ง‘์˜ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋Š” ํ‰๊ท  ์‹ค๋ฃจ์—ฃ ๊ณ„์ˆ˜ ๊ฐ’ ์ด์ƒ์ด์ง€๋งŒ, 2๋ฒˆ ๊ตฐ์ง‘์˜ ๊ฒฝ์šฐ๋Š” ํ‰๊ท  ๋ณด๋‹ค ์ ์€ ๋ฐ์ดํ„ฐ ๊ฐ’์ด ๋งค์šฐ ๋งŽ์Œ

  • ๊ตฐ์ง‘์˜ ๊ฐœ์ˆ˜ = 3

0๋ฒˆ์˜ ๊ฒฝ์šฐ ๋ชจ๋‘ ํ‰๊ท ๋ณด๋‹ค ๋‚ฎ์Œ. 0๋ฒˆ์˜ ๋‚ด๋ถ€ ๋ฐ์ดํ„ฐ ๊ฐ„ ๊ฑฐ๋ฆฌ๋„ ๋ฉ€์ง€๋งŒ 2๋ฒˆ ๊ตฐ์ง‘๊ณผ๋„ ๊ฐ€๊น๊ฒŒ ์œ„์น˜

  • ๊ตฐ์ง‘์˜ ๊ฐœ์ˆ˜ = 4

4๊ฐœ์ธ ๊ฒฝ์šฐ๊ฐ€ ๊ฐ€์žฅ ์ด์ƒ์ ์ธ ๊ตฐ์ง‘ํ™” ๊ฐœ์ˆ˜๋กœ ํŒ๋‹จ ๊ฐ€๋Šฅ



▶︎ Evaluating k-means clusters through the silhouette coefficient is intuitive and easy to understand, but because each sample's distance to all other samples must be computed repeatedly, the running time increases sharply as the amount of data grows.



04. Mean Shift

ํ‰๊ท ์ด๋™์€ k-ํ‰๊ท ๊ณผ ์œ ์‚ฌํ•˜๊ฒŒ ์ค‘์‹ฌ์„ ๊ตฐ์ง‘์˜ ์ค‘์‹ฌ์œผ๋กœ ์ง€์†์ ์œผ๋กœ ์›€์ง์ด๋ฉด์„œ ๊ตฐ์ง‘ํ™”๋ฅผ ์ˆ˜ํ–‰. k-ํ‰๊ท ์ด ์ค‘์‹ฌ์— ์†Œ์†๋œ ๋ฐ์ดํ„ฐ์˜ ํ‰๊ท  ๊ฑฐ๋ฆฌ ์ค‘์‹ฌ์œผ๋กœ ์ด๋™ํ•˜๋Š” ๋ฐ ๋ฐ˜ํ•ด, ํ‰๊ท  ์ด๋™์€ ์ค‘์‹ฌ์„ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ชจ์—ฌ์žˆ๋Š” ๋ฐ€๋„๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ๊ณณ์œผ๋กœ ์ด๋™์‹œํ‚ด

ํ™•๋ฅ  ๋ฐ€๋„ํ•จ์ˆ˜๊ฐ€ ํ”ผํฌ์ธ ์ ์„ ๊ตฐ์ง‘ ์ค‘์‹ฌ์ ์œผ๋กœ ์„ ์ • >> KDE๋ฅผ ์ด์šฉํ•ด ํ™•๋ฅ  ๋ฐ€๋„ ํ•จ์ˆ˜๋ฅผ ์ฐพ๋Š”๋‹ค.

KDE (Kernel Density Estimation) is a representative way of estimating the probability density function of a variable with kernel functions: a kernel function is applied to each individual observation, the resulting values are all summed, and the sum is divided by the number of observations to estimate the probability density function.

The KDE estimate is \hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right), where K is the kernel function, x is the value of the random variable, x_i are the observed values, and h is the bandwidth.

  • ๋Œ€์—ญํญ h (bandwidth)

KDE ํ˜•ํƒœ๋ฅผ ๋ถ€๋“œ๋Ÿฌ์šด ํ˜•ํƒœ๋กœ ํ‰ํ™œํ™” ํ•˜๋Š”๋ฐ ์ ์šฉ

์ž‘์€ h๊ฐ’์€ ์ข๊ณ  ๋พฐ์กฑํ•œ KDE๋ฅผ ๊ฐ€์ง€๊ฒŒ ๋˜๋ฉฐ, ๊ณผ์ ํ•ฉ ๋˜๊ธฐ ์‰ฌ์›€. ํฐ h๊ฐ’์€ ๊ณผ๋„ํ•˜๊ฒŒ ํ‰ํ™œํ™”๋œ KDE๋กœ ์ธํ•ด ์ง€๋‚˜์น˜๊ฒŒ ๋‹จ์ˆœํ™”๋œ ๋ฐฉ์‹์œผ๋กœ ํ™•๋ฅ  ๋ฐ€๋„ํ•จ์ˆ˜๋ฅผ ์ถ”์ •ํ•˜๋ฉฐ ๊ณผ์†Œ์ ํ•ฉ ๋˜๊ธฐ ์‰ฌ์›€

MeanShift ํด๋ž˜์Šค์˜ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ์ดˆ๊ธฐํ™” ํŒŒ๋ผ๋ฏธํ„ฐ bandwidth

ํ‰๊ท  ์ด๋™ ๊ตฐ์ง‘ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ˆ์ œ

import numpy as np
from sklearn.cluster import MeanShift

# X: the make_blobs() feature data created in the earlier example
meanshift = MeanShift(bandwidth=1)
cluster_labels = meanshift.fit_predict(X)
print('cluster label types:', np.unique(cluster_labels))
[Output]
cluster label types: [0 1 2]

You can see that bandwidth has a large effect. scikit-learn provides the estimate_bandwidth() function to find an optimized bandwidth value.

from sklearn.cluster import estimate_bandwidth

bandwidth = estimate_bandwidth(X)
print('bandwidth value:', round(bandwidth, 3))
  • Visualizing the 3 clusters (a sketch of this follows below)

ํ‰๊ท ์ด๋™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ˆ˜ํ–‰ ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๊ณ  ๋ฌด์—‡๋ณด๋‹ค๋„ bandwidth ํฌ๊ธฐ์— ๋”ฐ๋ฅธ ๊ตฐ์ง‘ํ™” ์˜ํ–ฅ๋„๊ฐ€ ๋งค์šฐ ํผ. ๋ถ„์„ ์—…๋ฌด ๊ธฐ๋ฐ˜์˜ ๋ฐ์ดํ„ฐ ์„ธํŠธ ๋ณด๋‹ค๋Š” ์ปดํ“จํ„ฐ ๋น„์ „ ์˜์—ญ์—์„œ ๋” ๋งŽ์ด ์‚ฌ์šฉ๋จ.

05. GMM(Gaussian Mixture Model)

GMM clustering performs clustering under the assumption that the data to be clustered was generated by mixing several data sets that each follow a Gaussian distribution (each cluster of data forms its own Gaussian distribution).

Clustering is performed based on these different normal distributions.

GMM์€ ๊ฐœ๋ณ„ ์ •๊ทœ ๋ถ„ํฌ์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ/ ๊ฐ ๋ฐ์ดํ„ฐ๊ฐ€ ์–ด๋–ค ์ •๊ทœ ๋ถ„ํฌ์— ํ•ด๋‹น๋˜๋Š” ์ง€์˜ ํ™•๋ฅ  ์„ ์ถ”์ •ํ•ด ๋ฐ˜ํ™˜

GMM์„ ์ด์šฉํ•œ ๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ ์„ธํŠธ ๊ตฐ์ง‘ํ™”

  • Initialization parameter n_components: the total number of Gaussian mixture models (i.e. clusters)
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=0).fit(iris.data)
gmm_cluster_labels = gmm.predict(iris.data)

# Store the clustering result in the 'gmm_cluster' column of irisDF
irisDF['gmm_cluster'] = gmm_cluster_labels
irisDF['target'] = iris.target

# Check how the gmm_cluster values are mapped for each target value.
iris_result = irisDF.groupby(['target'])['gmm_cluster'].value_counts()
print(iris_result)

Comparing GMM and K-Means

Use make_blobs() and a transformation (a matrix dot product) to stretch the clustered blob data into an elongated elliptical shape.

import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs

# Create 300 samples in 3 clusters with cluster_std=0.5 using make_blobs().
X, y = make_blobs(n_samples=300, n_features=2, centers=3, cluster_std=0.5, random_state=0)

# Transform the data set into elongated, stretched ellipses.
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X_aniso = np.dot(X, transformation)

# Store the transformed feature data and the make_blobs() y values in a DataFrame.
clusterDF = pd.DataFrame(data=X_aniso, columns=['ftr1', 'ftr2'])
clusterDF['target'] = y

# Visualize the generated data set with a different marker per target value
# (visualize_cluster_plot() is a custom helper; see the sketch below).
visualize_cluster_plot(None, clusterDF, 'target', iscenter=False)
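visualize_cluster_plot() is used throughout the rest of this post but is not defined in it (it comes from the study material). A minimal sketch of what such a helper might look like, assuming a DataFrame with 'ftr1'/'ftr2' feature columns, a label column, and optionally a fitted estimator that exposes cluster_centers_:

import numpy as np
import matplotlib.pyplot as plt

def visualize_cluster_plot(clusterobj, dataframe, label_name, iscenter=True):
    # Cluster centers are only available for estimators such as KMeans / MeanShift
    centers = clusterobj.cluster_centers_ if (iscenter and clusterobj is not None) else None
    markers = ['o', 's', '^', 'P', 'D', 'H', 'x']

    for i, label in enumerate(np.unique(dataframe[label_name].values)):
        label_df = dataframe[dataframe[label_name] == label]
        # DBSCAN marks noise points with label -1; draw them with a distinct marker
        marker = 'X' if label == -1 else markers[i % len(markers)]
        plt.scatter(x=label_df['ftr1'], y=label_df['ftr2'], edgecolor='k', marker=marker)

        if centers is not None and label != -1:
            center = centers[label]
            plt.scatter(x=center[0], y=center[1], s=200, color='white',
                        alpha=0.9, edgecolor='k', marker=marker)
            plt.scatter(x=center[0], y=center[1], s=70, color='k',
                        edgecolor='k', marker='$%d$' % label)
    plt.show()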
  • KMeans clustering

When clustering with KMeans, each cluster forms around a roughly circular region based on plain distance, so the clusters are not organized along the stretched direction of the data.

  • GMM clustering

You can confirm that the data is clustered accurately along the direction in which it is distributed (a sketch of producing both results follows below).

06. DBSCAN (Density Based Spatial Clustering of Applications With Noise)

  • Main parameters

Epsilon neighborhood (eps): a circular region of radius epsilon around each individual sample
Minimum number of points (min points): the number of other samples contained in a sample's epsilon neighborhood

  • ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ

ํ•ต์‹ฌ ํฌ์ธํŠธ(Core Point): ์ฃผ๋ณ€ ์˜์—ญ ๋‚ด์— ์ตœ์†Œ ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜ ์ด์ƒ์˜ ํƒ€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์„ ๊ฒฝ์šฐ
์ด์›ƒ ํฌ์ธํŠธ (Neighbor Point): ์ฃผ๋ณ€ ์˜์—ญ ๋‚ด์— ์œ„์น˜ํ•œ ํƒ€ ๋ฐ์ดํ„ฐ
๊ฒฝ๊ณ„ ํฌ์ธํŠธ (Border Point): ์ฃผ๋ณ€ ์˜์—ญ ๋‚ด์— ์ตœ์†Œ ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜ ์ด์ƒ์˜ ์ด์›ƒ ํฌ์ธํŠธ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š์ง€๋งŒ ํ•ต์‹ฌ ํฌ์ธํŠธ๋ฅผ ์ด์›ƒ ํฌ์ธํŠธ๋กœ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๋ฐ์ดํ„ฐ
์žก์Œ ํฌ์ธํŠธ (Noise Point): ์ตœ์†Œ ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜ ์ด์ƒ์˜ ์ด์›ƒ ํฌ์ธํŠธ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š์œผ๋ฉฐ, ํ•ต์‹ฌ ํฌ์ธํŠธ๋„ ์ด์›ƒ ํฌ์ธํŠธ๋กœ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š๋Š” ๋ฐ์ดํ„ฐ

The DBSCAN Process

Starting from a core point, DBSCAN connects the core points that fall inside each other's epsilon neighborhoods and grows a cluster along these density-connected core points; border points are attached to the cluster of a neighboring core point, while noise points are left out of every cluster (scikit-learn labels them -1).

DBSCAN ์ ์šฉํ•˜๊ธฐ - ๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ ์„ธํŠธ

  1. Cluster with (eps=0.6, min_samples=8)
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.6, min_samples=8, metric='euclidean')
dbscan_labels = dbscan.fit_predict(iris.data)

irisDF['dbscan_cluster'] = dbscan_labels
irisDF['target'] = iris.target

iris_result = irisDF.groupby(['target'])['dbscan_cluster'].value_counts()
print(iris_result)
  2. Compress to 2 features with PCA, then visualize
from sklearn.decomposition import PCA
# Transform the feature data set with PCA n_components=2 so it can be visualized in 2 dimensions
pca = PCA(n_components=2, random_state=0)
pca_transformed = pca.fit_transform(iris.data)
# visualize_cluster_plot() plots the ftr1 and ftr2 columns as coordinates, so store the PCA-transformed values in those columns
irisDF['ftr1'] = pca_transformed[:,0]
irisDF['ftr2'] = pca_transformed[:,1]

visualize_cluster_plot(dbscan, irisDF, 'dbscan_cluster', iscenter=False)
  3. Cluster with (eps=0.8, min_samples=8)
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.8, min_samples=8, metric='euclidean')
dbscan_labels = dbscan.fit_predict(iris.data)

irisDF['dbscan_cluster'] = dbscan_labels
irisDF['target'] = iris.target

iris_result = irisDF.groupby(['target'])['dbscan_cluster'].value_counts()
print(iris_result)

visualize_cluster_plot(dbscan, irisDF, 'dbscan_cluster', iscenter=False)

  4. Cluster with (eps=0.6, min_samples=16)
dbscan = DBSCAN(eps=0.6, min_samples=16, metric='euclidean')
dbscan_labels = dbscan.fit_predict(iris.data)

irisDF['dbscan_cluster'] = dbscan_labels
irisDF['target'] = iris.target

iris_result = irisDF.groupby(['target'])['dbscan_cluster'].value_counts()
print(iris_result)

visualize_cluster_plot(dbscan, irisDF, 'dbscan_cluster', iscenter=False)

DBSCAN ์ ์šฉํ•˜๊ธฐ - make_circles() ๋ฐ์ดํ„ฐ ์„ธํŠธ

  • k-means

  • GMM

  • DBSCAN
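The three bullets above refer to plots comparing the algorithms on make_circles() data (the plots themselves are not reproduced here). A self-contained sketch of that comparison, with illustrative parameter values; only DBSCAN is expected to separate the two rings:

import pandas as pd
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# Two concentric circles with a little noise
X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=0)
circleDF = pd.DataFrame(data=X, columns=['ftr1', 'ftr2'])
circleDF['target'] = y

# k-means: cuts the data with a roughly straight boundary, mixing the two rings
circleDF['kmeans_cluster'] = KMeans(n_clusters=2, max_iter=1000, random_state=0).fit_predict(X)

# GMM: also struggles, because a ring is not Gaussian-shaped
circleDF['gmm_cluster'] = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# DBSCAN: density-based, so each ring becomes its own cluster
circleDF['dbscan_cluster'] = DBSCAN(eps=0.2, min_samples=10, metric='euclidean').fit_predict(X)

for col in ['kmeans_cluster', 'gmm_cluster', 'dbscan_cluster']:
    print(circleDF.groupby('target')[col].value_counts(), '\n')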