[혼자 공부하는 머신러닝+딥러닝 (Self-Study Machine Learning + Deep Learning)] #4 K-means, PCA

Clay Ryu's sound lab · February 4, 2022

Note for 2022

Lecture 15. Understanding how the k-means algorithm works and building an unsupervised learning model

Clustering algorithms

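k-means: finds the center of each cluster (here, the mean of the pixel values).
Assuming there are k clusters, the algorithm repeatedly recomputes k centers (centroids) and moves them toward their best positions.
Because the K-means algorithm is based on computing distances, it works best on datasets whose clusters are roughly round.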
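The loop described above can be sketched in plain NumPy. This is only a minimal illustration under simple assumptions (random initial centers, a fixed number of steps, synthetic 2-D blobs instead of the fruit images); `kmeans_sketch` is a hypothetical helper name, not part of scikit-learn, and `KMeans` itself is more careful about initialization and stopping.

```python
import numpy as np

def kmeans_sketch(X, k, n_steps=10, seed=42):
    """Hypothetical helper: a bare-bones k-means loop for illustration only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_steps):
        # Distance from every sample to every center, shape (n_samples, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Assign each sample to its nearest center.
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its samples (keep it if the cluster is empty).
        centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
    return centers, labels

# Synthetic data: two well-separated round blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans_sketch(X, k=2)
print(np.sort(centers[:, 0]).round(1))
```

On blobs like these the centers settle near the two blob means within a few steps, which is the behavior the round-cluster assumption relies on.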

Training the model

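import numpy as np
from sklearn.cluster import KMeans

# Assumes fruits_2d is the (300, 10000) array of flattened 100x100 fruit images.
# The initial centers are chosen randomly.
# n_init (the number of runs with different initial centers) defaults to 10 in older scikit-learn releases; newer ones use 'auto'.
km = KMeans(n_clusters=3, random_state=42)
km.fit(fruits_2d)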

print(km.labels_)
# The cluster label assigned to each of the 300 samples

print(np.unique(km.labels_, return_counts=True))
# (array([0,1,2], dtype=int32), array([91,98,111]))
# These counts alone cannot tell us the accuracy.

Viewing the clusters

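import matplotlib.pyplot as plt

def draw_fruits(arr, ratio=1):
    n = len(arr)  # number of images to draw
    # Draw 10 images per row, so compute how many rows are needed.
    rows = int(np.ceil(n/10))
    # With a single row, use exactly as many columns as there are images.
    cols = n if rows < 2 else 10
    fig, axs = plt.subplots(rows, cols, figsize=(cols*ratio, rows*ratio), squeeze=False)
    for i in range(rows):
        for j in range(cols):
            if i*10 + j < n:  # draw only while images remain
                axs[i, j].imshow(arr[i*10 + j], cmap='gray_r')
            axs[i, j].axis('off')
    plt.show()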
    
draw_fruits(fruits[km.labels_==0])
draw_fruits(fruits[km.labels_==1])
draw_fruits(fruits[km.labels_==2])

Cluster centers

draw_fruits(km.cluster_centers_.reshape(-1,100,100), ratio=3)

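print(km.transform(fruits_2d[100:101]))
# [[5267.7043 8837.3775 3393.8136]]
# Distances to the three cluster centers. The smallest is to cluster 2, so the 101st fruit is closest to the pineapple cluster.

print(km.predict(fruits_2d[100:101]))
# [2]

draw_fruits(fruits[100:101])
# It is a pineapple.

print(km.n_iter_)
# Shows that the algorithm ran for 3 iterations.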
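To make the relationship between `transform` and `predict` concrete, here is a small sketch on synthetic data (an assumption; the fruit arrays are not reproduced here). It checks that `transform` returns the Euclidean distance to each cluster center and that `predict` picks the nearest one. `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: two round blobs in 2-D.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

sample = X[:1]  # slicing keeps the 2-D shape (1, n_features)
# Euclidean distance from the sample to each cluster center, computed by hand.
manual = np.linalg.norm(sample[:, None, :] - km.cluster_centers_[None, :, :], axis=2)

same_distances = np.allclose(manual, km.transform(sample))
same_label = km.predict(sample)[0] == manual.argmin()
print(same_distances, same_label)
```

Slicing with `[100:101]` in the notes serves the same purpose as `X[:1]` here: both methods expect a 2-D array even for a single sample.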

Finding the optimal k

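Elbow method: inertia is the sum of squared distances from each sample to its cluster center, so a lower value means the samples are packed more tightly around their centers. Naturally, inertia decreases as k increases. However, the point where the curve bends and the decrease levels off (the elbow) is likely to be a good choice for k.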

inertia = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(fruits_2d)
    inertia.append(km.inertia_)

# Plot inertia against k and look for the bend.
plt.plot(range(2, 7), inertia)
plt.show()
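As a sanity check on what inertia measures, the sketch below recomputes it by hand and compares it with `inertia_`. It uses synthetic data with three round blobs (an assumption made so the elbow is known in advance), not the fruit images.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three round blobs centered at 0, 5 and 10.
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (0, 5, 10)])

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

# Recompute the inertia of the last model (k=6) by hand:
# the sum of squared distances from each sample to its assigned center.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)

print(inertias)
print(np.isclose(manual_inertia, km.inertia_))
```

With three true blobs, inertia drops sharply up to k=3 and only slightly afterwards, which is the bend the elbow method looks for.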

Lecture 16. Principal component analysis: building a PCA model, a dimensionality reduction algorithm

Dimensionality reduction

If samples are rows and features are columns, dimensionality reduction means reducing the number of features.

Principal components

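If two features are expressed as a single value along one vector, the principal component, this can be seen as dimensionality reduction. For example, the 2-D point (4, 2) can be represented by the single value 4.5.
Each subsequent principal component is found in a direction perpendicular to the previous ones. In this way, n features can have up to n principal components.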
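The (4, 2) example can be worked through as a projection. The direction of the principal component is not given in the notes, so the sketch below assumes it points along (2, 1); with that assumption the projected value comes out at roughly 4.5.

```python
import numpy as np

point = np.array([4.0, 2.0])
# Assumed direction of the first principal component (not given in the text).
direction = np.array([2.0, 1.0])
unit = direction / np.linalg.norm(direction)

# Coordinate of the point along the principal component: a single number.
projected = point @ unit
print(round(float(projected), 2))

# A second component must be perpendicular to the first: their dot product is 0.
second = np.array([-1.0, 2.0]) / np.sqrt(5)
print(np.isclose(unit @ second, 0.0))
```

The two features collapse into one coordinate along the component, and the perpendicular direction is where a second component would be found.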

PCA

from sklearn.decomposition import PCA

# n_components is the number of principal components to find; the data has 10,000 dimensions.
pca = PCA(n_components=50)
pca.fit(fruits_2d)

print(pca.components_.shape) # (50, 10000)
# 50 principal components, each with 10,000 features.

draw_fruits(pca.components_.reshape(-1,100,100))

print(fruits_2d.shape) # (300, 10000)

# The 10,000 features are reduced to 50 features.
fruits_pca = pca.transform(fruits_2d)
print(fruits_pca.shape) # (300, 50)

Reconstruction

Because the reduced features are the ones that best capture the variance of the data, the original data can be restored from them reasonably well.

fruits_inverse = pca.inverse_transform(fruits_pca)
print(fruits_inverse.shape) # (300, 10000)

fruits_reconstruct = fruits_inverse.reshape(-1,100,100)
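How well the restoration works can be quantified with a reconstruction error. The sketch below uses synthetic data driven by 5 hidden factors (an assumption standing in for the fruit images) and also shows that, without whitening, `inverse_transform` amounts to multiplying by `components_` and adding back `mean_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 300 samples, 100 features, driven mostly by 5 hidden factors.
hidden = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 100))
X = hidden @ mixing + 0.1 * rng.normal(size=(300, 100))

pca = PCA(n_components=5).fit(X)
X_reduced = pca.transform(X)                    # shape (300, 5)
X_restored = pca.inverse_transform(X_reduced)   # shape (300, 100)

# The same restoration written out by hand.
manual = X_reduced @ pca.components_ + pca.mean_
print(np.allclose(manual, X_restored))

# Relative reconstruction error: small when the kept components hold most of the variance.
rel_error = np.linalg.norm(X - X_restored) / np.linalg.norm(X)
print(rel_error)
```

The error is small here because 5 components match the 5 hidden factors; with fewer components it would grow.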

Explained variance

print(np.sum(pca.explained_variance_ratio_)) # 0.921501
# The 50 components preserve about 92% of the variance.

# Variance ratio explained by each component, in order.
plt.plot(pca.explained_variance_ratio_)
plt.show()
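The per-component ratios can also be accumulated to see how many components are needed to reach a target share of the variance. This is a sketch on synthetic data (an assumption), with 90% chosen arbitrarily as the target.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 20 features whose spread decreases from 5 down to 0.1.
X = rng.normal(size=(200, 20)) * np.linspace(5, 0.1, 20)

pca = PCA().fit(X)  # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%.
n_needed = int(np.searchsorted(cumulative, 0.9) + 1)
print(n_needed, cumulative[n_needed - 1])
```

Plotting `cumulative` instead of the raw ratios gives a curve that rises toward 1, which makes the cutoff easier to read.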

Using PCA with a classifier

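from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

lr = LogisticRegression()
# Class labels: 100 samples each of classes 0, 1 and 2.
target = np.array([0]*100+[1]*100+[2]*100)

# Cross-validation on the original 10,000 features.
scores = cross_validate(lr, fruits_2d, target)
print(np.mean(scores['test_score'])) # 0.9966
print(np.mean(scores['fit_time'])) # 1.8380

# Cross-validation on the 50 PCA features: similar accuracy, much faster training.
scores = cross_validate(lr, fruits_pca, target)
print(np.mean(scores['test_score'])) # 1.0
print(np.mean(scores['fit_time'])) # 0.03938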

pca = PCA(n_components=0.5) # keep just enough components to explain 50% of the variance
pca.fit(fruits_2d)
print(pca.n_components_) # 2

fruits_pca = pca.transform(fruits_2d)
print(fruits_pca.shape) # (300,2)

scores = cross_validate(lr, fruits_pca, target)
print(np.mean(scores['test_score'])) # 0.9933
print(np.mean(scores['fit_time'])) # 0.048157
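In the runs above PCA is fit on all samples before cross-validation. One possible variation, sketched here on synthetic data (an assumption; it is not what the notes do), is to wrap PCA and the classifier in a pipeline so that PCA is refit on the training part of each fold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Synthetic stand-in for the fruit data: 3 classes, 100 samples each, 50 features.
target = np.array([0] * 100 + [1] * 100 + [2] * 100)
X = rng.normal(size=(300, 50)) + target[:, None] * 3.0

# PCA is refit on the training part of every fold, then the classifier is trained.
model = make_pipeline(PCA(n_components=2), LogisticRegression(max_iter=1000))
scores = cross_validate(model, X, target)
print(np.mean(scores['test_score']))
print(np.mean(scores['fit_time']))
```

Here `fit_time` includes the PCA step, so it is not directly comparable with the timings of the classifier alone.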

Using PCA with clustering

km = KMeans(n_clusters=3, random_state=42)
km.fit(fruits_pca)

print(np.unique(km.labels_, return_counts=True))
# (array([0,1,2], dtype=int32), array([91,99,110]))

Visualization

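for label in range(0, 3):
    data = fruits_pca[km.labels_ == label]
    # x: first principal component, y: second principal component
    plt.scatter(data[:, 0], data[:, 1])
# The legend order assumes clusters 0, 1, 2 correspond to apple, banana, pineapple.
plt.legend(['apple', 'banana', 'pineapple'])
plt.show()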