2022-09-20

jm · September 22, 2022

TIL


📌 Supervised learning with the scikit-learn library

✍️ In classification the dependent variable (y, the answer, the target) splits into binary and multiclass cases; it is a categorical value (a type, a class, ...) rather than a continuous one, like picking an option on a multiple-choice question. Regression deals with continuous values.
✍️ Binary classification is set up so the answer comes out as yes/no. Three or more classes is multiclass classification: the answer is not yes/no but one specific class among several.

✍️ The independent variable is usually called x, Feature, or Data. Regression predicts a floating-point (real-valued) number.

✍️ The optimal model is the one with the best generalization performance. The more complex the model (the more it is trained), the more it overfits, so it fails to generalize when it sees new data.

✍️ k-nearest neighbors algorithm: find the closest training data point(s) as the nearest neighbors and use them for the prediction.
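
Not part of the original notes: a rough NumPy sketch of what "find the closest point and copy its label" means, with a made-up toy dataset (predict_1nn and the demo arrays are mine, not scikit-learn's implementation).

# Minimal 1-nearest-neighbor sketch (illustrative only)
import numpy as np

def predict_1nn(X_train, y_train, x_new):
  # Euclidean distance from the new point to every training point
  distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
  # the label of the closest training point becomes the prediction
  return y_train[np.argmin(distances)]

X_demo = np.array([[1.0, 1.0], [2.0, 2.5], [8.0, 8.0]])
y_demo = np.array([0, 0, 1])
print(predict_1nn(X_demo, y_demo, np.array([7.5, 8.2])))  # -> 1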

✅ Examples

✔️ Classifying iris species
▶️ Independent variables (x, feature, data): 4 measurements in cm, the length and width of the petals and sepals
▶️ Dependent variable (y, class, target): the flower species (setosa, virginica, versicolor)

(+ Since we are classifying the iris species from the petal and sepal measurements, those measurements become the independent variables and the species becomes the dependent variable.)

✍️ Preparing the data

# Prepare the data
from sklearn.datasets import load_iris
iris_dataset = load_iris()

Inspecting the data shows that the values are stored as a NumPy array.

Checking with shape shows the matrix layout: thought of as a DataFrame, there are 150 rows and 4 columns.
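
For reference, one way to run that check (my addition; it uses the iris_dataset object loaded above):

# Quick inspection of the Bunch object returned by load_iris()
print(iris_dataset.keys())          # data, target, target_names, feature_names, ...
print(type(iris_dataset['data']))   # <class 'numpy.ndarray'>
print(iris_dataset['data'].shape)   # (150, 4) -> 150 rows, 4 feature columns
print(iris_dataset['data'][:3])     # first three samples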


✍️ Drawing a scatter plot

import matplotlib.pyplot as plt
import pandas as pd

# Analyze the data with a DataFrame -> check how the independent variables (features) relate to the dependent variable (label)
iris_df = pd.DataFrame(iris_dataset['data'], columns=iris_dataset.feature_names)

# 4x4 scatter matrix of the features

pd.plotting.scatter_matrix(iris_df, c=iris_dataset['target'], figsize=(15,15),
                           marker='o', hist_kwds={'bins':20}, s=60, alpha=.8)
plt.show()

Loading the data as a pandas DataFrame and drawing the scatter matrix shows the distribution for each pair of variables. The panel with petal length on the x axis and petal width on the y axis separates the iris species most clearly.
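
Not in the original post: a quick added look at just that pair of columns, reusing iris_df and iris_dataset from above.

# Plot only petal length vs. petal width, colored by species
plt.figure(dpi=100)
plt.scatter(iris_df['petal length (cm)'], iris_df['petal width (cm)'],
            c=iris_dataset['target'], alpha=0.8)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.show()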

import numpy as np

plt.imshow([np.unique(iris_dataset['target'])])
_ = plt.xticks(ticks=np.unique(iris_dataset['target']), labels=iris_dataset['target_names'])  # assign to _ when you do not want the return value printed


Check how the values of the dependent variable (target) are encoded:
setosa is 0, versicolor is 1, and virginica is 2.
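
The same check also works without a plot (a small added alternative):

# Print the class names and how often each encoded value appears
import numpy as np

print(iris_dataset['target_names'])                           # ['setosa' 'versicolor' 'virginica']
print(np.unique(iris_dataset['target'], return_counts=True))  # 0, 1, 2 with 50 samples each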

iris_df2 = iris_df[['petal length (cm)', 'petal width (cm)']]

Only the two variables from the most informative scatter plot above are pulled out to build a new DataFrame.

pd.plotting.scatter_matrix(iris_df2, c=iris_dataset['target'], figsize=(15,15),
                           marker='o', hist_kwds={'bins':20}, s=60, alpha=.8)
plt.show()

Drawing the scatter matrix again now produces a 2x2 grid.


✍️ Splitting the data into training and test sets

# training data : test data -> usually a 7:3, 75:25, 80:20, or 90:10 ratio

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'],  # uppercase X: 2-D feature matrix, lowercase y: 1-D target
                                                    test_size=0.25, random_state=777)  # random_state fixes the seed so the split is reproducible
                                                    
# Check the training data: 75% of 150 -> 112 samples

X_train.shape

# Check the test data: 25% of 150 -> 38 samples

X_test.shape

✍️ Setting up the machine learning model -> k-nearest neighbors

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1) # use a single neighbor

# Train the model
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

✍️ Evaluating the model

# Check the accuracy
# 1) accuracy via the mean() function
np.mean(y_pred == y_test)

🔼 Result

# 2) accuracy via the score() function -> predicts on the test set and reports the accuracy
knn.score(X_test, y_test)

🔼 Result

# 3) compute evaluation metrics
from sklearn import metrics

knn_report = metrics.classification_report(y_test, y_pred)
print(knn_report)

🔼 Result
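
As an extra step that is not in the original notes, a confusion matrix from the same predictions shows where the errors fall:

# 4) confusion matrix: rows = true classes, columns = predicted classes
from sklearn import metrics

print(metrics.confusion_matrix(y_test, y_pred))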

✔️ forge
▶️ An artificially generated binary classification dataset

# install (run in a terminal, or prefix with ! inside a notebook)
pip install mglearn

# Inspect the binary classification dataset
import mglearn
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

# Generate the dataset
X , y = mglearn.datasets.make_forge()

# Check the data
print('X.shape : ', X.shape)
print('y.shape : ', y.shape)
# Draw a scatter plot
plt.figure(dpi=100)
plt.rc('font', family='NanumBarunGothic')  # Korean font used in the original environment

mglearn.discrete_scatter(X[:,0], X[:,1], y)

plt.legend(['Class 0', 'Class 1'], loc=4)
plt.xlabel('First feature')
plt.ylabel('Second feature')


✍️ k-nearest neighbors algorithm

# 1 nearest neighbor
import mglearn
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

plt.figure(dpi=100)
mglearn.plots.plot_knn_classification(n_neighbors=1)

🔼 Result

# 3 nearest neighbors
plt.figure(dpi=100)
mglearn.plots.plot_knn_classification(n_neighbors=3)

🔼 Result
: The prediction with k = 1 differs from the prediction with k = 3, so we need to find the sweet spot.
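
To see that difference as numbers rather than from the mglearn plots, here is a small sketch of my own that fits both settings on the same forge split (variable names X_f, X_tr, etc. are just for this example):

# Compare predictions and accuracy with k = 1 and k = 3 on the same split
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_f, y_f = mglearn.datasets.make_forge()
X_tr, X_te, y_tr, y_te = train_test_split(X_f, y_f, random_state=7)

for k in (1, 3):
  clf_k = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
  print(k, clf_k.predict(X_te), clf_k.score(X_te, y_te))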


✍️ Defining the binary classification problem

# Prepare the data
X, y = mglearn.datasets.make_forge() # X : data (features, independent variables), y : labels (dependent variable)
from sklearn.model_selection import train_test_split

# Split the data into a training set and a test set so generalization performance can be evaluated
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)  # 75:25

X_train.shape  # 26 samples -> 19 for training
X_test.shape  # 26 samples -> 7 for testing

# Set up the k-nearest neighbors classifier, based on the result above
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)

# Train the model
clf.fit(X_train, y_train)

# Check prediction accuracy with the score() function
clf.score(X_test, y_test)
clf.score(X_train, y_train) # training accuracy is higher than test accuracy -> an overfitting situation

✍️ Evaluating KNeighborsClassifier as the number of neighbors changes

# Lists to store the accuracy for each number of neighbors
train_scores = []
test_scores = []

n_neighbors_settings = range(1,15)

# Increase n_neighbors from 1 to 14, train, and store the accuracies
for n_neighbor in n_neighbors_settings:
  # create the model
  clf = KNeighborsClassifier(n_neighbors=n_neighbor)
  clf.fit(X_train, y_train)

  # store the training set accuracy
  train_scores.append(clf.score(X_train, y_train))

  # store the test set accuracy
  test_scores.append(clf.score(X_test, y_test))

# Plot the accuracy comparison
plt.figure(dpi=100)

plt.plot(n_neighbors_settings, train_scores, label='training accuracy')
plt.plot(n_neighbors_settings, test_scores, label='test accuracy')
plt.ylabel('accuracy')
plt.xlabel('number of neighbors')
plt.legend()
plt.show()

# The sweet spot is 3!

✔️ Predicting malignant tumors (label 0) with the Wisconsin breast cancer dataset
▶️ Independent variables (x, feature, data): run cancer.feature_names to list them (see the quick check after loading the data below)
▶️ Dependent variable (y, class, target): malignant or benign

# Load the data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
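
The quick check mentioned above (my addition, using the cancer object just loaded):

# Feature names, target names, and class balance of the dataset
import numpy as np

print(cancer.feature_names)        # 30 feature names (mean radius, mean texture, ...)
print(cancer.target_names)         # ['malignant' 'benign'] -> encoded as 0 and 1
print(cancer.data.shape)           # (569, 30)
print(np.bincount(cancer.target))  # number of samples in each class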

# Draw a scatter matrix
import pandas as pd

df = pd.DataFrame(cancer['data'], columns=cancer.feature_names)
pd.plotting.scatter_matrix(df, c=cancer['target'], figsize=(15,15),
                           marker='o', hist_kwds={'bins':20}, s=10, alpha=.8)
plt.show()
# Check the target values
import numpy as np

plt.imshow([np.unique(cancer['target'])])
_ = plt.xticks(ticks=np.unique(cancer['target']), labels=cancer['target_names'])

🔼 Result (0 is malignant, 1 is benign)

# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=7)

# Find the best number of neighbors
# Lists to store the accuracy for each number of neighbors
train_scores = []
test_scores = []

n_neighbors_settings = range(1,21)

# Increase n_neighbors from 1 to 20, train, and store the accuracies
for n_neighbor in n_neighbors_settings:
  # create the model
  clf = KNeighborsClassifier(n_neighbors=n_neighbor)
  clf.fit(X_train, y_train)

  # store the training set accuracy
  train_scores.append(clf.score(X_train, y_train))

  # store the test set accuracy
  test_scores.append(clf.score(X_test, y_test))

# Plot the accuracy comparison
plt.figure(dpi=100)

plt.plot(n_neighbors_settings, train_scores, label='training accuracy')
plt.plot(n_neighbors_settings, test_scores, label='test accuracy')
plt.ylabel('accuracy')
plt.xlabel('number of neighbors')
plt.legend()
plt.show()

🔼 Result (the sweet spot is around 7 to 8!)
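
Instead of eyeballing the curve, the best-scoring k can also be read off programmatically (a small added sketch that reuses the lists built above):

# Index of the highest test accuracy -> corresponding number of neighbors
import numpy as np

best_idx = int(np.argmax(test_scores))
print('best n_neighbors :', list(n_neighbors_settings)[best_idx])
print('test accuracy    :', test_scores[best_idx])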
