โ‘ก ๐Ÿค– Machine Learning 1์ผ์ฐจ - Scikit-learn ๊ธฐ์ดˆ ๋ฐ ์„ ํ˜• ํšŒ๊ท€

JItzel · December 10, 2025

๐Ÿก Machine_learning

๋ชฉ๋ก ๋ณด๊ธฐ
2/14

1. What is Scikit-learn?

Scikit-learn is the most widely used machine learning library in Python.
It provides a wide range of classification, regression, and clustering algorithms, and its thorough documentation makes it a great place for beginners to start.

์„ค์น˜

pip install scikit-learn

๊ณต์‹ DOC


2. ๋จธ์‹ ๋Ÿฌ๋‹ ํ•™์Šต ๋ฐฉ๋ฒ• ๋ถ„๋ฅ˜ (with Scikit-learn)

์‚ฌ์ดํ‚ท๋Ÿฐ์€ ๋ฐ์ดํ„ฐ์˜ ํ˜•ํƒœ(์ •๋‹ต ์œ ๋ฌด)์™€ ๋ชฉ์ ์— ๋”ฐ๋ผ ๋‹ค์–‘ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ง€์›ํ•œ๋‹ค.

์ง€๋„ ํ•™์Šต (Supervised Learning)

  • ๋ชฉํ‘œ ๋ณ€์ˆ˜(์ •๋‹ต)์ด ์žˆ๋Š” ๊ฒฝ์šฐ

1) ๋ถ„๋ฅ˜(Classification)

  • ๋ฐ์ดํ„ฐ๊ฐ€ ์–ด๋–ค ๋ฒ”์ฃผ(Class)์— ์†ํ•˜๋Š”์ง€ ์˜ˆ์ธก
  • Ex) ๊ณต๋ถ€ ์‹œ๊ฐ„, ์ถœ์„ ์ผ์ˆ˜์— ๋”ฐ๋ฅธ ํ•ฉ๊ฒฉ/๋ถˆํ•ฉ๊ฒฉ ์—ฌ๋ถ€, ํ•™์ (A/B) ์˜ˆ์ธก
  • ์ฃผ์š” ์•Œ๊ณ ๋ฆฌ์ฆ˜: KNN, Naive Bayes(ํ†ต๊ณ„์  ํ™•๋ฅ  ๊ธฐ๋ฐ˜), ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€, ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด(Decision Tree), Random Forest, SVM

2) ํšŒ๊ท€(Regression/Estimation)

  • ๋ฐ์ดํ„ฐ์˜ ํŠน์ง•์„ ๋ฐ”ํƒ•์œผ๋กœ ์—ฐ์†๋œ ์ˆ˜์น˜(๊ฐ’)๋ฅผ ์ถ”์ •
  • Ex) ๊ณต๋ถ€ ์‹œ๊ฐ„, ์ถœ์„ ์ผ์ˆ˜์— ๋”ฐ๋ฅธ '์‹œํ—˜ ์ ์ˆ˜' ์˜ˆ์ธก
  • ์ฃผ์š” ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์„ ํ˜• ํšŒ๊ท€(Linear Regression), ๋ฆฟ์ง€(Ridge), ๋ผ์˜(Lasso)

๋น„์ง€๋„ํ•™์Šต(Unsupervised Learning)

  • ๋ชฉํ‘œ ๋ณ€์ˆ˜(์ •๋‹ต)์ด ์—†๋Š” ๊ฒฝ์šฐ
  • ์ฐจ์› ์ถ•์†Œ: ๋ฐ์ดํ„ฐ์˜ ๋ณต์žก์„ฑ์„ ์ค„์ž„ (PCA)
  • ๊ตฐ์ง‘ํ™”: ๋ฐ์ดํ„ฐ๋ฅผ ๋น„์Šทํ•œ ๊ทธ๋ฃน์œผ๋กœ ๋ฌถ์Œ (K-means)
  • ์—ฐ๊ด€ ๊ทœ์น™: ๋ฐ์ดํ„ฐ ๊ฐ„์˜ ์—ฐ๊ด€์„ฑ ๋ฐœ๊ฒฌ (์žฅ๋ฐ”๊ตฌ๋‹ˆ ๋ถ„์„)
  • Tip: ํ…์ŠคํŠธ๋‚˜ ์ด๋ฏธ์ง€ ๊ฐ™์€ ๋น„์ •ํ˜• ๋ฐ์ดํ„ฐ๋Š” ์ตœ๊ทผ LLM(๊ฑฐ๋Œ€์–ธ์–ด๋ชจ๋ธ)์ด๋‚˜ ๋”ฅ๋Ÿฌ๋‹์„ ์ฃผ๋กœ ํ™œ์šฉํ•˜๋Š” ์ถ”์„ธ.
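As a quick illustration of the categories above, here is a minimal sketch (the numbers are made up for illustration, not from this post) that fits one supervised classifier, one supervised regressor, and one unsupervised clusterer:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # feature: study hours (made up)
y_class = np.array([0, 0, 1, 1])             # label: fail(0)/pass(1) -> classification
y_reg = np.array([40.0, 55.0, 65.0, 80.0])   # label: exam score -> regression

clf = LogisticRegression().fit(X, y_class)   # supervised: classification
reg = LinearRegression().fit(X, y_reg)       # supervised: regression
km = KMeans(n_clusters=2, n_init=10).fit(X)  # unsupervised: no labels passed

print(clf.predict([[3.5]]))   # predicted class for 3.5 study hours
print(reg.predict([[3.5]]))   # predicted score for 3.5 study hours
print(km.labels_)             # cluster assignment of each sample
```

Note that the clustering call receives only X; grouping emerges from the data itself, which is exactly the supervised/unsupervised distinction.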

3. Scikit-learn Key Modules

๋ชจ๋“ˆ์„ค๋ช…
datasets์—ฐ์Šต์šฉ ๋‚ด์žฅ ์˜ˆ์ œ ๋ฐ์ดํ„ฐ ์„ธํŠธ (Iris, Boston ๋“ฑ)
preprocessing๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ (์ •๊ทœํ™”, ์Šค์ผ€์ผ๋ง, ์ธ์ฝ”๋”ฉ ๋“ฑ)
feature_selection์˜๋ฏธ ์žˆ๋Š” ํŠน์ง•(Feature)๋งŒ ์„ ํƒํ•˜๋Š” ๊ธฐ๋Šฅ
feature_extractionFeature ์ถ”์ถœ
decomposition์ฐจ์› ์ถ•์†Œ (PCA ๋“ฑ)
model_selectionํ•™์Šต/ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ(train_test_split), ๊ต์ฐจ ๊ฒ€์ฆ, ๊ทธ๋ฆฌ๋“œ ์„œ์น˜ ๋“ฑ
metrics๋ชจ๋ธ ์„ฑ๋Šฅ ํ‰๊ฐ€ (Accuracy, RMSE, ROC-AUC ๋“ฑ)
pipeline์ „์ฒ˜๋ฆฌ + ๋ชจ๋ธ๋ง ๋ฌถ์–ด์„œ ์‹คํ–‰
linear_model์„ ํ˜• ํšŒ๊ท€, ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€, SGD ๋“ฑ ์•Œ๊ณ ๋ฆฌ์ฆ˜
svmSVM
neighborsKNN
naive_bayesNB ๋ชจ๋ธ
tree์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด ์•Œ๊ณ ๋ฆฌ์ฆ˜
ensemble์•™์ƒ๋ธ” ์•Œ๊ณ ๋ฆฌ์ฆ˜ (Random Forest ๋“ฑ)
cluster๋น„์ง€๋„ ๊ตฐ์ง‘ ์•Œ๊ณ ๋ฆฌ์ฆ˜ (K-Means ๋“ฑ)

4. ์„ ํ˜•ํšŒ๊ท€ (Linear Regression)

์˜ˆ์ œ : ์ž๋™์ฐจ์˜ ์†๋„(speed)์— ๋”ฐ๋ฅธ ์ œ๋™ ๊ฑฐ๋ฆฌ(dist)๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด๋ณด์ž

1) ๋ฐ์ดํ„ฐ ์ค€๋น„

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

# ํ•œ๊ธ€ ํฐํ„ฐ ์„ค์ •(Windows ๊ธฐ์ค€)
matplotlib.rcParams['font.family']='Malgun Gothic'
matplotlib.rcParams['axes.unicode_minus'] = False

# ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ ์ž„ํฌํŠธ
# SGDRegressor๋Š” ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์„ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ, ์ด๋ฒˆ์—” ์ผ๋ฐ˜ LinearRegression ์‚ฌ์šฉ
from sklearn.linear_model import LinearRegression, SGDRegressor

# ๋ฐ์ดํ„ฐ ๋กœ๋“œ
carDF = pd.read_csv('cars.csv', index_col='Unnamed: 0')
carDF

2) ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ (์ค‘์š”: 2์ฐจ์› ๋ฐฐ์—ด ๋ณ€ํ™˜)

Scikit-learn์˜ ๋ชจ๋ธ๋“ค์€ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ(XX)๋กœ 2์ฐจ์› ๋ฐฐ์—ด(ํ–‰๋ ฌ)์„ ๊ธฐ๋Œ€
๋”ฐ๋ผ์„œ Series๋‚˜ 1์ฐจ์› ๋ฆฌ์ŠคํŠธ๋ฅผ DataFrame ํ˜•ํƒœ([[ ]]) ํ˜น์€ .reshape(-1, 1)์„ ํ†ตํ•ด ๋ณ€ํ™˜ํ•ด์ฃผ์–ด์•ผ ํ•œ๋‹ค.
fit(x, y)๋Š” ๋ฐ˜๋“œ์‹œ ํ–‰๋ ฌ(2D) ํ˜•ํƒœ ํ•„์š”

# ํŠน์„ฑ ๋ฐ์ดํ„ฐ(X)์™€ ๋ผ๋ฒจ(y) ๋ถ„๋ฆฌ
# .values๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ numpy array๋กœ ๋ณ€ํ™˜ํ•˜๊ณ , 
# ๋Œ€๊ด„ํ˜ธ๋ฅผ ๋‘ ๋ฒˆ[[ ]] ์จ์„œ 2์ฐจ์› ๊ตฌ์กฐ ์œ ์ง€
x = carDF[['speed']].values  # (n_samples, n_features) ํ˜•ํƒœ
y = carDF[['dist']].values   # (n_samples, n_targets) ํ˜•ํƒœ

print(x.shape) # ๊ฒฐ๊ณผ ์˜ˆ: (50, 1) -> 2์ฐจ์› ํ™•์ธ ํ•„์ˆ˜!
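The 1-D vs 2-D distinction can be checked directly. This standalone sketch (a small made-up stand-in, independent of cars.csv) shows both conversion routes mentioned above, double brackets and reshape(-1, 1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'speed': [4, 7, 8, 9]})  # made-up stand-in data

s = df['speed']           # single brackets -> Series, 1-D
print(s.shape)            # (4,)

X = df[['speed']].values  # double brackets -> DataFrame -> 2-D array
print(X.shape)            # (4, 1)

# Equivalent route: reshape the 1-D array into an (n, 1) matrix
X2 = s.values.reshape(-1, 1)  # -1 lets numpy infer the row count
print(X2.shape)           # (4, 1)
```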

3) ๋ชจ๋ธ ํ•™์Šต (Fit)

fit(x, y) ๋ฉ”์„œ๋“œ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ(์ง์„ ์˜ ๋ฐฉ์ •์‹)์˜ ๊ธฐ์šธ๊ธฐ(WW)์™€ ์ ˆํŽธ(bb)์„ ํ•™์Šต

model = LinearRegression()
model.fit(x, y) # ํ•™์Šต ์‹œ์ž‘

4) ํ•™์Šต ๊ฒฐ๊ณผ ํ™•์ธ (๊ธฐ์šธ๊ธฐ์™€ ์ ˆํŽธ)

ํ•™์Šต๋œ ๋ชจ๋ธ์ด ์ฐพ์€ ์ตœ์ ์˜ ์ง์„  ์‹: y=wx+by = wx + b

print('๊ธฐ์šธ๊ธฐ:', model.coef_)
print('์ ˆํŽธ:', model.intercept_)

# ๊ฒฐ๊ณผ
๊ธฐ์šธ๊ธฐ [[3.93240876]]
์ ˆํŽธ [-17.57909489]

โ†’\rightarrow ์ฆ‰, ์ด ๋ชจ๋ธ์€ dist = 3.93 * speed - 17.57 ์ด๋ผ๋Š” ๊ณต์‹์„ ๋„์ถœํ–ˆ๋‹ค.

5) ์˜ˆ์ธก (Predict)

์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๊ฒฐ๊ณผ๋ฅผ ์˜ˆ์ธกํ•ด๋ณด๊ธฐ.
ํ•™์Šต ๋•Œ์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์ž…๋ ฅ๊ฐ’์€ 2์ฐจ์› ๋ฐฐ์—ด์ด์–ด์•ผ ํ•œ๋‹ค.

# Case 1: ์ˆ˜์‹์œผ๋กœ ์ง์ ‘ ๊ณ„์‚ฐ (๋น„๊ถŒ์žฅ)
# ๊ฒฐ๊ณผ๋Š” ๋‚˜์˜ค์ง€๋งŒ, ๋ชจ๋ธ ๊ฐ์ฒด๋ฅผ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ข‹์Œ
val = 10 * float(model.coef_) + float(model.intercept_)
print(val) # 21.74...

# Case 2: predict ๋ฉ”์„œ๋“œ ์‚ฌ์šฉ (๊ถŒ์žฅ) ๐Ÿ‘
# 2์ฐจ์› ๋ฐฐ์—ด๋กœ ์ž…๋ ฅํ•ด์•ผ ํ•จ์— ์ฃผ์˜! [[10]]
print(model.predict([[10]])) 
# array([[21.7449927]])

# ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๊ฐ’ ๋™์‹œ ์˜ˆ์ธก
print(model.predict([[10], [15]]))
# array([[21.7449927],
#        [41.4070365]])

6) ์‹œ๊ฐํ™”

# 1. ์‹ค์ œ ๋ฐ์ดํ„ฐ (์‚ฐ์ ๋„)
plt.scatter(x, y, label='์‹ค์ œ๊ฐ’')

# 2. ์˜ˆ์ธก ๋ฐ์ดํ„ฐ (์„  ๊ทธ๋ž˜ํ”„)
# ์˜ˆ์ธก์„  ๊ทธ๋ฆฌ๊ธฐ ์œ„ํ•ด x์˜ ์ตœ์†Œ~์ตœ๋Œ€ ๋ฒ”์œ„ ์ƒ์„ฑ
pred = model.predict(x) 

plt.plot(x, pred, 'r--', label='์˜ˆ์ธก์„ ')
plt.show()

[์ด๋ฏธ์ง€ ์‚ฝ์ž…]


5. ์™œ ๋ฐ์ดํ„ฐ๋ฅผ 'ํ–‰๋ ฌ'๋กœ ์ฃผ์–ด์•ผ ํ• ๊นŒ?

  • ์‚ฌ์ดํ‚ท๋Ÿฐ์ด ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ํ–‰๋ ฌ(Matrix) ํ˜•ํƒœ๋กœ ์š”๊ตฌํ•˜๋Š” ๊ทผ๋ณธ์ ์ธ ์ด์œ ๋Š” ์—ฐ์‚ฐ ํšจ์œจ์„ฑ ๋•Œ๋ฌธ์ด๋‹ค.

ํ–‰๋ ฌ ๊ณฑ(Matrix Multiplication)์˜ ์›๋ฆฌ

  • ์ปดํ“จํ„ฐ๋Š” ๋ฃจํ”„(Loop)๋ฅผ ๋Œ๋ฉฐ ํ•˜๋‚˜์”ฉ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค, ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ํ•œ ๋ฒˆ์— ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ํ›จ์”ฌ ๋น ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ์„ ํ˜• ํšŒ๊ท€์˜ ์˜ˆ์ธก ๊ณต์‹์€ H(x)=wx+bH(x) = wx + b ์ด๋‹ค.

  • ๋ฐ์ดํ„ฐ๊ฐ€ ๋งŽ์•„์ง€๋ฉด ์ด๋ฅผ ํ–‰๋ ฌ์‹ H(X)=XW+bH(X) = XW + b ๋กœ ํ‘œํ˜„ํ•˜์—ฌ ์ฒ˜๋ฆฌํ•œ๋‹ค. ์ด๋•Œ ํ–‰๋ ฌ ๊ณฑ์ด ์„ฑ๋ฆฝํ•˜๋ ค๋ฉด ์•ž ํ–‰๋ ฌ์˜ ์—ด ๊ฐœ์ˆ˜์™€ ๋’ค ํ–‰๋ ฌ์˜ ํ–‰ ๊ฐœ์ˆ˜๊ฐ€ ๊ฐ™์•„์•ผ ํ•œ๋‹ค(์ฃผ์˜!)

(Nร—M)โ‹…(Mร—K)=(Nร—K)(N \times M) \cdot (M \times K) = (N \times K)

Numpy ์‹ค์Šต

import numpy as np

# A ํ–‰๋ ฌ (2 x 2)
a = np.array([[1, 2], [3, 4]])

# B ํ–‰๋ ฌ (2 x 3)
b = np.array([[1, 2, 3], [3, 4, 5]])

# ํ–‰๋ ฌ ๊ณฑ ์—ฐ์‚ฐ
# ๋ฐฉ๋ฒ• 1: np.matmul
print(np.matmul(a, b))

# ๋ฐฉ๋ฒ• 2: @ ์—ฐ์‚ฐ์ž (ํŒŒ์ด์ฌ ์ถ”์ฒœ)
print(a @ b)

# ๊ฒฐ๊ณผ (2 x 3 ํ–‰๋ ฌ์ด ๋‚˜์˜ด)
# array([[ 7, 10, 13],
#        [15, 22, 29]])
  • ๊ฒฐ๋ก : H(x)=X@W+bH(x) = X @ W + b

์šฐ๋ฆฌ๊ฐ€ model.predict(X)๋ฅผ ํ˜ธ์ถœํ•  ๋•Œ, ๋‚ด๋ถ€์ ์œผ๋กœ๋Š” ์œ„์™€ ๊ฐ™์€ ํ–‰๋ ฌ ๊ณฑ ์—ฐ์‚ฐ์ด ์ผ์–ด๋‚˜๋ฉฐ ์ˆ˜๋งŒ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋„ ์ˆœ์‹๊ฐ„์— ์˜ˆ์ธก๊ฐ’์„ ๋‚ด๋†“์„ ์ˆ˜ ์žˆ๊ฒŒ๋œ๋‹ค.


์š”์•ฝ

  • Scikit-learn์€ ํŒŒ์ด์ฌ ๋จธ์‹ ๋Ÿฌ๋‹์˜ ํ‘œ์ค€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
  • ์ง€๋„ ํ•™์Šต์˜ ํšŒ๊ท€ ๋ฌธ์ œ๋ฅผ ํ’€๊ธฐ ์œ„ํ•ด LinearRegression์„ ์‚ฌ์šฉ
  • fit โ†’ weight/coef ํ•™์Šต(coef = ๊ธฐ์šธ๊ธฐ, intercept = y์ ˆํŽธ)
  • ๋ชจ๋ธ ํ•™์Šต(fit)๊ณผ ์˜ˆ์ธก(predict) ์‹œ ๋ฐ์ดํ„ฐ๋Š” ๋ฐ˜๋“œ์‹œ 2์ฐจ์› ํ–‰๋ ฌ(Array) ํ˜•ํƒœ์—ฌ์•ผ ํ•œ๋‹ค.
  • ๊ทธ ์ด์œ ๋Š” ๋‚ด๋ถ€์ ์œผ๋กœ ํ–‰๋ ฌ ๊ณฑ(@) ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ํšจ์œจ์„ ๋†’์ด๊ธฐ ๋•Œ๋ฌธ