โ‘ฅ ๐Ÿค– Machine Learning 2์ผ์ฐจ - ๋ฐ์ดํ„ฐ ๋ถ„ํ• (Train-Test Split)

JItzelยท2025๋…„ 12์›” 11์ผ

๐Ÿก Machine_learning

๋ชฉ๋ก ๋ณด๊ธฐ
6/14

๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ(Train-Test Split)์™€ ๋ชจ๋ธ ํ‰๊ฐ€(์˜ค์ฐจํ–‰๋ ฌ)

1. Train-Test Split (Splitting into Training/Test Data)

Why It's Needed

If you spend all of your data on studying (training) the model, it will always give the right answer, because it has effectively memorized the answers. But the moment it faces new, unseen data, it falls apart.

  • Purpose: to evaluate how well the model performs on data it has never seen (generalization performance)
  • Key point: preventing overfitting

์‚ฌ์šฉ ๋ฐฉ๋ฒ• (train_test_split)

  • Scikit-learn์˜ model_selection ๋ชจ๋“ˆ ์‚ฌ์šฉ
from sklearn.model_selection import train_test_split

# ๊ธฐ๋ณธ ์‚ฌ์šฉ๋ฒ•
X_train, X_test, y_train, y_test = train_test_split(
    X,                # ํŠน์„ฑ ๋ฐ์ดํ„ฐ (Feature)
    y,                # ์ •๋‹ต ๋ฐ์ดํ„ฐ (Label)
    test_size=0.3,    # ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ๋น„์œจ (30%)
    random_state=42,  # ๋‚œ์ˆ˜ ์‹œ๋“œ (๋งค๋ฒˆ ๋˜‘๊ฐ™์ด ์„ž์ด๋„๋ก)
    stratify=y        # โญ ๋ถ„๋ฅ˜ ๋ฌธ์ œ ํ•„์ˆ˜ ์˜ต์…˜!
)

์ค‘์š”: stratify ์˜ต์…˜
์˜๋ฏธ: ์ธตํ™” ์ถ”์ถœ. ์ •๋‹ต ๋ฐ์ดํ„ฐ(yy)์˜ ํด๋ž˜์Šค ๋น„์œจ(์ •๋‹ต ๋น„์œจ)์„ ์œ ์ง€ํ•˜๋ฉฐ ๋‚˜๋ˆˆ๋‹ค.
Why?
๋งŒ์•ฝ ์•” ํ™˜์ž ๋ฐ์ดํ„ฐ๊ฐ€ ์ „์ฒด์˜ 5%๋ฐ–์— ์—†๋Š”๋ฐ, ๋ง‰ ๋‚˜๋ˆ„๋‹ค๊ฐ€ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ์•” ํ™˜์ž๊ฐ€ ํ•œ ๋ช…๋„ ์•ˆ ๋“ค์–ด๊ฐ„๋‹ค๋ฉด? ๋ชจ๋ธ์€ ํ•™์Šต์„ ์ œ๋Œ€๋กœ ํ•  ์ˆ˜ ์—†๋‹ค. ๋ถ„๋ฅ˜ ๋ฌธ์ œ์—์„œ๋Š” stratify=y๋ฅผ ๊ผญ ์จ์ฃผ๋Š” ๊ฒƒ์ด ์ข‹๋‹ค.

2. ์‹ค์Šต: Pima Indians Diabetes ๋ฐ์ดํ„ฐ ๋ถ„ํ• 

1) ๋ฐ์ดํ„ฐ ์ค€๋น„ ๋ฐ ๋ถ„ํ• 

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# ๋ฐ์ดํ„ฐ ๋กœ๋“œ
df = pd.read_csv('data/pima-indians-diabetes.data.csv')

# ํŠน์„ฑ(X)๊ณผ ๋ผ๋ฒจ(y) ๋ถ„๋ฆฌ
# .values๋ฅผ ์จ์„œ Numpy Array ํ˜•ํƒœ๋กœ ๋ณ€ํ™˜
x_data = df.iloc[:, :-1].values
y_data = df.iloc[:, -1].values

# ๋ฐ์ดํ„ฐ ๋ถ„ํ•  (7:3 ๋น„์œจ)
# ๋ฐ˜ํ™˜๋˜๋Š” 4๊ฐœ์˜ ๋ฐฐ์—ด์„ ์–ธํŒจํ‚น(Unpacking)์œผ๋กœ ๋ฐ›์Œ
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, 
    test_size=0.3, 
    stratify=y_data # ์ •๋‹ต ๋น„์œจ ์œ ์ง€
)

# ์ฐจ์›(Shape) ํ™•์ธ
print(f"ํ•™์Šต ๋ฐ์ดํ„ฐ: {x_train.shape}")
# (537, 8)

print(f"ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: {y_test.shape}")
# (231,) -> ์ „์ฒด 768๊ฐœ์˜ ์•ฝ 30%

2) ๋ชจ๋ธ ํ•™์Šต ๋ฐ ํ‰๊ฐ€ (Accuracy)

๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ค๊ณ  ์ ์ˆ˜๋ฅผ ํ™•์ธํ•ด์ž.

  • ํšŒ๊ท€ ๋ชจ๋ธ์˜ score: ๊ฒฐ์ •๊ณ„์ˆ˜(R2R^2)
  • ๋ถ„๋ฅ˜ ๋ชจ๋ธ์˜ score: ์ •ํ™•๋„(Accuracy)
# ๋ชจ๋ธ ์ƒ์„ฑ ๋ฐ ํ•™์Šต
model = LogisticRegression(max_iter=500, verbose=True)
model.fit(x_train, y_train)

# 1. ํ•™์Šต ๋ฐ์ดํ„ฐ ์ ์ˆ˜ (Train Score)
train_score = model.score(x_train, y_train)
print(f"Train Accuracy: {train_score}") 
# ๊ฒฐ๊ณผ: 0.782... (์•ฝ 78.2%)

# 2. ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์ ์ˆ˜ (Test Score)
test_score = model.score(x_test, y_test)
print(f"Test Accuracy: {test_score}")
# ๊ฒฐ๊ณผ: 0.774... (์•ฝ 77.4%)

๊ณผ์ ํ•ฉ(Overfitting) vs ๊ณผ์†Œ์ ํ•ฉ(Underfitting) ํŒ๋ณ„

  • ํ•™์Šต ์ ์ˆ˜์™€ ํ…Œ์ŠคํŠธ ์ ์ˆ˜์˜ ์ฐจ์ด๋ฅผ ๋ณด๊ณ  ๋ชจ๋ธ ์ƒํƒœ๋ฅผ ์ง„๋‹จ
์ƒํƒœ์ ์ˆ˜ ๋น„๊ต์„ค๋ช…ํ•ด๊ฒฐ ๋ฐฉ์•ˆ
๊ณผ์ ํ•ฉ (Overfitting)Train โ‰ซ Test๊ณต๋ถ€๋งŒ ๋„ˆ๋ฌด ์ž˜ํ•˜๊ณ  ์‘์šฉ์„ ๋ชปํ•จ๋ฐ์ดํ„ฐ ์ถ”๊ฐ€, ๊ทœ์ œ(Regularization), ํ•™์Šต๋Ÿ‰ ์ค„์ด๊ธฐ
๊ณผ์†Œ์ ํ•ฉ (Underfitting)Test > Train๊ณต๋ถ€๋ฅผ ๋œ ํ•ด์„œ ๋‘˜ ๋‹ค ๋ชป ๋งž์ถคํ•™์Šต ๋ฐ˜๋ณต(Epoch) ๋Š˜๋ฆฌ๊ธฐ, ๋ชจ๋ธ ๋ณต์žก๋„ ๋†’์ด๊ธฐ
์ด์ƒ์ Train โ‰ˆ Test๋‘˜ ๋‹ค ์ ์ ˆํžˆ ๋†’์ŒGood!

โ†’\rightarrow ์œ„ ์‹ค์Šต ๊ฒฐ๊ณผ๋Š” 78.2% vs 77.4%๋กœ ์ฐจ์ด๊ฐ€ ํฌ์ง€ ์•Š์•„ ์ผ๋ฐ˜ํ™”๊ฐ€ ์ž˜ ๋œ ์ƒํƒœ๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

3. ํ˜ผ๋™ ํ–‰๋ ฌ (Confusion Matrix)

  • ์ •ํ™•๋„(Accuracy)๋งŒ์œผ๋กœ๋Š” '์–ด๋–ป๊ฒŒ' ํ‹€๋ ธ๋Š”์ง€ ์•Œ ์ˆ˜ ์—†๋‹ค. ์•” ํ™˜์ž๋ฅผ ์ •์ƒ์œผ๋กœ ์˜ˆ์ธกํ•œ ๊ฑด์ง€(์œ„ํ—˜!), ์ •์ƒ์„ ์•” ํ™˜์ž๋กœ ์˜ˆ์ธกํ•œ ๊ฑด์ง€(ํ•ดํ”„๋‹) ๊ตฌ๋ณ„ํ•ด์•ผํ•œ๋‹ค.

1) ๊ตฌ์กฐ ์ดํ•ด

  • TN (True Negative): ์•„๋‹Œ ๊ฒƒ์„ ์•„๋‹ˆ๋ผ๊ณ  ์ž˜ ๋งž์ถค (์ •ํƒ)
  • FP (False Positive): ์•„๋‹Œ๋ฐ ๋งž๋‹ค๊ณ  ์ž˜๋ชป ์˜ˆ์ธก (์˜คํƒ)
  • FN (False Negative): ๋งž๋Š”๋ฐ ์•„๋‹ˆ๋ผ๊ณ  ์ž˜๋ชป ์˜ˆ์ธก (๋ฏธํƒ - ๊ฐ€์žฅ ์œ„ํ—˜ํ•  ์ˆ˜ ์žˆ์Œ)
  • TP (True Positive): ๋งž๋Š” ๊ฒƒ์„ ๋งž๋‹ค๊ณ  ์ž˜ ๋งž์ถค (์ •ํƒ)

2) Scikit-learn ์‹ค์Šต

from sklearn.metrics import confusion_matrix

# ํ•™์Šต ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์˜ˆ์ธก๊ฐ’ ์ƒ์„ฑ
pred_train = model.predict(x_train)

# confusion_matrix(์‹ค์ œ๊ฐ’, ์˜ˆ์ธก๊ฐ’)
cm = confusion_matrix(y_train, pred_train)
print(cm)

# ๊ฒฐ๊ณผ
# [[312,  38],   -> ์‹ค์ œ 0(์ •์ƒ)์ธ ๋ฐ์ดํ„ฐ: 312๊ฐœ ๋งž์ถค / 38๊ฐœ ํ‹€๋ฆผ
#  [ 79, 108]]   -> ์‹ค์ œ 1(๋‹น๋‡จ)์ธ ๋ฐ์ดํ„ฐ: 79๊ฐœ ํ‹€๋ฆผ / 108๊ฐœ ๋งž์ถค
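As a sanity check, the accuracy reported by model.score can be recomputed by hand from this matrix: correct predictions (the diagonal) divided by the total. For a binary problem, NumPy's ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP:

```python
import numpy as np

# The train confusion matrix printed above
cm = np.array([[312,  38],
               [ 79, 108]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

# Accuracy = correct predictions (diagonal) / all predictions
accuracy = (tn + tp) / cm.sum()
print(f"{accuracy:.3f}")  # -> 0.782, matching the Train Accuracy above
```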

4. ์‹œ๊ฐํ™”: ํžˆํŠธ๋งต (Heatmap)

  • ์ˆซ์ž๋กœ ๋ณด๋Š” ๊ฒƒ๋ณด๋‹ค Seaborn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ํžˆํŠธ๋งต์„ ๊ทธ๋ฆฌ๋ฉด ํ›จ์”ฌ ์ง๊ด€์ ์œผ๋กœ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

Train ๋ฐ์ดํ„ฐ ํžˆํŠธ๋งต

import seaborn as sb
import matplotlib.pyplot as plt

# ํ•œ๊ธ€ ํฐํŠธ ์„ค์ •
matplotlib.rcParams['font.family']='Malgun Gothic'

sb.heatmap(
    cm, 
    annot=True,        # ์ˆซ์ž ํ‘œ์‹œ ์—ฌ๋ถ€
    fmt='d',           # ์ •์ˆ˜(decimal)๋กœ ํ‘œ์‹œ (์•ˆ ์“ฐ๋ฉด ์ง€์ˆ˜ ํ˜•ํƒœ 3.1e+2 ๋กœ ๋‚˜์˜ด)
    linewidths=0.2,    # ์นธ ์‚ฌ์ด ๊ฐ„๊ฒฉ
    cmap='Reds',       # ์ƒ‰์ƒ ํ…Œ๋งˆ
    xticklabels=['๋‹น๋‡จ์•„๋‹˜(์˜ˆ์ธก)', '๋‹น๋‡จ(์˜ˆ์ธก)'], 
    yticklabels=['๋‹น๋‡จ์•„๋‹˜(์‹ค์ œ)', '๋‹น๋‡จ(์‹ค์ œ)']
)

plt.title("Train Confusion Matrix")
plt.show()

Test ๋ฐ์ดํ„ฐ ํžˆํŠธ๋งต

  • ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ

pred_test = model.predict(x_test)
cm_test = confusion_matrix(y_test, pred_test)

sb.heatmap(
    cm_test, 
    annot=True, 
    fmt='d', 
    linewidths=0.2, 
    cmap='Reds', 
    xticklabels=['๋‹น๋‡จ์•„๋‹˜(์˜ˆ์ธก)', '๋‹น๋‡จ(์˜ˆ์ธก)'], 
    yticklabels=['๋‹น๋‡จ์•„๋‹˜(์‹ค์ œ)', '๋‹น๋‡จ(์‹ค์ œ)']
)
plt.title("Test Confusion Matrix")
plt.show()

์ƒ๊ด€๊ด€๊ณ„ ํ™•์ธ

  • ๋ฐ์ดํ„ฐ์˜ ํ’ˆ์งˆ์ด๋‚˜ ํŠน์„ฑ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋ณผ ๋•Œ ์œ ์šฉํ•˜๋‹ค.
# df.corr(): ์ปฌ๋Ÿผ ๊ฐ„์˜ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๋ฐ˜ํ™˜ (-1 ~ 1)
print(df.corr())
  • ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ ๋‚ฎ๊ฒŒ ๋‚˜์˜จ๋‹ค๋ฉด?
    โ†’\rightarrow ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ๊ฐ€ ๋” ํ•„์š”ํ•˜๊ฑฐ๋‚˜, ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ์ด ๋‚ฎ์•„ ๋ชจ๋ธ ์„ฑ๋Šฅ์„ ์˜ฌ๋ฆฌ๊ธฐ ์–ด๋ ค์šธ ์ˆ˜ ์žˆ์Œ์„ ์‹œ์‚ฌ.

์š”์•ฝ

  1. Train-Test Split: ๋ฐ์ดํ„ฐ๋ฅผ ์ชผ๊ฐœ์„œ ๊ณผ์ ํ•ฉ์„ ๋ฐฉ์ง€ํ•˜๊ณ  ์„ฑ๋Šฅ์„ ๊ฐ๊ด€์ ์œผ๋กœ ํ‰๊ฐ€ํ•œ๋‹ค.
  2. Stratify: ๋ถ„๋ฅ˜ ๋ฌธ์ œ์—์„œ๋Š” ์ •๋‹ต ๋น„์œจ์„ ๋งž์ถฐ์„œ ๋‚˜๋ˆ„๋Š” ๊ฒƒ์ด ํ•„์ˆ˜๋‹ค.
  3. Confusion Matrix: ์ •ํ™•๋„ ๋„ˆ๋จธ, ๋ชจ๋ธ์ด ์–ด๋–ค ์œ ํ˜•์˜ ์˜ค๋ฅ˜๋ฅผ ๋ฒ”ํ•˜๋Š”์ง€(FP, FN) ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋‹ค.
  4. Heatmap: ํ˜ผ๋™ ํ–‰๋ ฌ์„ ์‹œ๊ฐํ™”ํ•˜์—ฌ ๋ถ„์„ํ•˜๊ธฐ ์ข‹๊ฒŒ ๋งŒ๋“ ๋‹ค.