๐Ÿน ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜

๋ฏผ๋‹ฌํŒฝ์ด์šฐ์œ ยท2024๋…„ 7์›” 15์ผ

๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ข…๋ฅ˜๋“ค์„ ์•Œ์•„๋ณด์ž.

๐Ÿ’ก 1. Splitting the Data

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

X = ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„.drop('์ข…์†๋ณ€์ˆ˜ ์—ด', axis=1)
y = ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„['์ข…์†๋ณ€์ˆ˜ ์—ด']

# ๋ชจ๋“ˆ ์ž„ํฌํŠธ
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # test_size and random_state are example values

๐Ÿ’ก 2. Linear Regression

๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์žฅ ์ž˜ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋Š” ์ง์„ ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๋Š” ๋ฐฉ๋ฒ•

  • Simple linear regression (a single independent variable)
  • Multiple linear regression (multiple independent variables)
# ๋ชจ๋“ˆ ์ž„ํฌํŠธ
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train) # ํ•™์Šต
pred = lr.predict(X_test) # ์˜ˆ์ธก
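To see how well the fitted line actually explains the data, regression metrics such as MSE and R² can be computed on the predictions. A minimal sketch on a tiny synthetic dataset (the data and variable names are illustrative, not from the text above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# tiny synthetic dataset following y = 2x + 1 exactly
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])

lr = LinearRegression()
lr.fit(X, y)
pred = lr.predict(X)

print(lr.coef_, lr.intercept_)      # slope close to 2, intercept close to 1
print(mean_squared_error(y, pred))  # close to 0 for a perfect fit
print(r2_score(y, pred))            # close to 1 for a perfect fit
```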

๐Ÿ’ก 3. Decision Tree

  • ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ํŒจํ„ด์„ ํŒŒ์•…ํ•˜์—ฌ ๊ฒฐ์ • ๊ทœ์น™์„ ๋‚˜๋ฌด ๊ตฌ์กฐ๋กœ ๋‚˜ํƒ€๋‚ธ ๊ธฐ๊ณ„ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ๊ฐ„๋‹จํ•˜๊ณ  ๊ฐ•๋ ฅํ•œ ๋ชจ๋ธ ์ค‘ ํ•˜๋‚˜๋กœ, ๋ถ„๋ฅ˜์™€ ํšŒ๊ท€ ๋ฌธ์ œ์— ๋ชจ๋‘ ์‚ฌ์šฉ
  • ์—”ํŠธ๋กœํ”ผ: ๋ฐ์ดํ„ฐ์˜ ๋ถˆํ™•์‹ค์„ฑ์„ ์ธก์ •. ํŠน์ • ์†์„ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ„์—ˆ์„ ๋•Œ ์—”ํŠธ๋กœํ”ผ๊ฐ€ ์–ผ๋งˆ๋‚˜ ๊ฐ์†Œํ•˜๋Š”์ง€๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ์ •๋ณด๋ฅผ ์–ป์Œ. ์ •๋ณด ์ด๋“์ด ๋†’์€ ์†์„ฑ์„ ์„ ํƒํ•˜์—ฌ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ„๊ฒŒ ๋จ
  • ์ง€๋‹ˆ๊ณ„์ˆ˜: ๋ฐ์ดํ„ฐ์˜ ๋ถˆ์ˆœ๋„๋ฅผ ์ธก์ •ํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•. ์ž„์˜๋กœ ์„ ํƒ๋œ ๋‘ ๊ฐœ์˜ ์š”์†Œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ํด๋ž˜์Šค์— ์†ํ•  ํ™•๋ฅ ์„ ๋‚˜ํƒ€๋ƒ„. ์ง€๋‹ˆ ๋ถˆ์ˆœ๋„๊ฐ€ ๋‚ฎ์„์ˆ˜๋ก ๋ฐ์ดํ„ฐ๊ฐ€ ์ž˜ ๋ถ„๋ฆฌ๋œ ๊ฒƒ
  • ์˜์‚ฌ ๊ฒฐ์ • ๋‚˜๋ฌด๋Š” ์˜ค๋ฒ„ํ”ผํŒ…์ด ๋งค์šฐ ์ž˜ ์ผ์–ด๋‚จ
    • ์˜ค๋ฒ„ํ”ผํŒ…(๊ณผ์ ํ•ฉ): ํ•™์Šต๋ฐ์ดํ„ฐ์—์„œ๋Š” ์ •ํ™•ํ•˜๋‚˜ ํ…Œ์ŠคํŠธ๋ฐ์ดํ„ฐ์—์„œ๋Š” ์„ฑ๊ณผ๊ฐ€ ๋‚˜์œ ํ˜„์ƒ์„ ๋งํ•จ.
    • ์˜ค๋ฒ„ํ”ผํŒ…์„ ๋ฐฉ์ง€ํ•˜๋Š” ๋ฐฉ๋ฒ•
      • ์‚ฌ์ „ ๊ฐ€์ง€์น˜๊ธฐ: ๋‚˜๋ฌด๊ฐ€ ๋‹ค ์ž๋ผ๊ธฐ ์ „์— ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋ฉˆ์ถ”๋Š” ๋ฐฉ๋ฒ•
      • ์‚ฌํ›„ ๊ฐ€์ง€์น˜๊ธฐ: ๋‚˜๋ฌด๋ฅผ ๋๊นŒ์ง€ ๋‹ค ๋Œ๋ฆฐ ํ›„์— ๋ฐ‘์—์„œ๋ถ€ํ„ฐ ๊ฐ€์ง€๋ฅผ ์ณ ๋‚˜๊ฐ€๋Š” ๋ฐฉ๋ฒ•
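The two impurity measures above are simple to compute directly. A minimal sketch in plain NumPy (the label counts are made-up examples):

```python
import numpy as np

def entropy(counts):
    """Entropy of a label distribution: -sum(p * log2(p))."""
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]  # skip zero-probability classes to avoid log2(0)
    return -np.sum(p * np.log2(p))

def gini(counts):
    """Gini impurity: 1 - sum(p^2)."""
    p = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

# a 50/50 node is maximally impure; a pure node has impurity 0
print(entropy([5, 5]))  # 1.0
print(gini([5, 5]))     # 0.5
print(gini([10, 0]))    # 0.0
```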
# ๋ชจ๋“ˆ ์ž„ํฌํŠธ
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train) # ํ•™์Šต
pred = dtr.predict(X_test) # ์˜ˆ์ธก
sns.scatterplot(x=y_test, y=pred) # ์˜ˆ์ธก ์‹œ๊ฐํ™”

# ํŠธ๋ฆฌ ์‹œ๊ฐํ™” 
from sklearn.tree import plot_tree
plot.figure(dtr, max_depth=ํ™”๋ฉด์— ๋ณด์—ฌ์ค„ ๊นŠ์ด, font_size=ํฐํŠธ ์‚ฌ์ด์ฆˆ)
plt.show()
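Pre-pruning from the list above maps directly onto the estimator's constructor parameters. A sketch on synthetic data (the data and parameter values are arbitrary examples, not tuned settings):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data, for illustration only
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] + 0.1 * rng.rand(200)

# pre-pruning: stop the tree from growing fully to curb overfitting
dtr = DecisionTreeRegressor(
    max_depth=5,           # cap the depth of the tree
    min_samples_split=10,  # a node needs at least 10 samples to split
    min_samples_leaf=5,    # every leaf keeps at least 5 samples
    random_state=42,
)
dtr.fit(X, y)
print(dtr.get_depth())  # never exceeds max_depth
```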

๐Ÿ’ก 4. Logistic Regression

  • ๋‘˜ ์ค‘์˜ ํ•˜๋‚˜๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๋ฌธ์ œ(์ด์ง„ ๋ถ„๋ฅ˜)๋ฅผ ํ’€๊ธฐ ์œ„ํ•œ ๋Œ€ํ‘œ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ์ด์ง„ ๋ถ„๋ฅ˜์— ์ ํ•ฉํ•˜์ง€๋งŒ, ๋‹คํ•ญ ๋ถ„๋ฅ˜ ๋ฌธ์ œ์—๋„ ํ™•์žฅ๋  ์ˆ˜ ์žˆ์Œ
  • ์˜ˆ์ธก(x) ๋ถ„๋ฅ˜(o)
  • ์ข…์† ๋ณ€์ˆ˜ Y๋Š” ๋‘ ๊ฐ€์ง€ ๋ฒ”์ฃผ ์ค‘ ํ•˜๋‚˜๋ฅผ ๊ฐ€์ง(์˜ˆ: 0 ๋˜๋Š” 1)
  • ํŠน์ • ๋ฒ”์ฃผ์˜ ์†ํ•  ํ™•๋ฅ ์„ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ
  • ์ผ๋ฐ˜ํ™” ์„ ํ˜• ๋ชจ๋ธ์˜ ์ผ์ข…์œผ๋กœ, ๋…๋ฆฝ ๋ณ€์ˆ˜์˜ ์„ ํ˜• ์กฐํ•ฉ์„ ๋กœ์ง€์Šคํ‹ฑ ํ•จ์ˆ˜(์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ข…์† ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ํ™•๋ฅ  ์ ์ˆ˜๋กœ ๋ณ€ํ™˜ (0~1)
  • ํ™•๋ฅ ์— ๋”ฐ๋ผ 0๊ณผ 1๋กœ ๋ถ„๋ฅ˜ํ•˜๋Š”๋ฐ ์ž„๊ณ„๊ฐ’ ์„ค์ •์„ ํ†ตํ•ด 0๊ณผ 1๋กœ ๋‚˜๋ˆ„๋Š” ๊ธฐ์ค€์„ ์ •ํ•ด์ค„ ์ˆ˜ ์žˆ์Œ
# ๋ชจ๋“ˆ ์ž„ํฌํŠธ
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train) # ํ•™์Šต
pred = lr.predict(X_test) # ์˜ˆํŠน

# ์ •ํ™•๋„ ๊ตฌํ•˜๊ธฐ
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)
# ๋ฐ์ดํ„ฐ ์ ๋ฆผ ํ˜„์ƒ ๋“ฑ์˜ ์ด์œ ๋กœ accuracy_score๋งŒ์œผ๋กœ๋Š” ํ•™์Šต์ด ์ œ๋Œ€๋กœ ๋๋Š” ์ง€ ์•Œ ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์— ์ถ”๊ฐ€ ํ™•์ธ ํ•„์š”
๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„[์ข…์†๋ณ€์ˆ˜].value_counts()

๐Ÿ’ก 5. Random Forest

5-1. ์•™์ƒ๋ธ” ๋ชจ๋ธ

  • ์—ฌ๋Ÿฌ๊ฐœ์˜ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ์ด์šฉํ•ด ์ตœ์ ์˜ ๋‹ต์„ ์ฐพ์•„๋‚ด๋Š” ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ
  • ๋ณดํŒ…(Voting)
    • ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ model์„ ์กฐํ•ฉํ•ด์„œ ์‚ฌ์šฉ
    • ๋ชจ๋ธ์— ๋Œ€ํ•ด ํˆฌํ‘œ๋กœ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœ
  • ๋ฐฐ๊น…(Bagging)
    • ๊ฐ™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‚ด์—์„œ ๋‹ค๋ฅธ sample ์กฐํ•ฉ์„ ์‚ฌ์šฉ
    • ์ƒ˜ํ”Œ ์ค‘๋ณต ์ƒ์„ฑ์„ ํ†ตํ•ด ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœ
  • ๋ถ€์ŠคํŒ…(Boosting)
    • ์•ฝํ•œ ํ•™์Šต๊ธฐ๋“ค์„ ์ˆœ์ฐจ์ ์œผ๋กœ ํ•™์Šต์‹œ์ผœ ๊ฐ•๋ ฅํ•œ ํ•™์Šต๊ธฐ๋ฅผ ๋งŒ๋“ฆ
    • ์ด์ „ ์˜ค์ฐจ๋ฅผ ๋ณด์™„ํ•ด๊ฐ€๋ฉด์„œ ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌ
    • ์„ฑ๋Šฅ์ด ์šฐ์ˆ˜ํ•˜์ง€๋งŒ ์ž˜๋ชป๋œ ๋ ˆ์ด๋ธ”์ด๋‚˜ ์•„์›ƒ๋ผ์ด์–ด์— ๋Œ€ํ•ด ํ•„์š”์ด์ƒ์œผ๋กœ ๋ฏผ๊ฐ
    • AdaBoost, Gradient Boosting, XGBoost, LightGBM
  • ์Šคํƒœํ‚น(Stacking)
    • ๋‹ค์–‘ํ•œ ๊ฐœ๋ณ„ ๋ชจ๋ธ๋“ค์„ ์กฐํ•ฉํ•˜์—ฌ ์ƒˆ๋กœ์šด ๋ชจ๋ธ์„ ์ƒ์„ฑ
    • ๋‹ค์–‘ํ•œ ๋ชจ๋ธ๋“ค์„ ํ•™์Šต์‹œ์ผœ ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ์–ป์€ ๋‹ค์Œ, ๋‹ค์–‘ํ•œ ๋ชจ๋ธ๋“ค์˜ ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ์ƒˆ๋กœ์šด ๋ฉ”ํƒ€ ๋ชจ๋ธ์„ ํ•™์Šต

5-2. ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ(Random Forest)

  • ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ์•™์ƒ๋ธ” ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜์ด๋ฉฐ, ๊ฒฐ์ • ๋‚˜๋ฌด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•จ
  • ํ•™์Šต์„ ํ†ตํ•ด ๊ตฌ์„ฑํ•ด ๋†“์€ ๊ฒฐ์ • ๋‚˜๋ฌด๋กœ๋ถ€ํ„ฐ ๋ถ„๋ฅ˜ ๊ฒฐ๊ณผ๋ฅผ ์ทจํ•ฉํ•ด์„œ ๊ฒฐ๋ก ์„ ์–ป๋Š” ๋ฐฉ์‹
  • ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ์˜ ํŠธ๋ฆฌ๋Š” ์›๋ณธ ๋ฐ์ดํ„ฐ์—์„œ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒ๋œ ์ƒ˜ํ”Œ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šตํ•จ
  • ๊ฐ ํŠธ๋ฆฌ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต๋˜์–ด ๋‹ค์–‘ํ•œ ํŠธ๋ฆฌ๊ฐ€ ์ƒ์„ฑ๋˜๋ฉฐ ๋ชจ๋ธ์˜ ๋‹ค์–‘์„ฑ์ด ์ฆ๊ฐ€ํ•จ
  • ๊ฐ๊ฐ์˜ ํŠธ๋ฆฌ๊ฐ€ ์˜ˆ์ธกํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‹ค์ˆ˜๊ฒฐ ๋˜๋Š” ํ‰๊ท ์„ ์ด์šฉํ•˜์—ฌ ์ตœ์ข… ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•จ
  • ๋ถ„๋ฅ˜์™€ ํšŒ๊ท€ ๋ฌธ์ œ์— ๋ชจ๋‘ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ํŠนํžˆ ๋ฐ์ดํ„ฐ๊ฐ€ ๋งŽ๊ณ  ๋ณต์žกํ•œ ๊ฒฝ์šฐ์— ๋งค์šฐ ํšจ๊ณผ์ ์ธ ๋ชจ๋ธ
  • ์„ฑ๋Šฅ์€ ๊ฝค ์šฐ์ˆ˜ํ•œ ํŽธ์ด๋‚˜ ์˜ค๋ฒ„ํ”ผํŒ… ํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ์Œ
# ๋ชจ๋“ˆ ์ž„ํฌํŠธ
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=2024)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
proba = rf.predict_proba(X_test)

# predicted class probabilities for the first test sample
proba[0]
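Because of the overfitting tendency noted above, random forests are usually controlled through the number and depth of their trees. A sketch on synthetic data (the parameter values are arbitrary examples, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=2024)

rf = RandomForestClassifier(
    n_estimators=200,  # more trees -> a more stable averaged vote
    max_depth=6,       # cap each tree's depth to curb overfitting
    random_state=2024,
)
rf.fit(X, y)
proba = rf.predict_proba(X)

# each row holds one probability per class and sums to 1,
# because it averages the votes of all 200 trees
print(proba[0])
```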