💠 AIchemist 3rd Session | Classification (1) + Kaggle Transcription Practice

yellowsubmarine372 · October 2, 2023

AIchemist


Classification 📊

Classification trains a machine learning algorithm on the features and label values (decision values, class values) of the given training data to build a model.

Classification machine learning algorithms

  • ๋ฒ ์ด์ฆˆ ํ†ต๊ณ„์™€ ์ƒ์„ฑ ๋ชจ๋ธ์— ๊ธฐ๋ฐ˜ํ•œ ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ
  • ๋…๋ฆฝ๋ณ€์ˆ˜์™€ ์ข…์†๋ณ€์ˆ˜์˜ ์„ ํ˜• ๊ด€๊ณ„์„ฑ์— ๊ธฐ๋ฐ˜ํ•œ ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€
  • ๋ฐ์ดํ„ฐ ๊ท ์ผ๋„์— ๋”ฐ๋ฅธ ๊ทœ์น™ ๊ธฐ๋ฐ˜์˜ ๊ฒฐ์ • ํŠธ๋ฆฌ
  • ๊ฐœ๋ณ„ ํด๋ž˜์Šค ๊ฐ„์˜ ์ตœ๋Œ€ ๋ถ„๋ฅ˜ ๋งˆ์ง„์„ ํšจ๊ณผ์ ์œผ๋กœ ์ฐพ์•„์ฃผ๋Š” ์„œํฌํŠธ ๋ฒกํ„ฐ ๋จธ์‹ 
  • ๊ทผ์ ‘ ๊ฑฐ๋ฆฌ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ํ•˜๋Š” ์ตœ์†Œ ๊ทผ์ ‘ ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ์‹ฌ์ธต ์—ฐ๊ฒฐ ๊ธฐ๋ฐ˜์˜ ์‹ ๊ฒฝ๋ง
  • ์„œ๋กœ ๋‹ค๋ฅธ ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฒฐํ•ฉํ•œ ์•™์ƒ๋ธ”

์•™์ƒ๋ธ”์€ ๋งค์šฐ ๋งŽ์€ ์—ฌ๋Ÿฌ๊ฐœ์˜ ์•ฝํ•œ ํ•™์Šต๊ธฐ๋ฅผ ๊ฒฐํ•ฉํ•ด ํ™•๋ฅ ์  ๋ณด์™„๊ณผ ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•œ ๋ถ€๋ถ„์— ๋Œ€ํ•œ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณ„์† ์—…๋ฐ์ดํŠธ ํ•˜๋ฉด์„œ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š”๋ฐ, ๊ฒฐ์ •ํŠธ๋ฆฌ๊ฐ€ ์ข‹์€ ์•ฝํ•œ ํ•™์Šต๊ธฐ๋กœ ์‚ฌ์šฉ๋จ.

01. Decision Tree

A decision tree automatically discovers the rules present in the data through learning and builds tree-based classification rules from them.

A rule node holds a rule condition, and a leaf node holds the decided class value.

Decision nodes create rule conditions so that the data sets with the highest information uniformity are selected first:
1. Create sub data sets.
2. In each sub data set, split off the child data set with the highest uniformity, repeating the process down through the child trees.

  • Information uniformity

1) Information gain using entropy
Information gain index = 1 - entropy index
(the more different values are mixed together, the higher the entropy)

2) Gini coefficient (an inequality index)
0 is perfectly equal; values closer to 1 are more unequal.
Split on the attribute with the lowest Gini coefficient so that data uniformity stays high.
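
As a quick illustrative sketch (not from the original session notes), both measures can be computed directly from the class proportions of a label set:

import numpy as np

def entropy(labels):
    # Entropy = -sum(p * log2(p)) over class proportions; 0 for a pure set, higher the more mixed it is
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity = 1 - sum(p^2); 0 for a pure set, rising toward 1 as values mix
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(entropy([0, 0, 1, 1]), gini([0, 0, 1, 1]))  # mixed set: 1.0 and 0.5
print(entropy([0, 0, 0, 0]), gini([0, 0, 0, 0]))  # pure set: both zero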

  • DecisionTreeClassifier
  1. The decision tree rules are explicit.
  2. You can see how the rule nodes and leaf nodes are created.
  3. It can be represented visually.

However, accuracy drops because of overfitting ❗

์ฐจ๋ผ๋ฆฌ ๋ชจ๋“  ๋ฐ์ดํ„ฐ ์ƒํ™ฉ์„ ๋งŒ์กฑํ•˜๋Š” ์™„๋ฒฝํ•œ ๊ทœ์น™์€ ๋งŒ๋“ค ์ˆ˜ ์—†๋‹ค๊ณ  ๋จผ์ € ์ธ์ •ํ•˜๋Š” ํŽธ์ด ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์žฅโœจ

  • Decision tree visualization

Uses the Graphviz package
→ export_graphviz()

  • Training a DecisionTreeClassifier on the iris dataset
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Create the DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(random_state=156)

# Load the iris data and split it into train and test sets
iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, test_size=0.2, random_state=11)

# Train the DecisionTreeClassifier
dt_clf.fit(X_train, y_train)

from sklearn.tree import export_graphviz

# export_graphviz() writes the tree out to the file given as out_file (tree.dot)
export_graphviz(dt_clf, out_file="tree.dot", class_names=iris_data.target_names, feature_names=iris_data.feature_names, impurity=True, filled=True)

import graphviz
# Read the tree.dot file generated above and render it with Graphviz in the Jupyter notebook
with open("tree.dot") as f:
    dot_graph = f.read()

graphviz.Source(dot_graph)


The darker a node's color in the graph, the lower its Gini index and the more sample data belongs to that label.


์ž์‹๋…ธ๋“œ๊ฐ€ ์žˆ๋Š” ๋…ธ๋“œ ๋ธŒ๋žœ์น˜ ๋…ธ๋“œ, ๋ง๋‹จ ๋ฆฌํ”„ ๋…ธ๋“œ

When the rule petal_length <= 2.45 branches into True or False, nodes 2 and 3 are created.


➡ If the tree is not controlled in advance, it keeps creating tree nodes to separate the class values perfectly, which causes overfitting.

➡ Hyperparameters: max_depth, min_samples_leaf (split only while each child node keeps at least the minimum number of samples)

For a decision tree, which attribute gets selected as the rule condition, based on uniformity, is the key requirement.

  • Extracting per-feature importances
import seaborn as sns
import numpy as np
%matplotlib inline

# Extract the feature importances
print("Feature importance:\n{0}".format(np.round(dt_clf.feature_importances_, 3)))

# Map each feature name to its importance
for name, value in zip(iris_data.feature_names, dt_clf.feature_importances_):
    print('{0} : {1:.3f}'.format(name, value))

# Visualize the feature importances per column
sns.barplot(x=dt_clf.feature_importances_, y=iris_data.feature_names)

  • Decision tree overfitting

Create a dataset in which 2 features take on 3 types of class values.

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
%matplotlib inline

plt.title("3 Class values with 2 Features Sample data creation")

# For 2-D visualization, generate classification samples with 2 features and 3 classes
X_features, y_labels = make_classification(n_features=2, n_redundant=0, n_informative=2, n_classes=3, n_clusters_per_class=1, random_state=0)

# Scatter-plot the two features as 2-D coordinates; each class value gets a different color
plt.scatter(X_features[:,0], X_features[:,1], marker='o', c=y_labels, s=25, edgecolor='k')

The x and y axes are the two X_features; the three colors represent the three class values in y_labels.


[Decision tree with no constraints on tree creation]

The decision boundaries become very numerous, and prediction accuracy drops.

[min_samples_leaf=6]

# Visualize the decision boundary with tree growth constrained by min_samples_leaf=6
dt_clf = DecisionTreeClassifier(min_samples_leaf=6, random_state=156).fit(X_features, y_labels)
visualize_boundary(dt_clf, X_features, y_labels)
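
Note that visualize_boundary() is a helper that isn't defined anywhere in this post. A minimal sketch consistent with how it's called here (a fitted model plus 2-D features and labels), to be defined before running the cell above, could be:

import numpy as np
import matplotlib.pyplot as plt

def visualize_boundary(model, X, y):
    # Scatter the samples, then shade the model's predicted class regions over a mesh grid
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], c=y, s=25, cmap='rainbow', edgecolor='k')
    xlim, ylim = ax.get_xlim(), ax.get_ylim()
    xx, yy = np.meshgrid(np.linspace(*xlim, 200), np.linspace(*ylim, 200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='rainbow', zorder=1)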

02. [Hands-on] Human Activity Recognition Dataset

import pandas as pd

def get_human_dataset():

    # Each data file is whitespace-separated, so pass a whitespace regex as sep to read_csv.
    feature_name_df = pd.read_csv('./human_activity/features.txt', sep=r'\s+',
                        header=None, names=['column_index', 'column_name'])

    # Build a new feature-name DataFrame with get_new_feature_name_df(), which fixes duplicated feature names.
    new_feature_name_df = get_new_feature_name_df(feature_name_df)

    # Convert back to a list object so the feature names can be assigned as DataFrame columns.
    feature_name = new_feature_name_df.iloc[:, 1].values.tolist()

    # Load the train and test feature datasets into DataFrames, applying feature_name as the column names.
    X_train = pd.read_csv('./human_activity/train/X_train.txt', sep=r'\s+', names=feature_name)
    X_test = pd.read_csv('./human_activity/test/X_test.txt', sep=r'\s+', names=feature_name)

    # Load the train and test label data into DataFrames, naming the column 'action'.
    y_train = pd.read_csv('./human_activity/train/y_train.txt', sep=r'\s+', header=None, names=['action'])
    y_test = pd.read_csv('./human_activity/test/y_test.txt', sep=r'\s+', header=None, names=['action'])

    # Return all the loaded train/test DataFrames.
    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = get_human_dataset()
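
get_new_feature_name_df() is called inside get_human_dataset() but never defined in this post. A minimal sketch that matches the comment above (renaming duplicated feature names by appending _1, _2, ...), to be defined before get_human_dataset() is called, might be:

def get_new_feature_name_df(old_feature_name_df):
    # dup_cnt counts prior occurrences of each column_name (0 for the first occurrence)
    feature_dup_df = pd.DataFrame(data=old_feature_name_df.groupby('column_name').cumcount(), columns=['dup_cnt'])
    feature_dup_df = feature_dup_df.reset_index()
    new_feature_name_df = pd.merge(old_feature_name_df.reset_index(), feature_dup_df, how='outer')
    # Append _1, _2, ... to the second and later occurrences of a duplicated name
    new_feature_name_df['column_name'] = new_feature_name_df[['column_name', 'dup_cnt']].apply(
        lambda row: row['column_name'] + '_' + str(row['dup_cnt']) if row['dup_cnt'] > 0 else row['column_name'], axis=1)
    return new_feature_name_df.drop(['index'], axis=1)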

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Set random_state so every rerun of the example produces the same predictions
dt_clf = DecisionTreeClassifier(random_state=156)
dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print('Decision tree prediction accuracy: {0:.4f}'.format(accuracy))


# Extract the DecisionTreeClassifier's hyperparameters
print('DecisionTreeClassifier default hyperparameters:\n', dt_clf.get_params())
  • Tree depth hyperparameters

Tuned with GridSearchCV

from sklearn.model_selection import GridSearchCV

params = {
    'max_depth' : [6, 8, 10, 12, 16, 20, 24],
    'min_samples_split': [16],
}

# cv=5 and verbose=1 match the "Fitting 5 folds ..." log shown in the results below
grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1)
grid_cv.fit(X_train, y_train)
print('GridSearchCV best average accuracy: {0:.4f}'.format(grid_cv.best_score_))
print('GridSearchCV best hyperparameters:', grid_cv.best_params_)

max_depths = [6, 8, 10, 12, 16, 20, 24]
# Vary max_depth and measure test-set prediction accuracy for each value
for depth in max_depths:
    dt_clf = DecisionTreeClassifier(max_depth=depth, min_samples_split=16, random_state=156)
    dt_clf.fit(X_train, y_train)
    pred = dt_clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    print('max_depth = {0} accuracy: {1:.4f}'.format(depth, accuracy))
[Output]

max_depth = 6 accuracy: 0.8551
max_depth = 8 accuracy: 0.8717
max_depth = 10 accuracy: 0.8599
max_depth = 12 accuracy: 0.8571
max_depth = 16 accuracy: 0.8599
max_depth = 20 accuracy: 0.8565
max_depth = 24 accuracy: 0.8565
  • Final results
Fitting 5 folds for each of 8 candidates, totalling 40 fits
GridSearchCV best average accuracy: 0.8549
GridSearchCV best hyperparameters: {'max_depth': 8, 'min_samples_split': 16}

Decision tree prediction accuracy: 0.8717
  • Feature importance
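
This heading presumably accompanied a feature-importance chart; a sketch of how such a chart can be produced from the tuned model (reusing grid_cv.best_estimator_ from above) would be:

import matplotlib.pyplot as plt
import seaborn as sns

# Take the tuned tree from the grid search and rank its feature importances
best_df_clf = grid_cv.best_estimator_
ftr_top20 = pd.Series(best_df_clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)[:20]

# Keep only the 20 largest importances for a readable bar chart
plt.figure(figsize=(8, 6))
plt.title('Feature importances Top 20')
sns.barplot(x=ftr_top20, y=ftr_top20.index)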

03. Ensemble Learning

Ensemble learning is a technique that creates multiple classifiers and combines their predictions to derive an accurate final prediction.
By combining the prediction results of diverse classifiers, it yields predictions more reliable than those of a single classifier.

์•™์ƒ๋ธ” ์•Œ๊ณ ๋ฆฌ์ฆ˜

  • ๋ณดํŒ…
  • ๋ฐฐ๊น… ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ
  • ๋ถ€์ŠคํŒ… ๊ทธ๋ž˜๋””์–ธํŠธ ๋ถ€์ŠคํŒ…, XGboost, LightGBM
  • ๋ณดํŒ…๊ณผ ๋ฐฐ๊น…

Multiple classifiers decide the final result through a vote.

Voting: generally combines different algorithms.
Bagging: every classifier is based on the same type of algorithm, but each one trains on differently sampled data and then votes.

→ Sampling and extracting data for each individual classifier this way is called bootstrapping ↔ unlike cross-validation, which does not allow overlap between splits, bagging allows overlapping samples.
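
A toy illustration (not from the session notes) of why bootstrapped subsets overlap: each subset is drawn with replacement from the same rows, so rows repeat within and across subsets.

import numpy as np

rng = np.random.default_rng(0)
row_ids = np.arange(10)  # stand-ins for 10 training-row indices

# Each bagging subset samples with replacement, so duplicates and overlap are allowed
for i in range(3):
    print('bootstrap sample {0}:'.format(i), np.sort(rng.choice(row_ids, size=10, replace=True)))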

  • Boosting

Classifiers learn sequentially: the data that the earlier classifier mispredicted is given extra weight for the next classifier, so that it learns to predict those cases correctly.


  • Voting types: Hard Voting vs. Soft Voting

1. Hard voting
The prediction chosen by the majority of classifiers is selected as the final voted value.

2. Soft voting
Sum the class probabilities from every classifier, average them, and select the label with the highest average probability as the final voted result.

✨ Soft voting generally gives better prediction performance than hard voting, so it is used more often.
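
A tiny numeric sketch (illustrative numbers, not from the session) of the difference between the two:

import numpy as np

# [P(class 0), P(class 1)] for one sample, from two classifiers
proba_a = np.array([0.1, 0.9])  # classifier A's hard vote would be class 1
proba_b = np.array([0.6, 0.4])  # classifier B's hard vote would be class 0

# Hard voting sees one vote per class here and has to tie-break;
# soft voting averages the probabilities and picks the most probable class
avg = (proba_a + proba_b) / 2                    # [0.35, 0.65]
print('soft voting picks class', avg.argmax())   # -> 1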

  • Voting classifier (VotingClassifier)

[Hands-on] Prediction analysis on the Wisconsin breast cancer dataset

import pandas as pd

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

cancer = load_breast_cancer()

data_df = pd.DataFrame(cancer.data, columns = cancer.feature_names)
data_df.head(3)

# ๊ฐœ๋ณ„ ๋ชจ๋ธ์€ ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์™€ KNN์ž„
lr_clf = LogisticRegression(solver="liblinear")
knn_clf = KNeighborsClassifier(n_neighbors = 8)

# ๊ฐœ๋ณ„ ๋ชจ๋ธ์„ ์†Œํ”„ํŠธ ๋ณดํŒ… ๊ธฐ๋ฐ˜์˜ ์•™์ƒ๋ธ” ๋ชจ๋ธ๋กœ ๊ตฌํ˜„ํ•œ ๋ถ„๋ฅ˜๊ธฐ
vo_clf = VotingClassifier(estimators=[("LR", lr_clf), ("KNN", knn_clf)], voting='soft')

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=156)

              
#VotingClassifier ํ•™์Šต/์˜ˆ์ธก/ํ‰๊ฐ€ 
vo_clf.fit(X_train, y_train)
pred = vo_clf.predict(X_test)
print("Voting ๋ถ„๋ฅ˜๊ธฐ ์ •ํ™•๋„ : {0:.4f}".format(accuracy_score(y_test, pred)))

# ๊ฐœ๋ณ„ ๋ชจ๋ธ์˜ ํ•™์Šต/ ์˜ˆ์ธก/ ํ‰๊ฐ€
classifiers = [lr_clf, knn_clf]
for classifier in classifiers:
    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test)
    class_name = classifier.__class__.__name__
    print("{0} ์ •ํ™•๋„ : {1: .4f}".format(class_name, accuracy_score(y_test, pred)))
    
# ๊ฐœ๋ณ„ ๋ชจ๋ธ์˜ ํ•™์Šต/ ์˜ˆ์ธก/ ํ‰๊ฐ€ 
classifiers = [lr_clf, knn_clf]
for classifier in classifiers:
    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test)
    class_name = classifier.__class__.__name__
    print('{0} ์ •ํ™•๋„: {1:.4f}'.format(class_name, accuracy_score(y_test, pred)))
[Output]

LogisticRegression accuracy: 0.9474
KNeighborsClassifier accuracy: 0.9386

Combining several base classifiers with voting does not automatically improve prediction performance over the base classifiers.

Since an ML model's performance is validated against diverse test data sets, how flexibly it can cope with real-world data becomes an important evaluation criterion for the model.


04. Random Forest

The base algorithm of a random forest is the decision tree.

In a random forest, multiple decision tree classifiers each sample their own data from the full dataset in a bagging manner and train individually; finally, all the classifiers make the prediction decision through voting.

  • Bootstrapping

Splitting out multiple data sets so that they overlap is called bootstrapping.

Applying a decision tree classifier to each of these overlapping individual data sets is what makes a random forest.

  • [Hands-on] RandomForestClassifier on the human activity recognition dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Reuse get_human_dataset() from the decision tree section to get the train/test DataFrames
X_train, X_test, y_train, y_test = get_human_dataset()

# Train the random forest and evaluate prediction performance on the separate test set
rf_clf = RandomForestClassifier(random_state=0, max_depth=8)
rf_clf.fit(X_train, y_train)
pred = rf_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print('Random forest accuracy: {0:.4f}'.format(accuracy))

from sklearn.model_selection import GridSearchCV

params = {
    'max_depth': [8, 16, 24],
    'min_samples_leaf': [1, 6, 12],
    'min_samples_split': [2, 8, 16]
}

# Create a RandomForestClassifier, then run GridSearchCV
rf_clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
grid_cv = GridSearchCV(rf_clf, param_grid=params, cv=2, n_jobs=-1)
grid_cv.fit(X_train, y_train)

print('Best hyperparameters:\n', grid_cv.best_params_)
print('Best prediction accuracy: {0:.4f}'.format(grid_cv.best_score_))

[Output]

Best hyperparameters:
 {'max_depth': 16, 'min_samples_leaf': 6, 'min_samples_split': 2}
Best prediction accuracy: 0.9165
  • Feature importance

05. Kaggle Transcription Practice

  • First Kaggle submission


Titanic: Machine Learning from Disaster competition


  • Submission

test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

# One-hot encode the selected features for both the train and test sets
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Train a random forest and predict survival for the test passengers
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# Write the submission file in the format Kaggle expects
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

My first Kaggle submission
