ํ˜ผ๊ณต ML+DL #12

myeongยท2022๋…„ 10์›” 3์ผ

ML+DL

๋ชฉ๋ก ๋ณด๊ธฐ
10/23

๐Ÿ“Œ ๊ฒ€์ฆ(Validation) Set

  • ํ›ˆ๋ จ Set 60% + ๊ฒ€์ฆ Set 20% || ํ…Œ์ŠคํŠธ Set 20%
  • ๋‘ ๋ฒˆ split
from sklearn.model_selection import train_test_split

# First split: 80% training / 20% test
train_input, test_input, train_target, test_target = train_test_split(
    data, target, test_size=0.2, random_state=42
)
# Second split: carve a validation set (20%) out of the training set
sub_input, val_input, sub_target, val_target = train_test_split(
    train_input, train_target, test_size=0.2, random_state=42
)

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(sub_input, sub_target)  # train on the sub-training set

print(dt.score(sub_input, sub_target))  # training accuracy
print(dt.score(val_input, val_target))  # validation accuracy

0.9971133028626413
0.864423076923077


๐Ÿ“ ๊ต์ฐจ ๊ฒ€์ฆ (only ML)

  • ๋ฐ์ดํ„ฐ๋ฅผ n๊ฐœ์˜ ํด๋“œ๋กœ ๋‚˜๋ˆ„๊ณ  ์ˆœ์„œ๋Œ€๋กœ ๊ฒ€์ฆ set๋กœ ์‚ฌ์šฉ
  • cross_validate() : ๊ธฐ๋ณธ 5๊ฐœ์˜ ํด๋“œ๋กœ ๋ถ„ํ•  -> 5๊ฐœ์˜ ๋ชจ๋ธ
from sklearn.model_selection import cross_validate

scores = cross_validate(dt, train_input, train_target)  # 5-fold CV by default
print(scores)

import numpy as np
print(np.mean(scores['test_score']))

{'fit_time': array([0.00821614, 0.00822687, 0.00734401, 0.00731444, 0.00711036]),
'score_time': array([0.00078368, 0.0007267 , 0.00067711, 0.00068378, 0.00078654]),
'test_score': array([0.86923077, 0.84615385, 0.87680462, 0.84889317, 0.83541867])}

0.855300214703487 // mean validation score (the key is named test_score, but it is the validation score)
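Note that `cross_validate` does not shuffle the data before splitting; to control the folds you can pass a splitter object via `cv`. A minimal sketch on synthetic data (the splitter settings here are illustrative, not from the book):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for train_input / train_target.
X, y = make_classification(n_samples=1000, random_state=42)

# 10-fold stratified splitter with shuffling, instead of the default 5 folds.
splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(DecisionTreeClassifier(random_state=42), X, y, cv=splitter)
print(np.mean(scores['test_score']))  # mean validation score over 10 folds
```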


๐Ÿ“ ๊ทธ๋ฆฌ๋“œ ์„œ์น˜

  • GridSearchCV() : runs hyperparameter search and cross-validation together

  • min_impurity_decrease : minimum impurity decrease between a parent node and its children required for a split; a node whose split would decrease impurity by less than this is not split
    (= information gain; the larger the decrease, the better the split)

  • 5-fold cross-validation × 5 candidate values = 25 models

  • n_jobs=-1 : use all CPU cores to run the fits in parallel

  • The best parameter combination is refit on the full training set and stored as a decision tree in the best_estimator_ attribute

from sklearn.model_selection import GridSearchCV

params = {'min_impurity_decrease': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}
# 5 candidate values x 5-fold CV = 25 model fits
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1)
gs.fit(train_input, train_target)

dt = gs.best_estimator_                   # best model, refit on the full training set
print(dt.score(train_input, train_target))
print(gs.best_params_)                    # best parameter combination
print(gs.cv_results_['mean_test_score'])  # mean CV score per candidate

0.9615162593804117
{'min_impurity_decrease': 0.0001} // best parameter
[0.86819297 0.86453617 0.86492226 0.86780891 0.86761605]
-> ๊ต์ฐจ๊ฒ€์ฆ ์ ์ˆ˜


๐Ÿ“ ๋žœ๋ค ์„œ์น˜

  • ๊ท ๋“ฑ ๋ถ„ํฌ ์ƒ˜ํ”Œ๋ง ํด๋ž˜์Šค : uniform(์‹ค์ˆ˜), randint(์ •์ˆ˜)
  • ๋„’์€ ๋ฒ”์œ„, ๋žœ๋ค ์ƒ˜ํ”Œ๋ง -> ๋ชจ๋ธ ํ›ˆ๋ จ ํšจ์œจ์ 
from scipy.stats import uniform, randint
rgen = randint(0, 10)
rgen.rvs(10)

array([1, 0, 4, 8, 6, 2, 2, 3, 6, 9]) // random integers drawn from [0, 10)
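`uniform` works the same way for real numbers, drawing from a continuous uniform distribution:

```python
from scipy.stats import uniform

# uniform(loc, scale) samples real numbers from [loc, loc + scale].
ugen = uniform(0, 1)
samples = ugen.rvs(10)  # 10 random floats in [0, 1]
print(samples)
```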

  • RandomizedSearchCV : samples parameter combinations from the given distributions instead of trying every combination
    n_iter : number of parameter combinations sampled
params = {'min_impurity_decrease': uniform(0.0001, 0.001),
          'max_depth': randint(20, 50),
          'min_samples_split': randint(2, 25),
          'min_samples_leaf': randint(1, 25),
          }

from sklearn.model_selection import RandomizedSearchCV

gs = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), params, 
                        n_iter=100, n_jobs=-1, random_state=42)
gs.fit(train_input, train_target)

print(gs.best_params_)
print(np.max(gs.cv_results_['mean_test_score']))

dt = gs.best_estimator_
print(dt.score(test_input, test_target))

{'max_depth': 39, 'min_impurity_decrease': 0.00034102546602601173,
'min_samples_leaf': 7, 'min_samples_split': 13}

0.8695428296438884
// mean_test_score์˜ ์ตœ๋Œ“๊ฐ’ = best_params_ ๋กœ ์ฐพ์€ ๊ฒ€์ฆ๊ฐ’

0.86 // test-set score with the best model
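One gotcha worth noting: scipy's `uniform(loc, scale)` covers the interval [loc, loc + scale], so `uniform(0.0001, 0.001)` in the params above actually draws values up to about 0.0011, not 0.001:

```python
from scipy.stats import uniform

# uniform(loc, scale) samples from [loc, loc + scale],
# so this distribution ranges over [0.0001, 0.0011].
dist = uniform(0.0001, 0.001)
samples = dist.rvs(10000, random_state=42)
print(samples.min(), samples.max())
```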



๐Ÿ”—ํ˜ผ๊ณต MLDL-12

0๊ฐœ์˜ ๋Œ“๊ธ€