[ML] GridSearchCV

박경국 · December 29, 2021

This post walks through using GridSearchCV to find the best hyperparameters.

1. First round of hyperparameter tuning

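# Assumption: OrdinalEncoder comes from category_encoders (the post does not show its imports).
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Encode categorical columns, impute missing values, then fit a random forest.
pipe = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=2)
)

# Keys follow make_pipeline's naming: <lowercased class name>__<parameter>.
# Note: 'auto' is the same as 'sqrt' for classifiers and was removed in scikit-learn 1.3.
dists = {'simpleimputer__strategy': ['mean', 'median'],
         'randomforestclassifier__n_estimators': [10, 50, 100],
         'randomforestclassifier__max_depth': [10, 50, 100],
         'randomforestclassifier__class_weight': ['balanced', None],
         'randomforestclassifier__max_features': ['auto', 'log2'],
         'randomforestclassifier__min_samples_leaf': [1, 3, 5]}

# Exhaustive search over every combination, scored by F1 with 3-fold cross-validation.
clf = GridSearchCV(
    pipe,
    param_grid=dists,
    cv=3,
    scoring='f1',
    verbose=1,
    n_jobs=-1
)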

The hyperparameter grid I searched is shown above.
I designed the search for the best hyperparameters as follows.

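Take n_estimators as an example: if the best value among [10, 50, 100] turns out to be 50, I narrow the range around 50 to [30, 50, 70]. If the best value lands on the upper bound of 100, I widen the range upward to [100, 150, 200].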
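The narrowing and widening rule can be written down as a small helper. This is only a sketch: `refine_grid` is a hypothetical function of mine, not part of scikit-learn, and the step sizes are chosen so that it reproduces the n_estimators example. The lower-edge branch is my own assumption, since the post does not cover that case.

```python
def refine_grid(values, best):
    """Return the next candidates around `best` (hypothetical helper)."""
    values = sorted(values)
    i = values.index(best)
    if len(values) > 1 and i == len(values) - 1:
        # Best value sits on the upper edge: widen upward using the last gap.
        step = best - values[i - 1]
        return [best, best + step, best + 2 * step]
    if i == 0:
        # Lower edge (assumption, not described in the post): probe just below and above.
        step = values[1] - best if len(values) > 1 else best
        return sorted({max(1, best // 2), best, best + max(1, step // 2)})
    # Interior best: narrow symmetrically by half of the smaller neighbor gap.
    step = max(1, min(best - values[i - 1], values[i + 1] - best) // 2)
    return [best - step, best, best + step]


print(refine_grid([10, 50, 100], 50))   # [30, 50, 70]
print(refine_grid([10, 50, 100], 100))  # [100, 150, 200]
```

The grids in the later rounds were picked by hand, so they do not always match what this helper would return.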

Results of the first round:

class_weight : 'balanced'
max_depth : 50
max_features : 'auto'
min_samples_leaf : 5
n_estimators : 100
simpleimputer__strategy : 'mean'
f1_score : 0.622

For reference, the model without any tuning has an f1_score of 0.542.
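A minimal sketch of how numbers like these can be obtained: the untuned score from `cross_val_score`, and the tuned score and parameters from `best_score_` and `best_params_` after fitting. The data here is synthetic (`make_classification`), the grid is much smaller than the one above, and the encoder is left out because the synthetic features are already numeric. All of these are assumptions for illustration, so the printed scores will not match the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic, imbalanced binary data standing in for the post's dataset.
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

pipe = make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=2))

# Baseline: default hyperparameters, same 3-fold CV and F1 scoring.
baseline = cross_val_score(pipe, X, y, cv=3, scoring="f1").mean()

# A small grid that still contains the default configuration.
param_grid = {
    "randomforestclassifier__n_estimators": [50, 100],
    "randomforestclassifier__min_samples_leaf": [1, 5],
    "randomforestclassifier__class_weight": [None, "balanced"],
}
clf = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="f1")
clf.fit(X, y)

print("baseline f1:", round(baseline, 3))
print("best f1    :", round(clf.best_score_, 3))
print("best params:", clf.best_params_)
```

Because the default configuration is one of the grid candidates and both calls use the same folds, the tuned score here cannot fall below the baseline.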

2. Second round of hyperparameter tuning

dists = {'randomforestclassifier__n_estimators': [100, 150, 200], 
         'randomforestclassifier__max_depth': [30, 50, 70], 
         'randomforestclassifier__class_weight': ['balanced'],
         'randomforestclassifier__min_samples_leaf' : [5, 10, 15]}

I fixed class_weight, which had been the issue, to 'balanced', and left max_features and the SimpleImputer strategy at their defaults.
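Fixing class_weight and dropping two keys also shrinks the search considerably. A quick way to see this is to multiply the number of values per key; the helper name `n_candidates` below is my own, and the two grids are copied from this post.

```python
from math import prod

first_round = {
    "simpleimputer__strategy": ["mean", "median"],
    "randomforestclassifier__n_estimators": [10, 50, 100],
    "randomforestclassifier__max_depth": [10, 50, 100],
    "randomforestclassifier__class_weight": ["balanced", None],
    "randomforestclassifier__max_features": ["auto", "log2"],
    "randomforestclassifier__min_samples_leaf": [1, 3, 5],
}
second_round = {
    "randomforestclassifier__n_estimators": [100, 150, 200],
    "randomforestclassifier__max_depth": [30, 50, 70],
    "randomforestclassifier__class_weight": ["balanced"],
    "randomforestclassifier__min_samples_leaf": [5, 10, 15],
}


def n_candidates(grid):
    """Number of combinations a grid search tries (hypothetical helper)."""
    return prod(len(values) for values in grid.values())


cv = 3  # same number of folds as in the GridSearchCV call
for name, grid in [("first", first_round), ("second", second_round)]:
    n = n_candidates(grid)
    print(f"{name} round: {n} candidates, {n * cv} fits")
```

With cv = 3 this comes to 216 candidates (648 fits) in the first round and 27 candidates (81 fits) in the second, not counting the final refit on the full training set.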

Results of the second round:

class_weight : 'balanced'
max_depth : 30
min_samples_leaf : 10
n_estimators : 150
f1_score : 0.624

max_depth performed better when lowered from 50 to 30, and min_samples_leaf looks like it should go a bit higher.
For n_estimators, I will look for the best value between 150 and 170. The f1 score rose by 0.002.

3. Third round of hyperparameter tuning

dists = {'randomforestclassifier__n_estimators': [150, 170, 180], 
         'randomforestclassifier__max_depth': [10, 20, 30], 
         'randomforestclassifier__class_weight': ['balanced'],
         'randomforestclassifier__min_samples_leaf' : [15, 30, 45]}

Results of the third round:

max_depth : 30
min_samples_leaf : 15
n_estimators: 170
f1_score : 0.620

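max_depth came out as 30 again, even when compared against lower values.
min_samples_leaf moved from 10 to 15. In the second round 10 beat 15, but this round's grid was [15, 30, 45] and did not include 10, so 15 only had to beat 30 and 45. The best value seems to lie between 10 and 15. A random forest is a randomized algorithm, but with random_state fixed the same grid gives the same result, so this shift comes from the changed grid rather than from run-to-run noise.
n_estimators came out as 170. The best value seems to lie between 150 and 170.
The f1_score dropped slightly.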
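One way to check how much of this variation comes from randomness is to run the same search twice with an identical grid and random_state. The sketch below uses synthetic data and a tiny grid, both my assumptions, and `run_search` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the post's dataset.
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)


def run_search():
    """Build and fit a fresh search with a fixed seed (hypothetical helper)."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=2, class_weight="balanced"),
        param_grid={"n_estimators": [20, 40], "min_samples_leaf": [10, 15]},
        cv=3,  # an integer cv uses unshuffled stratified folds for classifiers
        scoring="f1",
    )
    return search.fit(X, y)


first, second = run_search(), run_search()
print(first.best_params_ == second.best_params_)
print(np.allclose(first.cv_results_["mean_test_score"],
                  second.cv_results_["mean_test_score"]))
```

If both lines print True, repeated runs on the same grid are reproducible, and differences between rounds have to come from the grids themselves.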

Let me try one last time.

4. Fourth round of hyperparameter tuning

dists = {'randomforestclassifier__n_estimators': [150, 160, 170], 
         'randomforestclassifier__max_depth': [20, 30, 40], 
         'randomforestclassifier__class_weight': ['balanced'],
         'randomforestclassifier__min_samples_leaf' : [10, 15]}

Final results:

max_depth : 30
min_samples_leaf : 10
n_estimators: 150
f1_score : 0.624

Across the four rounds, each hyperparameter changed as follows.

class_weight : balanced
max_features : auto
simpleimputer__strategy : mean
max_depth : 50 -> 30 -> 30 -> 30
min_samples_leaf : 5 -> 10 -> 15 -> 10
n_estimators : 100 -> 150 -> 170 -> 150
f1_score : 0.622 -> 0.624 -> 0.620 -> 0.624

As you can see, the best hyperparameters are not fixed and move around within a certain range.
Judging by the f1 score, fine-tuning in steps of 1 or 10 does not seem to make much difference in performance.

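Still, a few trends emerged:

max_depth: lowering it from 50 to 30 raised the score (less overfitting), and going below 30 did not help.
min_samples_leaf: raising it from the initial value of 5 also raised the score, although performance dropped at 15 and above (underfitting).
n_estimators: the best value settled between 150 and 170, and 200 was never selected. Adding trees to a random forest usually makes the score level off rather than overfit, so the gap at 200 is probably small.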
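Since the scores barely moved between rounds, it helps to look beyond the single best candidate. `cv_results_` holds the mean and standard deviation of the test score for every combination, which shows whether a gap like 0.002 is larger than the fold-to-fold spread. The sketch below uses synthetic data and a small grid, both my assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the post's dataset.
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=2, class_weight="balanced"),
    param_grid={"n_estimators": [20, 40], "min_samples_leaf": [5, 10, 15]},
    cv=3,
    scoring="f1",
)
search.fit(X, y)

res = search.cv_results_
order = np.argsort(res["rank_test_score"])  # best-ranked candidates first

# Print the top three candidates with their fold-to-fold spread.
for i in order[:3]:
    mean = res["mean_test_score"][i]
    std = res["std_test_score"][i]
    print(res["params"][i], f"f1 = {mean:.3f} +/- {std:.3f}")
```

When the standard deviation is larger than the difference between the top candidates, those candidates are effectively tied, and picking the simpler or faster one is a reasonable choice.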
