Cross-Validation, Hyperparameters, and Handling Imbalanced Data

림이 · May 9, 2024


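Cross-Validation

A technique that splits the training data into several folds and validates across them in turn, so that generalization performance is estimated more reliably during training.
Within the train set, the data is divided into training data and validation data, and training is repeated while rotating which part serves as the validation set.
- K-Fold Cross Validation: splits the training data into K folds and trains K times, using a different fold as validation data each time (k=5 means 5 training runs).
- Stratified K-Fold Cross Validation: also splits the training data into K folds, but keeps the class proportions of the target variable roughly equal in every fold (for classification).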
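The difference between the two splitters is easiest to see on a small imbalanced target. The sketch below uses a hypothetical toy target (90 negatives followed by 10 positives, not the data from this post) and counts how many positives land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy target: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself


def positives_per_fold(splitter):
    """Count the positive samples in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]


kfold_counts = positives_per_fold(KFold(n_splits=5))
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", kfold_counts)
print("StratifiedKFold:", stratified_counts)
```

Because the toy target is sorted, plain KFold without shuffling puts all positives into a single fold, while StratifiedKFold spreads them evenly. For classifiers, GridSearchCV with an integer `cv` uses the stratified variant by default.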

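Hyper Parameter Tuning

Hyper Parameter: a setting of the model structure or of the learning algorithm that is fixed before training rather than learned from the data.
Hyper Parameter Tuning: the user adjusts these settings ahead of training.
It can help prevent overfitting and improve both training and generalization performance.
(You need to know which hyperparameters the algorithm exposes.)
- Random Search: trains on combinations of parameter values sampled at random.
- Grid Search: trains on every combination of the candidate values the user specifies for each parameter.
- Bayesian Optimization: models the relationship between parameters and score with a surrogate such as a Gaussian Process, then picks the next parameters to try by balancing a high predicted score against high uncertainty.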
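As a rough illustration of how grid search and random search differ in the candidates they try, here is a minimal standard-library sketch. The budget `n_iter` and the fixed seed are assumptions for illustration. Sampling from the grid without replacement is a simplification: real random search often draws from continuous distributions.

```python
import itertools
import random

# Candidate values (the same ranges the tuning step later in this post uses)
param_candidates = {
    "max_depth": list(range(5, 11)),
    "min_samples_split": list(range(5, 11)),
}
names = list(param_candidates)

# Grid search: every combination of the candidate values
grid = [dict(zip(names, values))
        for values in itertools.product(*param_candidates.values())]

# Random search: a fixed budget of randomly chosen combinations
rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
n_iter = 10             # hypothetical budget
random_candidates = rng.sample(grid, n_iter)

print(len(grid), "grid combinations")
print(len(random_candidates), "random combinations")
```

Grid search cost grows multiplicatively with every added parameter, whereas random search keeps the cost at whatever budget is chosen.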

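[Example]: Build a model that classifies '목표달성여부' (whether the target was achieved)!

- X: '소요분' / '방송구분' / '판매단가' / 'ARS금액' / '수수료율' / '방송요일' / '방송월'
- Y: '목표달성여부'
- Training : validation = 8 : 2
- Feature engineering (missing-value imputation (mean, most frequent) + scaling & encoding)
- Algorithm (Decision Tree / no hyperparameter tuning)
- Evaluation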

✔︎ Set X and Y

Y = df1['목표달성여부']
X = df1[['소요분', '방송구분', '판매단가', 'ARS금액', '수수료율', '방송요일', '방송월']]

✔︎ Import the libraries needed for training

# Split into training and validation data
from sklearn.model_selection import train_test_split
# Build a pipeline that runs feature engineering and training together
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
# Feature engineering
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
# Algorithm
from sklearn.tree import DecisionTreeClassifier
# Cross-validation + hyperparameter tuning
from sklearn.model_selection import GridSearchCV
# Evaluation
from sklearn.metrics import classification_report

✔︎ Split into training and validation data

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.2,
                                                   random_state=1234)

✔︎ Separate continuous and categorical variables

numeric_list = X.describe().columns # columns with numeric data
category_list = X.describe(include = 'object').columns # columns with string data

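✔︎ Build a pipeline for each data type

handle_unknown = 'ignore'
: when a category that was not in the training data appears in the validation data or in new data, it is encoded as all zeros instead of raising an error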
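A small sketch of what this option does in practice. The category values "A", "B" and "C" are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category values, used only for illustration
train_values = np.array([["A"], ["B"], ["A"]])
new_values = np.array([["B"], ["C"]])  # "C" never appeared during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_values)

encoded = encoder.transform(new_values).toarray()
print(encoder.categories_[0])  # categories learned from the training values
print(encoded)                 # the row for "C" contains only zeros
```

With the default `handle_unknown='error'`, the same `transform` call would raise an error on "C" instead.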

numeric_pipe = make_pipeline(SimpleImputer(strategy = 'mean'), MinMaxScaler())
category_pipe = make_pipeline(SimpleImputer(strategy = 'most_frequent'),
                             OneHotEncoder(handle_unknown = 'ignore'))
                             
preprocess_pipe = make_column_transformer((numeric_pipe, numeric_list),(category_pipe, category_list))

✔︎ Feature engineering + training

model_pipe = make_pipeline(preprocess_pipe, DecisionTreeClassifier())
model_pipe.fit(X_train, Y_train)

✔︎ Hyperparameter tuning

max_depth: 6 candidate values / min_samples_split: 6 candidate values = 36 combinations
scoring: the metric GridSearchCV uses to decide which combination performs best
n_jobs = -1: distributes the training computation in parallel across all CPU cores of the machine

hyperparameter = {'decisiontreeclassifier__max_depth':range(5,11),
             'decisiontreeclassifier__min_samples_split':range(5,11)}

# 3 cross-validation folds x 36 hyperparameter combinations = 108 fits
# (plus one final refit of the best combination on the full training set)
grid_model = GridSearchCV(model_pipe, param_grid = hyperparameter,
                          cv=3, scoring = 'f1', n_jobs = -1)

grid_model.fit(X_train, Y_train)

best_model = grid_model.best_estimator_
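To check how many candidates a search like this actually evaluates and which one it picks, the fitted object can be inspected. The sketch below runs the same parameter grid on synthetic stand-in data from `make_classification` (an assumption; it is not the broadcast dataset), with a simplified pipeline named `demo_pipe` for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the broadcast dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

demo_pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0))
demo_grid = {
    "decisiontreeclassifier__max_depth": range(5, 11),
    "decisiontreeclassifier__min_samples_split": range(5, 11),
}

demo_search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=3, scoring="f1")
demo_search.fit(X_demo, y_demo)

n_combinations = len(demo_search.cv_results_["params"])
n_cv_fits = n_combinations * demo_search.n_splits_
print("combinations tried:", n_combinations)
print("cross-validation fits:", n_cv_fits)
print("best parameters:", demo_search.best_params_)
print("best mean CV f1:", round(demo_search.best_score_, 3))
```

`best_score_` is the mean cross-validated score of the winning combination, so it is usually a more honest estimate than the training score of `best_estimator_`.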

✔︎ Define an evaluation function

def eval_func1(model):
    # Predict on both splits to compare training and generalization performance
    Y_train_pred = model.predict(X_train)
    Y_test_pred = model.predict(X_test)
    print('Training performance')
    print(classification_report(Y_train, Y_train_pred))
    print('Generalization performance')
    print(classification_report(Y_test, Y_test_pred))

✔︎ Evaluate with the function

eval_func1(best_model)

>>> Training performance
              precision    recall  f1-score   support

           0       0.69      0.92      0.79     13042
           1       0.71      0.32      0.44      7984

    accuracy                           0.69     21026
   macro avg       0.70      0.62      0.61     21026
weighted avg       0.70      0.69      0.66     21026

Generalization performance
              precision    recall  f1-score   support

           0       0.68      0.90      0.77      3291
           1       0.62      0.29      0.39      1966

    accuracy                           0.67      5257
   macro avg       0.65      0.59      0.58      5257
weighted avg       0.66      0.67      0.63      5257

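Imbalanced Data Sampling

For classification models, rebalances the class ratio of imbalanced data before training.
Under Sampling: reduces the majority class to match the minority class before training.

✔︎ Load the data (disease data)

import pandas as pd

df2 = pd.read_csv('12_Data.csv')
df2.head(2) # Diagnosis: B = normal, M = cancer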

>>> 	Image ID	Diagnosis	Mean Radius	Mean Perimeter	Mean Area	Mean Texture	Mean Smoothness	Mean Compactness	Mean Concavity	Mean Concave Points	...	SE Radius	SE Perimeter	SE Area	SE Texture	SE Smoothness	SE Compactness	SE Concavity	SE Concave Points	SE Symmetry	SE Fractal Dim
0	842302	M	17.99	122.8	1001.0	10.38	0.12	0.27760	0.3001	0.1471	...	1.0950	8.589	153.40	0.9053	0.0064	0.0490	0.0537	0.0159	0.0300	0.0062
1	842517	M	20.57	132.9	1326.0	17.77	0.08	0.07864	0.0869	0.0702	...	0.5435	3.398	74.08	0.7339	0.0052	0.0131	0.0186	0.0134	0.0139	0.0035

✔︎ Check the ['Diagnosis'] column

df2['Diagnosis'].value_counts()

>>> Diagnosis
B    357
M    212
Name: count, dtype: int64

✔︎ Check how M/B changes with Mean Radius and Mean Concavity

import plotly.express as px

color_mapping = {'M': 'red', 'B':'blue'}
fig1 = px.scatter(df2, x='Mean Radius', y='Mean Concavity',
           color='Diagnosis', color_discrete_map=color_mapping)
fig1 # Source Data

Random UnderSampling: randomly removes majority-class samples to adjust the ratio

✔︎ Import the library

from imblearn.under_sampling import RandomUnderSampler

✔︎ Set X and Y

X= df2[['Mean Radius', 'Mean Concavity']]
Y= df2['Diagnosis']

✔︎ Random Undersampling

sample_model = RandomUnderSampler()
X_resamp, Y_resamp = sample_model.fit_resample(X,Y)

X_resamp['Target'] = Y_resamp

from plotly.subplots import make_subplots

fig2 = px.scatter(X_resamp, x='Mean Radius', y='Mean Concavity',
          color='Target', color_discrete_map = color_mapping)

✔︎ Create the subplots

figure = make_subplots(rows=1, cols=2, subplot_titles=('Source', 'Resample'))

✔︎ Add each plot to the subplots

def figure_func():
    color_mapping = {'M': 'red', 'B':'blue'}
    fig1 = px.scatter(df2, x='Mean Radius', y='Mean Concavity',
           color='Diagnosis', color_discrete_map=color_mapping)
    fig2 = px.scatter(X_resamp, x='Mean Radius', y='Mean Concavity',
          color='Target', color_discrete_map = color_mapping)
    figure = make_subplots(rows=1, cols=2, subplot_titles=('Source', 'Resample'))
    # Draw each plot in its own panel
    for trace in fig1.data:
        figure.add_trace(trace, row=1, col=1)
    for trace in fig2.data:
        figure.add_trace(trace, row=1, col=2)
    figure.show()
    
figure_func()
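The idea behind random undersampling can be sketched with NumPy alone. This is an illustrative re-implementation under simple assumptions, not the internals of `RandomUnderSampler`; the helper name `random_undersample` and the toy data are hypothetical.

```python
import numpy as np


def random_undersample(X, y, seed=0):
    """Randomly drop samples until every class is as small as the rarest one."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_keep, replace=False)
        for label in labels
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]


# Toy data with the same class counts as the Diagnosis column (357 B, 212 M)
y_toy = np.array(["B"] * 357 + ["M"] * 212)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_small, y_small = random_undersample(X_toy, y_toy)
n_b = int((y_small == "B").sum())
n_m = int((y_small == "M").sum())
print("B:", n_b, "M:", n_m)
```

Both classes end up with the minority count, at the cost of discarding majority samples that might have carried useful information.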

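Tomek Link Sampling: finds pairs of samples from different classes that are each other's nearest neighbor (Tomek links) and removes the majority-class sample of each pair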

from imblearn.under_sampling import TomekLinks

sample_model = TomekLinks()
X_resamp, Y_resamp = sample_model.fit_resample(X,Y)
X_resamp['Target'] = Y_resamp

Y_resamp.value_counts()
>>> Diagnosis
B    342
M    212
Name: count, dtype: int64
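A minimal NumPy sketch of Tomek link detection, assuming Euclidean distance and a brute-force distance matrix. The helper name `tomek_link_majority_indices` and the one-dimensional toy points are hypothetical, and this is not how `TomekLinks` is implemented internally.

```python
import numpy as np


def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority samples that are part of a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a point is never its own neighbor
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    to_remove = []
    for i, j in enumerate(nearest):
        is_mutual = nearest[j] == i  # i and j are each other's nearest neighbor
        if is_mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return to_remove


# Hypothetical 1-D toy data: only the B point at 2.0 touches an M point
X_toy = [[0.0], [0.5], [2.0], [2.1], [4.0]]
y_toy = ["B", "B", "B", "M", "M"]

removed = tomek_link_majority_indices(X_toy, y_toy, majority_label="B")
print("majority indices to remove:", removed)
```

Only samples sitting right on the class boundary are removed, which is why the B count in the output above drops only slightly and the classes stay imbalanced.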

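Edited Nearest Neighbours (ENN): looks at the k nearest neighbors of each majority-class sample and removes the sample when those neighbors do not agree with its class; it cleans up the class boundary rather than forcing equal class counts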

from imblearn.under_sampling import EditedNearestNeighbours

sample_model = EditedNearestNeighbours()
X_resamp, Y_resamp = sample_model.fit_resample(X,Y)
X_resamp['Target'] = Y_resamp

Y_resamp.value_counts()
>>> Diagnosis
B    294
M    212
Name: count, dtype: int64
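The ENN rule can be sketched in the same brute-force style. The sketch assumes the strict variant in which a majority sample is kept only if all of its k nearest neighbors share its class; the helper name `enn_majority_indices`, `k=3`, and the toy points are assumptions for illustration.

```python
import numpy as np


def enn_majority_indices(X, y, majority_label, k=3):
    """Return majority indices whose k nearest neighbors are not all the same class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors
    neighbours = np.argsort(dist, axis=1)[:, :k]

    to_remove = []
    for i in np.flatnonzero(y == majority_label):
        if not np.all(y[neighbours[i]] == y[i]):
            to_remove.append(int(i))
    return to_remove


# Hypothetical 1-D toy data: one B point sits inside the M cluster
X_toy = [[0.0], [0.1], [0.2], [0.3], [2.05], [2.0], [2.1], [2.2]]
y_toy = ["B", "B", "B", "B", "B", "M", "M", "M"]

removed = enn_majority_indices(X_toy, y_toy, majority_label="B")
print("majority indices to remove:", removed)
```

Because the rule depends on k neighbors rather than a single mutual pair, it tends to remove more samples than Tomek links, which matches the lower B count in the output above.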

Over Sampling: adds samples to the minority class to adjust the ratio
Random Oversampling: randomly duplicates existing minority-class samples (sampling with replacement)

from imblearn.over_sampling import RandomOverSampler

sample_model = RandomOverSampler()
X_resamp, Y_resamp = sample_model.fit_resample(X,Y)
X_resamp['Target'] = Y_resamp

Y_resamp.value_counts()
>>> Diagnosis
M    357
B    357
Name: count, dtype: int64
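Random oversampling can also be sketched with NumPy alone. The helper name `random_oversample` and the toy data are hypothetical; this illustrates the idea and is not the `RandomOverSampler` implementation.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    n_target = counts.max()
    extra = [
        rng.choice(np.flatnonzero(y == label), size=n_target - count, replace=True)
        for label, count in zip(labels, counts)
        if count < n_target
    ]
    idx = np.concatenate([np.arange(len(y)), *extra])
    return X[idx], y[idx]


# Toy data with the same class counts as the Diagnosis column (357 B, 212 M)
y_toy = np.array(["B"] * 357 + ["M"] * 212)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_big, y_big = random_oversample(X_toy, y_toy)
n_b = int((y_big == "B").sum())
n_m = int((y_big == "M").sum())
print("B:", n_b, "M:", n_m)
```

No new information is created: every added row is an exact copy of an existing minority row, which can encourage overfitting to those points.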

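ADASYN (Adaptive Synthetic Sampling): extends SMOTE by generating more synthetic samples around the minority points that are harder to learn (those with more majority-class neighbors), so the resulting class counts are only approximately equal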

from imblearn.over_sampling import ADASYN

sample_model = ADASYN()
X_resamp, Y_resamp = sample_model.fit_resample(X,Y)
X_resamp['Target'] = Y_resamp

Y_resamp.value_counts()
>>> Diagnosis
M    362
B    357
Name: count, dtype: int64

Combining Sampling : Under Sampling + Over Sampling

SMOTE + Tomek

from imblearn.combine import SMOTETomek

sample_model = SMOTETomek()
X_resamp, Y_resamp = sample_model.fit_resample(X,Y)
X_resamp['Target'] = Y_resamp

Y_resamp.value_counts()
>>> Diagnosis
M    342
B    342
Name: count, dtype: int64

Building a Pipeline with Imbalanced Data Sampling

✔︎ Load the data
- Orthopedic hospital / data on patients after disc surgery

df2 = pd.read_csv('15_Data.csv')

✔︎ Set X and Y

Y = df2['수술실패여부']
X = df2[['연령','체중','신장','수술기법','통증기간(월)','헤모글로빈수치']]

✔︎ Split into training and validation datasets

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,random_state=1234)

numeric_list = X.describe().columns
category_list = X.describe(include='object').columns

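# imblearn's make_pipeline accepts samplers such as SMOTE as pipeline steps
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE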

numeric_pipe = make_pipeline(SimpleImputer(strategy='mean'), MinMaxScaler())
category_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder())

✔︎ Build the pipeline

preprocess_pipe2 = make_column_transformer((numeric_pipe, numeric_list),
                       (category_pipe, category_list))

model_pipe2 = make_pipeline(preprocess_pipe2, SMOTE(), DecisionTreeClassifier())

✔︎ Training

grid_model = GridSearchCV(model_pipe2, param_grid=hyperparameter,
            cv=3, n_jobs=-1, scoring='f1')
grid_model.fit(X_train, Y_train)
best_model = grid_model.best_estimator_

✔︎ Evaluation