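Summary and retrospective of the individual Machine Learning project I carried out on a free-choice topic in Section 2 of the Codestates AI Bootcamp. (3) Model interpretation, conclusion, and retrospective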
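Use a dataset that I chose myself on a free-choice topic
Derive and share performance results and insights through a machine learning prediction model
Carry out the whole process, from preprocessing/EDA through interpreting the model
Assume the audience works in a non-data role
Introduction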
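Hypertension
Diagnostic criteria for hypertension
Problem definition and goal
Selected dataset
Target and features
Data Preparation
Data preprocessing
Final data
Modeling
Data split
Baseline model and evaluation metrics
Modeling with default values
Model tuning
Final model and test
Interpretation
Feature importance
Permutation importance
Feature analysis
Conclusion
Summary
Conclusion and limitations
Retrospective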
import pandas as pd
from pandas_profiling import ProfileReport
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'Malgun Gothic'
plt.rcParams['axes.unicode_minus'] = False
import seaborn as sns
# Libraries for training the model and checking that the validation scores agree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, confusion_matrix
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
# Libraries for model interpretation
import eli5
from eli5.sklearn import PermutationImportance
from pdpbox.pdp import pdp_isolate, pdp_plot, pdp_interact, pdp_interact_plot
df0 = pd.read_csv('data/KNHANES_8th_final2.csv')
df1 = df0.iloc[:,2:]
df1.info()
#output
'''
RangeIndex: 11422 entries, 0 to 11421
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   sex            11422 non-null  int64
 1   age            11422 non-null  int64
 2   heavy_drink    11422 non-null  int64
 3   smoke          11422 non-null  int64
 4   genetic_hbp    11422 non-null  float64
 5   BMI            11422 non-null  float64
 6   diabetes       11422 non-null  int64
 7   hyper_chol     11422 non-null  int64
 8   triglycerides  11422 non-null  float64
 9   HBP_US         11422 non-null  int64
 10  HBP_EU         11422 non-null  int64
dtypes: float64(3), int64(8)
memory usage: 981.7 KB
'''
# Pairwise correlation between the features in the test set
corr_test = X_test.corr()
fig, ax = plt.subplots(figsize=(12,8))
# Mask the upper triangle (diagonal included) so each pair is drawn only once
mask = np.zeros_like(corr_test,dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr_test,
            cmap='coolwarm',
            annot=True,
            mask=mask,
            linewidths=.5,
            cbar_kws={'shrink':.5},
            vmin=-1,
            vmax=1)
plt.title('Correlation of features (X_test)')
plt.show()
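# Built-in importance of the tuned US-criteria model.
# Note: XGBClassifier.feature_importances_ follows the model's importance_type
# (gain-based by default for tree boosters), so "MDI" in the title is an approximate label.
importances_us = pd.DataFrame(data={'FeatureImportance(US)':xgb_us_tun.feature_importances_},index=X_train.columns)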
display(importances_us.sort_values(by='FeatureImportance(US)',ascending=False).style.background_gradient(cmap='Reds'))
importances_us.sort_values(by='FeatureImportance(US)').plot.barh(color='red',alpha=0.75)
plt.title('Feature Importance(MDI)\n[US criteria]')
plt.legend(loc='lower right')
plt.show()
importances_eu = pd.DataFrame(data={'FeatureImportance(EU)':xgb_eu_tun.feature_importances_},index=X_train.columns)
display(importances_eu.sort_values(by='FeatureImportance(EU)',ascending=False).style.background_gradient(cmap='Blues'))
importances_eu.sort_values(by='FeatureImportance(EU)').plot.barh(color='blue',alpha=0.75)
plt.title('Feature Importance(MDI)\n[EU criteria]')
plt.legend(loc='lower right')
plt.show()
# Permutation importance of the tuned US-criteria model, scored by ROC AUC on the test set
permuter_us = PermutationImportance(
    xgb_us_tun,
    scoring='roc_auc',
    n_iter=10,
    random_state=42
)
permuter_us.fit(X_test,y_test_us)
features_us = X_test.columns.to_list()
display(
    eli5.show_weights(
        permuter_us,
        top=None,
        feature_names=features_us
    ))
pi_us = pd.Series(permuter_us.feature_importances_,features_us).sort_values()
pi_us.plot.barh(color='red',alpha=0.75,label='US criteria')
plt.legend(loc='lower right')
plt.title('Permutation Importance\n[US criteria]')
plt.show()
# Permutation importance of the tuned EU-criteria model, scored by ROC AUC on the test set
permuter_eu = PermutationImportance(
    xgb_eu_tun,
    scoring='roc_auc',
    n_iter=10,
    random_state=42
)
permuter_eu.fit(X_test,y_test_eu)
features_eu = X_test.columns.to_list()
display(
    eli5.show_weights(
        permuter_eu,
        top=None,
        feature_names=features_eu
    ))
pi_eu = pd.Series(permuter_eu.feature_importances_,features_eu).sort_values()
# The legend label now matches the EU model (the original said 'US criteria')
pi_eu.plot.barh(color='blue',alpha=0.75,label='EU criteria')
plt.legend(loc='lower right')
plt.title('Permutation Importance\n[EU criteria]')
plt.show()
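For readers in non-data roles, the idea behind the permutation importance above can be sketched from scratch. This is only an illustration under stated assumptions, not the eli5 implementation: the toy data, `toy_model`, and the helper names are hypothetical, and plain accuracy is used as the score instead of the `roc_auc` scoring used in the project.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_iter=10, seed=42):
    """Mean drop in accuracy per column after shuffling that column n_iter times."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_iter):
            # Shuffle one column only, leaving every other column untouched
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_iter)
    return importances

# Hypothetical toy data: column 0 plays the role of "age", column 1 is pure noise
data_rng = random.Random(0)
X = [[data_rng.randint(20, 80), data_rng.random()] for _ in range(200)]
y = [int(row[0] >= 50) for row in X]

def toy_model(row):
    # Stand-in for a fitted classifier; it looks at the first column only
    return int(row[0] >= 50)

importances = permutation_importance(toy_model, X, y)
print(importances)
```

A column that the model never uses gets an importance of exactly zero, because shuffling it cannot change any prediction, while a column the model relies on loses score once its values are scrambled.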
# Partial dependence of the US-criteria model on age
# (when cust_grid_points is given, it is used as the grid instead of grid_type)
pdp_age_us = pdp_isolate(
    model=xgb_us_tun,
    dataset=X_test,
    model_features=X_test.columns,
    feature='age',
    grid_type='percentile',
    cust_grid_points=[20,30,40,50,60,70,80]
)
pdp_plot(pdp_age_us,
         'Age',
         plot_pts_dist=True,
         figsize=(12,8),
         plot_params={
             'title':'PDP for feature "Age"\n[US criteria]',
             'subtitle':'',
             'fill_color':'red',
             'fill_alpha':0.1,
             'pdp_color':'red'
         })
plt.show()
# Partial dependence of the EU-criteria model on age
pdp_age_eu = pdp_isolate(
    model=xgb_eu_tun,
    dataset=X_test,
    model_features=X_test.columns,
    feature='age',
    grid_type='percentile',
    cust_grid_points=[20,30,40,50,60,70,80]
)
pdp_plot(pdp_age_eu,
         'Age',
         plot_pts_dist=True,
         figsize=(12,8),
         plot_params={
             'title':'PDP for feature "Age"\n[EU criteria]',
             'subtitle':'',
             'fill_color':'blue',
             'fill_alpha':0.1,
             'pdp_color':'blue'
         })
plt.show()
# Partial dependence of the US-criteria model on sex
pdp_sex_us = pdp_isolate(
    model=xgb_us_tun,
    dataset=X_test,
    model_features=X_test.columns,
    feature='sex',
    grid_type='percentile',
    cust_grid_points=[1,2]
)
pdp_plot(pdp_sex_us,
         'Sex\n[ 1 : Man , 2 : Woman ]',
         figsize=(12,8),
         plot_params={
             'title':'PDP for feature "Sex"\n[US criteria]',
             'subtitle':'',
             'fill_color':'red',
             'fill_alpha':0.1,
             'pdp_color':'red'
         })
plt.show()
# Partial dependence of the EU-criteria model on sex
pdp_sex_eu = pdp_isolate(
    model=xgb_eu_tun,
    dataset=X_test,
    model_features=X_test.columns,
    feature='sex',
    grid_type='percentile',
    cust_grid_points=[1,2]
)
pdp_plot(pdp_sex_eu,
         'Sex\n[ 1 : Man , 2 : Woman ]',
         plot_pts_dist=True,
         figsize=(12,8),
         plot_params={
             'title':'PDP for feature "Sex"\n[EU criteria]',
             'subtitle':'',
             'fill_color':'blue',
             'fill_alpha':0.1,
             'pdp_color':'blue'
         })
plt.show()
# Partial dependence of the US-criteria model on family history of hypertension
pdp_gen_us = pdp_isolate(
    model=xgb_us_tun,
    dataset=X_test,
    model_features=X_test.columns,
    feature='genetic_hbp',
    grid_type='percentile',
    cust_grid_points=[1,1.5,2]
)
pdp_plot(pdp_gen_us,
         'Genetic_HBP\n[ 1 : No , 1.5 : Either , 2 : Both ]',
         plot_pts_dist=True,
         figsize=(12,8),
         plot_params={
             'title':'PDP for feature "Genetic_HBP"\n[US criteria]',
             'subtitle':'',
             'fill_color':'red',
             'fill_alpha':0.1,
             'pdp_color':'red'
         })
plt.show()
# Partial dependence of the EU-criteria model on family history of hypertension
pdp_gen_eu = pdp_isolate(
    model=xgb_eu_tun,
    dataset=X_test,
    model_features=X_test.columns,
    feature='genetic_hbp',
    grid_type='percentile',
    cust_grid_points=[1,1.5,2]
)
pdp_plot(pdp_gen_eu,
         'Genetic_HBP\n[ 1 : No , 1.5 : Either , 2 : Both ]',
         plot_pts_dist=True,
         figsize=(12,8),
         plot_params={
             'title':'PDP for feature "Genetic_HBP"\n[EU criteria]',
             'subtitle':'',
             'fill_color':'blue',
             'fill_alpha':0.1,
             'pdp_color':'blue'
         })
plt.show()
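What the PDP calls above compute can be sketched without pdpbox. This is a minimal illustration under stated assumptions, not the library's code: `toy_proba`, the four sample rows, and the helper name are hypothetical. Each grid value is forced onto every row, the predictions are averaged, and the curve is then shifted to start at zero, which is how I understand `pdp_plot` to draw it by default.

```python
def partial_dependence(predict_proba, X, col, grid):
    """Average predicted probability when column `col` is forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite one column for every row, keeping the other columns as observed
        X_mod = [row[:col] + [value] + row[col + 1:] for row in X]
        preds = [predict_proba(row) for row in X_mod]
        curve.append(sum(preds) / len(preds))
    return curve

def toy_proba(row):
    # Hypothetical stand-in for a classifier's positive-class probability;
    # it rises linearly with age (column 0) and BMI (column 1), clipped to [0, 1]
    age, bmi = row
    return min(1.0, max(0.0, 0.01 * (age - 20) + 0.01 * (bmi - 20)))

# Hypothetical sample rows: [age, BMI]
X = [[25, 22.0], [40, 24.0], [55, 27.0], [70, 30.0]]
grid = [20, 30, 40, 50, 60, 70, 80]  # same age grid as cust_grid_points above

pd_curve = partial_dependence(toy_proba, X, col=0, grid=grid)
# Center the curve on the first grid point so it reads as "change relative to age 20"
centered = [p - pd_curve[0] for p in pd_curve]
print(centered)
```

Reading the centered curve, each value is the average change in predicted probability relative to the first grid point, with all other features held at their observed values.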
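The source code is nearly identical to the PDP code for the non-modifiable factors above, so only the plots are posted here.
Triglycerides [continuous variable]
Hypercholesterolemia [categorical variable]
Diabetes [categorical variable]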
The source code is nearly identical to the PDP code for the non-modifiable factors above, so only the plots are posted here.
BMI [continuous variable]
Heavy Drink [categorical variable]