Write-up and retrospective of an individual Machine Learning project on a free topic, carried out in Section 2 of the Codestates AI Bootcamp. (1) Introduction and data preparation
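A self-chosen dataset on a free topic is used.
Performance and insights are derived and shared through a machine learning prediction model.
The work runs from preprocessing/EDA through model interpretation.
The audience is assumed to work in non-data roles.
Introduction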
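Hypertension
Diagnostic criteria for hypertension
Problem definition and goals
Selected dataset
Target and features
Data Preparation
Data preprocessing
Final data
Modeling
Data split
Baseline model and evaluation metrics
Modeling with default settings
Model tuning
Final model and test
Interpretation
Feature importance
Permutation importance
Feature analysis
Conclusion
Summary
Conclusion and limitations
Retrospective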
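"Silent killer"
Hypertension with no clearly identified cause
Hypertension has been diagnosed under the criterion of systolic blood pressure 140 mmHg or higher OR diastolic blood pressure 90 mmHg or higher.
The stricter criterion (labelled US in this post) lowers this to systolic blood pressure 130 mmHg or higher OR diastolic blood pressure 80 mmHg or higher.
The conservative position (labelled EU in this post) keeps systolic blood pressure 140 mmHg or higher OR diastolic blood pressure 90 mmHg or higher.
Screening people at high risk of hypertension and preventing hypertension is an important problem.
A prediction model and a factor analysis built with AI machine learning are needed.
Prediction models built under each diagnostic criterion need to be developed and compared.
import pandas as pd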
# from pandas_profiling import ProfileReport
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'Malgun Gothic'
plt.rcParams['axes.unicode_minus'] = False
import seaborn as sns
raw_2019 = pd.read_csv('data/2019_utf8.csv')
raw_2019.shape
#output
'''
(8110, 831)
'''
raw_2020 = pd.read_csv('data/2020_utf8.csv')
raw_2020.shape
#output
'''
(7359, 762)
'''
demo_list = ['ID',
'year',
'sex',
'age']
ques_list = ['DI1_2',
'DI2_2',
'DE1_31',
'DE1_32',
'DE1_dg',
'BD1_11',
'BD2_1',
'sm_presnt']
exam_list = ['HE_HPfh1',
'HE_HPfh2',
'HE_sbp',
'HE_dbp',
'HE_wt',
'HE_wc',
'HE_BMI',
'HE_glu',
'HE_HbA1c',
'HE_chol',
'HE_TG']
column_list = demo_list+ques_list+exam_list
raw2_2019 = raw_2019[column_list]
raw2_2020 = raw_2020[column_list]
print(raw2_2019.shape)
print(raw2_2020.shape)
#output
'''
(8110, 23)
(7359, 23)
'''
raw_data = pd.concat([raw2_2019,raw2_2020],ignore_index=True)
raw_data.shape
#output
'''
(15469, 23)
'''
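Missing values are recorded as '.', so they were converted to np.nan.
# define function
def fillnan(value):
    if value == '.':
        value = np.nan
    return value
# apply function (on pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap)
nanfill = raw_data.copy()
filled = nanfill.applymap(fillnan)
# check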
filled.info()
#output
'''
RangeIndex: 15469 entries, 0 to 15468
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 15469 non-null object
1 year 15469 non-null int64
2 sex 15469 non-null int64
3 age 15469 non-null int64
4 DI1_2 14802 non-null object
5 DI2_2 14802 non-null object
6 DE1_31 14802 non-null object
7 DE1_32 14802 non-null object
8 DE1_dg 14802 non-null object
9 BD1_11 14802 non-null object
10 BD2_1 14802 non-null object
11 sm_presnt 12048 non-null object
12 HE_HPfh1 13409 non-null object
13 HE_HPfh2 13409 non-null object
14 HE_sbp 13479 non-null object
15 HE_dbp 13479 non-null object
16 HE_wt 14760 non-null object
17 HE_wc 14073 non-null object
18 HE_BMI 14659 non-null object
19 HE_glu 13101 non-null object
20 HE_HbA1c 13097 non-null object
21 HE_chol 13101 non-null object
22 HE_TG 13101 non-null object
dtypes: int64(3), object(20)
memory usage: 2.7+ MB
'''
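As an alternative to the element-wise function above, the same conversion can be sketched with `DataFrame.replace`, or handled at load time through the `na_values` argument of `pd.read_csv`. The toy CSV below is made up for illustration and is not KNHANES data.

```python
import io

import numpy as np
import pandas as pd

# Toy CSV (hypothetical values) in which missing entries are written as '.'
csv_text = "ID,age,HE_sbp\nA1,40,120\nA2,55,.\nA3,.,135\n"

# Option 1: load everything as text, then replace '.' with NaN
toy = pd.read_csv(io.StringIO(csv_text), dtype=str)
toy_replaced = toy.replace('.', np.nan)

# Option 2: declare '.' as a missing-value marker while loading,
# which also lets pandas infer numeric dtypes directly
toy_loaded = pd.read_csv(io.StringIO(csv_text), na_values='.')

print(toy_replaced.isna().sum())
print(toy_loaded.dtypes)
```

With the second option the numeric columns arrive as float64, so a separate numeric conversion step may not be needed.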
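# apply numeric conversion to every column except ID
fix1 = filled.copy()
value_cols = list(fix1.columns[1:])
fix1[value_cols] = fix1[value_cols].apply(pd.to_numeric)  # assign by label so the converted dtypes replace the object columns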
fix1.info()
#output
'''
RangeIndex: 15469 entries, 0 to 15468
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 15469 non-null object
1 year 15469 non-null int64
2 sex 15469 non-null int64
3 age 15469 non-null int64
4 DI1_2 14802 non-null float64
5 DI2_2 14802 non-null float64
6 DE1_31 14802 non-null float64
7 DE1_32 14802 non-null float64
8 DE1_dg 14802 non-null float64
9 BD1_11 14802 non-null float64
10 BD2_1 14802 non-null float64
11 sm_presnt 12048 non-null float64
12 HE_HPfh1 13409 non-null float64
13 HE_HPfh2 13409 non-null float64
14 HE_sbp 13479 non-null float64
15 HE_dbp 13479 non-null float64
16 HE_wt 14760 non-null float64
17 HE_wc 14073 non-null float64
18 HE_BMI 14659 non-null float64
19 HE_glu 13101 non-null float64
20 HE_HbA1c 13097 non-null float64
21 HE_chol 13101 non-null float64
22 HE_TG 13101 non-null float64
dtypes: float64(19), int64(3), object(1)
memory usage: 2.7+ MB
'''
#drop
fix2 = fix1.copy()
fix2 = fix2.dropna().reset_index(drop=True)
fix2.info()
#output
'''
RangeIndex: 11573 entries, 0 to 11572
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 11573 non-null object
1 year 11573 non-null int64
2 sex 11573 non-null int64
3 age 11573 non-null int64
4 DI1_2 11573 non-null float64
5 DI2_2 11573 non-null float64
6 DE1_31 11573 non-null float64
7 DE1_32 11573 non-null float64
8 DE1_dg 11573 non-null float64
9 BD1_11 11573 non-null float64
10 BD2_1 11573 non-null float64
11 sm_presnt 11573 non-null float64
12 HE_HPfh1 11573 non-null float64
13 HE_HPfh2 11573 non-null float64
14 HE_sbp 11573 non-null float64
15 HE_dbp 11573 non-null float64
16 HE_wt 11573 non-null float64
17 HE_wc 11573 non-null float64
18 HE_BMI 11573 non-null float64
19 HE_glu 11573 non-null float64
20 HE_HbA1c 11573 non-null float64
21 HE_chol 11573 non-null float64
22 HE_TG 11573 non-null float64
dtypes: float64(19), int64(3), object(1)
memory usage: 2.0+ MB
'''
target_list = ['DI1_2','HE_sbp','HE_dbp']
target_df = fix2[target_list]
target_df.head()
# build the medication flag (DI1_2 below 5 is treated as taking blood pressure medication)
bp_drug = []
for i in fix2.DI1_2:
    if i < 5:
        bp_drug.append(1)
    else:
        bp_drug.append(0)
bp_drug = np.array(bp_drug)
# apply
target_df2 = target_df.copy()
target_df2['bp_drug'] = bp_drug
target_df2.head()
# build list: normal (0) only when both pressures are below the US cutoffs and no medication is taken
hbp_us = []
for i in range(len(target_df2)):
    if (target_df2.loc[i,'HE_sbp']<130) & (target_df2.loc[i,'HE_dbp']<80) & (target_df2.loc[i,'bp_drug']==0):
        hbp_us.append(0)
    else:
        hbp_us.append(1)
hbp_us = np.array(hbp_us)
# apply
target_df3 = target_df2.copy()
target_df3['HBP_US'] = hbp_us
target_df3.head()
# build list: normal (0) only when both pressures are below the EU cutoffs and no medication is taken
hbp_eu = []
for i in range(len(target_df2)):
    if (target_df2.loc[i,'HE_sbp']<140) & (target_df2.loc[i,'HE_dbp']<90) & (target_df2.loc[i,'bp_drug']==0):
        hbp_eu.append(0)
    else:
        hbp_eu.append(1)
hbp_eu = np.array(hbp_eu)
# apply
target_df4 = target_df3.copy()
target_df4['HBP_EU'] = hbp_eu
target_df4.head()
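The three loops above apply one rule with different cutoffs, so they can also be written as a single vectorized helper. The following is a minimal sketch on made-up rows; `label_hypertension` is a hypothetical name introduced here, not part of the original project.

```python
import pandas as pd

# Toy rows (hypothetical values): medication code, systolic and diastolic pressure
toy = pd.DataFrame({
    'DI1_2': [8, 8, 8, 1],
    'HE_sbp': [118.0, 134.0, 150.0, 120.0],
    'HE_dbp': [76.0, 78.0, 95.0, 70.0],
})

def label_hypertension(df, sbp_cut, dbp_cut):
    """Return 1 unless both pressures are below their cutoffs and no medication is taken."""
    on_drug = df['DI1_2'] < 5  # same rule as bp_drug above
    normal = (df['HE_sbp'] < sbp_cut) & (df['HE_dbp'] < dbp_cut) & ~on_drug
    return (~normal).astype(int)

toy['HBP_US'] = label_hypertension(toy, 130, 80)
toy['HBP_EU'] = label_hypertension(toy, 140, 90)
print(toy)
```

The second toy row (134/78, no medication) is the case that separates the two criteria: it counts as hypertension under the 130/80 cutoffs but as normal under 140/90.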
# confusion matrix
from sklearn.metrics import confusion_matrix
cfm = confusion_matrix(target_df4.HBP_US,target_df4.HBP_EU)
# group names, counts, and percentages
# cells in row-major order; the second cell (normal under US, hypertension under EU) cannot occur because the US cutoffs are stricter
group_names = ['Normal','-','Hypertension (US only)','Hypertension (both)']
group_counts = ['{0:0.0f}'.format(value) for value in cfm.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in cfm.flatten()/np.sum(cfm)]
# bundle into labels, set axis tick labels
labels = [f'{v1}\n\n{v2}\n\n{v3}' for v1,v2,v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
tick = ['Normal','Hypertension']
# draw heatmap
sns.heatmap(cfm, annot=labels, fmt='',cmap='Purples',xticklabels=tick,yticklabels=tick)
plt.xlabel('EU criteria')
plt.ylabel('US criteria')
plt.title('Distribution by diagnostic criterion\n[US : 130mmHg/80mmHg]\n[EU : 140mmHg/90mmHg]')
plt.show()
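The counts and percentages annotated on the heatmap above can also be read off as plain tables with `pd.crosstab`, which may be handy when no plot is needed. A small sketch with made-up labels:

```python
import pandas as pd

# Toy labels (hypothetical values) for the two criteria
toy = pd.DataFrame({
    'HBP_US': [0, 1, 1, 1, 0, 1],
    'HBP_EU': [0, 0, 1, 1, 0, 0],
})

# rows: US criterion, columns: EU criterion
counts = pd.crosstab(toy['HBP_US'], toy['HBP_EU'])
# normalize='all' divides every cell by the grand total
shares = pd.crosstab(toy['HBP_US'], toy['HBP_EU'], normalize='all')

print(counts)
print(shares.round(3))
```

The cell at row 0, column 1 stays at zero, which matches the expectation that nobody is normal under the stricter cutoffs yet hypertensive under the looser ones.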
feature_df = fix2.drop(columns=target_list,axis=1)
feature_df.shape
#output
'''
(11573, 20)
'''
feature_df0 = feature_df[demo_list]
feature_df0.columns = ['ID','year','sex','age']
feature_df0.sample(5,random_state=42)
# build list
heavy_drink = []
for i in range(len(feature_df)):
    if (feature_df.loc[i,'sex']==1) & (feature_df.loc[i,'BD1_11'] in [3,4,5,6]) & (feature_df.loc[i,'BD2_1'] in [4,5]):
        heavy_drink.append(1)
    elif (feature_df.loc[i,'sex']==2) & (feature_df.loc[i,'BD1_11'] in [3,4,5,6]) & (feature_df.loc[i,'BD2_1'] in [3,4,5]):
        heavy_drink.append(1)
    else:
        heavy_drink.append(0)
heavy_drink = np.array(heavy_drink)
# add column
feature_df1 = feature_df0.copy()
feature_df1['heavy_drink'] = heavy_drink
feature_df1.sample(5,random_state=42)
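The row-by-row loop above can be expressed with boolean masks and `isin`, which tends to be faster on a table of this size and keeps the two sex-specific branches visible. A minimal sketch on made-up rows, using the same codes as the loop:

```python
import pandas as pd

# Toy rows (hypothetical values): sex, drinking frequency code, amount-per-occasion code
toy = pd.DataFrame({
    'sex': [1, 1, 2, 2],
    'BD1_11': [4, 4, 3, 1],
    'BD2_1': [4, 3, 3, 5],
})

frequent = toy['BD1_11'].isin([3, 4, 5, 6])
# amount codes differ by sex, mirroring the if / elif branches of the loop
male_heavy = (toy['sex'] == 1) & frequent & toy['BD2_1'].isin([4, 5])
female_heavy = (toy['sex'] == 2) & frequent & toy['BD2_1'].isin([3, 4, 5])

toy['heavy_drink'] = (male_heavy | female_heavy).astype(int)
print(toy)
```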
# add column
feature_df2 = feature_df1.copy()
feature_df2['smoke'] = feature_df['sm_presnt'].astype(int)
feature_df2.sample(5,random_state=42)
# build list
genetic = []
for i in range(len(feature_df)):
    if (feature_df.loc[i,'HE_HPfh1']==1) & (feature_df.loc[i,'HE_HPfh2']==1):
        genetic.append(2)
    elif (feature_df.loc[i,'HE_HPfh1']==1) & (feature_df.loc[i,'HE_HPfh2']!=1):
        genetic.append(1.5)
    elif (feature_df.loc[i,'HE_HPfh1']!=1) & (feature_df.loc[i,'HE_HPfh2']==1):
        genetic.append(1.5)
    else:
        genetic.append(1)
genetic = np.array(genetic)
# add column
feature_df3 = feature_df2.copy()
feature_df3['genetic_hbp'] = genetic
feature_df3.sample(5,random_state=42)
# add columns
exam_list0 = ['HE_wt','HE_wc','HE_BMI']
exam_list1 = ['weight','waist','BMI']
feature_df4 = feature_df3.copy()
feature_df4[exam_list1] = feature_df[exam_list0].astype(float)
feature_df4.sample(5,random_state=42)
# build list
diabetes = []
for i in range(len(feature_df)):
    if (feature_df.loc[i,'DE1_31']!=1) & (feature_df.loc[i,'DE1_32']!=1) & (feature_df.loc[i,'DE1_dg']!=1) & (feature_df.loc[i,'HE_glu']<126) & (feature_df.loc[i,'HE_HbA1c']<6.5):
        diabetes.append(0)
    else:
        diabetes.append(1)
diabetes = np.array(diabetes)
# add column
feature_df5 = feature_df4.copy()
feature_df5['diabetes'] = diabetes
feature_df5.sample(5,random_state=42)
# build list
hyper_chol = []
for i in range(len(feature_df)):
    if (feature_df.loc[i,'DI2_2'] > 4) & (feature_df.loc[i,'HE_chol'] < 240):
        hyper_chol.append(0)
    else:
        hyper_chol.append(1)
hyper_chol = np.array(hyper_chol)
# add column
feature_df6 = feature_df5.copy()
feature_df6['hyper_chol'] = hyper_chol
feature_df6.sample(5,random_state=42)
# add column
feature_df7 = feature_df6.copy()
feature_df7['triglycerides'] = feature_df['HE_TG'].astype(float)
feature_df7.sample(5,random_state=42)
# add columns
fix_data0 = feature_df7.copy()
fix_data0[['HBP_US','HBP_EU']] = target_df4[['HBP_US','HBP_EU']]
fix_data0.info()
#output
'''
RangeIndex: 11573 entries, 0 to 11572
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 11573 non-null object
1 year 11573 non-null int64
2 sex 11573 non-null int64
3 age 11573 non-null int64
4 heavy_drink 11573 non-null int32
5 smoke 11573 non-null int32
6 genetic_hbp 11573 non-null float64
7 weight 11573 non-null float64
8 waist 11573 non-null float64
9 BMI 11573 non-null float64
10 diabetes 11573 non-null int32
11 hyper_chol 11573 non-null int32
12 triglycerides 11573 non-null float64
13 HBP_US 11573 non-null int32
14 HBP_EU 11573 non-null int32
dtypes: float64(5), int32(6), int64(3), object(1)
memory usage: 1.1+ MB
'''
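corr0 = fix_data0.drop(columns='ID').corr(method='pearson')  # exclude the string ID column, which recent pandas versions refuse to correlate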
corr0.iloc[1:,-2:].style.background_gradient(cmap='coolwarm',vmin=-1,vmax=1)
check_features0 = fix_data0.iloc[:,2:-2]
feat_corr0 = check_features0.corr(method='pearson')
feat_corr0.style.background_gradient(cmap='coolwarm',vmin=-1,vmax=1)
plt.figure(figsize=(12,8))
sns.heatmap(feat_corr0.round(2),
cmap='coolwarm',
annot=True,
linewidths=.5,
vmin=-1,
vmax=1)
plt.show()
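Besides reading the heatmap by eye, strongly correlated feature pairs can be listed programmatically, which helps justify which columns to drop. The sketch below uses synthetic data and a cutoff of 0.8; both the data and the cutoff are assumptions for illustration, not values taken from this project.

```python
import numpy as np
import pandas as pd

# Synthetic features (hypothetical): weight is built to track BMI closely
rng = np.random.default_rng(0)
bmi = rng.normal(24, 3, size=200)
toy = pd.DataFrame({
    'BMI': bmi,
    'weight': bmi * 2.8 + rng.normal(0, 1, size=200),
    'age': rng.integers(20, 80, size=200).astype(float),
})

corr = toy.corr(method='pearson')
# keep only the strict upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()

threshold = 0.8  # hypothetical cutoff
high = pairs[pairs.abs() >= threshold]
print(high)
```

Any pair reported here is a candidate for keeping only one of the two columns.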
drop_list = ['weight','waist']
fix_data = fix_data0.copy()
fix_data = fix_data.drop(columns=drop_list,axis=1)
fix_data.info()
#output
'''
RangeIndex: 11573 entries, 0 to 11572
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 11573 non-null object
1 year 11573 non-null int64
2 sex 11573 non-null int64
3 age 11573 non-null int64
4 heavy_drink 11573 non-null int32
5 smoke 11573 non-null int32
6 genetic_hbp 11573 non-null float64
7 BMI 11573 non-null float64
8 diabetes 11573 non-null int32
9 hyper_chol 11573 non-null int32
10 triglycerides 11573 non-null float64
11 HBP_US 11573 non-null int32
12 HBP_EU 11573 non-null int32
dtypes: float64(3), int32(6), int64(3), object(1)
memory usage: 904.3+ KB
'''
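# assumed definition: correlation among the remaining features, computed the same way as feat_corr0
corr = fix_data.iloc[:,2:-2].corr(method='pearson')
plt.figure(figsize=(12,8))
sns.heatmap(corr.round(2),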
cmap='coolwarm',
annot=True,
linewidths=.5,
vmin=-1,
vmax=1)
plt.show()
sns.histplot(x='triglycerides',data=fix_data)
plt.axvline(200,label='Hypertriglyceridemia threshold (200 mg/dL or higher)',linestyle='--',color='red',alpha=0.5)
plt.axvline(500,label='Outlier threshold (500 mg/dL or higher)',linestyle='-',color='red',alpha=1)
plt.legend()
plt.title('triglycerides')
plt.show()
sns.histplot(x='BMI',data=fix_data)
plt.axvline(25,label='Class 1 obesity (BMI 25 or higher)',linestyle='--',color='red',alpha=0.4)
plt.axvline(30,label='Class 2 obesity (BMI 30 or higher)',linestyle='--',color='red',alpha=0.6)
plt.axvline(35,label='Class 3 obesity (BMI 35 or higher)',linestyle='--',color='red',alpha=0.8)
plt.axvline(40,label='Outlier threshold (BMI 40 or higher)',linestyle='-',color='red',alpha=1)
plt.legend()
plt.title('BMI')
plt.show()
# query the outliers and drop them one step at a time
outlier_BMI = fix_data.query('BMI >= 40')
fix_data1 = fix_data.drop(index=outlier_BMI.index).reset_index(drop=True)
# fix_data1 must exist before it is queried
outlier_chol = fix_data1.query('triglycerides >= 500')
fix_data2 = fix_data1.drop(index=outlier_chol.index).reset_index(drop=True)
# check the change in shape
print(fix_data.shape)
print(fix_data2.shape)
#output
'''
(11573, 13)
(11422, 13)
'''
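The two-step drop above can also be sketched as one boolean mask, which avoids tracking intermediate indices between steps. The rows below are made up; the cutoffs are the same ones used above (BMI 40 and triglycerides 500).

```python
import pandas as pd

# Toy rows (hypothetical values)
toy = pd.DataFrame({
    'BMI': [22.0, 41.5, 27.3, 30.1],
    'triglycerides': [90.0, 150.0, 620.0, 499.0],
})

# keep a row only when it is below both outlier cutoffs
keep = (toy['BMI'] < 40) & (toy['triglycerides'] < 500)
trimmed = toy[keep].reset_index(drop=True)

print(toy.shape, trimmed.shape)
```

Because missing values were already dropped earlier, keeping rows below both cutoffs gives the same result as dropping the two outlier groups in sequence.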
final_data = fix_data2.copy()
final_data.info()
#output
'''
RangeIndex: 11422 entries, 0 to 11421
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 11422 non-null object
1 year 11422 non-null int64
2 sex 11422 non-null int64
3 age 11422 non-null int64
4 heavy_drink 11422 non-null int32
5 smoke 11422 non-null int32
6 genetic_hbp 11422 non-null float64
7 BMI 11422 non-null float64
8 diabetes 11422 non-null int32
9 hyper_chol 11422 non-null int32
10 triglycerides 11422 non-null float64
11 HBP_US 11422 non-null int32
12 HBP_EU 11422 non-null int32
dtypes: float64(3), int32(6), int64(3), object(1)
memory usage: 892.5+ KB
'''
cat_cols = ['sex','heavy_drink','smoke','genetic_hbp','diabetes','hyper_chol']
for i in cat_cols:
    ax = sns.countplot(x=i,data=final_data)
    ax.bar_label(ax.containers[0])
    plt.title(i)
    plt.show()
num_cols = ['age','BMI','triglycerides']
for i in num_cols:
    sns.histplot(x=i,data=final_data)
    plt.title(i)
    plt.show()
# define the feature data and build the correlation DataFrame
final_feature = final_data.drop(["ID","year","HBP_US","HBP_EU"],axis=1)
corr_final = final_feature.corr(method='pearson')
#plotting
fig, ax = plt.subplots(figsize=(12,8))
mask = np.zeros_like(corr_final,dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr_final,
cmap='coolwarm',
annot=True,
mask=mask,
linewidths=.5,
cbar_kws={'shrink':.5},
vmin=-1,
vmax=1)
plt.show()
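No notable imbalance was judged to be present, so no separate handling for it was considered necessary in the later modeling.
Distribution by diagnostic criterion (sklearn, matplotlib, seaborn)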
#confusion matrix
cfm_final = confusion_matrix(final_data.HBP_US,final_data.HBP_EU)
# values to display (cells in row-major order; the second cell cannot occur because the US cutoffs are stricter)
group_names = ['Normal','-','Hypertension (US only)','Hypertension (both)']
group_counts = ['{0:0.0f}'.format(value) for value in cfm_final.flatten()]
group_percentages = ['{0:.2%}'.format(value) for value in cfm_final.flatten()/np.sum(cfm_final)]
labels = [f'{v1}\n\n{v2}\n\n{v3}' for v1,v2,v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
tick = ['Normal','Hypertension']
# plot heatmap
sns.heatmap(cfm_final, annot=labels, fmt='',cmap='Purples',xticklabels=tick,yticklabels=tick)
plt.xlabel('EU criteria')
plt.ylabel('US criteria')
plt.title('Distribution by diagnostic criterion\n[US : 130mmHg/80mmHg]\n[EU : 140mmHg/90mmHg]')
plt.show()
count_us = final_data.HBP_US.value_counts().sort_index()
count_label = ['Normal (0)','Hypertension (1)']
plt.pie(x = count_us,labels=count_label,autopct='%.2f%%',startangle=90)
plt.legend(loc = 'upper right')
plt.title('Hypertension ratio\n[US criteria]')
plt.show()
count_eu = final_data.HBP_EU.value_counts().sort_index()
count_label = ['Normal (0)','Hypertension (1)']
plt.pie(x = count_eu,labels=count_label,autopct='%.2f%%',startangle=90)
plt.legend(loc = 'upper right')
plt.title('Hypertension ratio\n[EU criteria]')
plt.show()
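The class shares shown in the pie charts can also be placed side by side as numbers, which makes the balance under each criterion easy to compare before modeling. A small sketch with made-up labels:

```python
import pandas as pd

# Toy labels (hypothetical values) for the two targets
toy = pd.DataFrame({
    'HBP_US': [0, 1, 1, 0, 1, 1, 0, 1],
    'HBP_EU': [0, 0, 1, 0, 1, 0, 0, 1],
})

# one column per target, one row per class, values are class proportions
balance = pd.DataFrame({
    col: toy[col].value_counts(normalize=True).sort_index()
    for col in ['HBP_US', 'HBP_EU']
})
print(balance.round(3))
```

In this toy data the stricter US cutoffs label a larger share of rows as hypertension than the EU cutoffs do, which is the direction one would expect on real data as well.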
final_data.to_csv('data/KNHANES_8th_final2.csv',index=False)