Feature Selection

림이 · May 9, 2024


Feature Selection

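Feature selection picks a subset of the input variables (X) and drops the ones that contribute little to the model, which reduces model complexity and can improve performance.
[Curse of dimensionality]: when the number of features is large, training gets harder and uses more resources, and performance can degrade -> feature selection techniques help mitigate this.
- Use the F statistic from ANOVA: select the variables whose within-group variance is small and whose between-group variance is large. F = (between-group variance) / (within-group variance), so a larger F means the variable separates the groups better. This works best when the variable is linearly related to the target.
- Use mutual information (an entropy-based measure) to select variables.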
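To make the two criteria concrete, here is a minimal sketch in plain Python that computes a one-way ANOVA F statistic and a mutual information score for toy data. The function names and the data are made up for illustration. scikit-learn's `f_classif` computes an F statistic of this kind per feature, while `mutual_info_classif` uses a more elaborate estimator for continuous features, so the discrete version below only illustrates the idea.

```python
import math
from collections import Counter
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a numeric feature against class labels."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    # Between-group sum of squares: distance of each group mean from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    # Within-group sum of squares: spread of the values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete variables."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())


y = [0, 0, 0, 1, 1, 1]
separating = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]    # the two groups are far apart
overlapping = [1.0, 5.0, 9.0, 2.0, 5.0, 8.0]   # both groups share the same mean

f_separating = anova_f(separating, y)
f_overlapping = anova_f(overlapping, y)
mi_perfect = mutual_information(list("aaabbb"), y)   # category determines the class

print(f_separating)    # 54.0
print(f_overlapping)   # 0.0
print(mi_perfect)      # equals log(2): the category carries one full bit about y
```

A feature with a high F value or a high mutual information score is a candidate to keep; the overlapping feature, with F = 0, tells the groups apart no better than chance on average.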

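[Example]

With surgery failure ('수술실패여부') as the target variable, we want a classification model that takes a patient's basic characteristics and predicts whether the surgery will fail or not. Build the classifier under the following conditions.

- X must include the patient's basic information (physical information + occupation) and medical history.
- Y is "수술실패여부" (surgery failure).
- Train with DecisionTreeClassifier.
- Feature engineering techniques may be applied freely.
- **The model must not overfit the training data, and the training F1 score must be at least 40%.**
- Save the trained model as model_medical.sav.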

✔︎ DataFrame used: df1 - check the columns

df1.columns

>>> Index(['Column 1', '환자ID', '수술기법', '수술시간', '수술실패여부', '신장', '연령', '재발여부', '체중',
       '헤모글로빈수치', '환자통증정도', '통증기간(월)', '혈액형', '수술일', '입원일', '퇴원일',
       'Large Lymphocyte', 'Location of herniation', 'ODI', '가족력', '간질성폐질환',
       '고혈압여부', '과거수술횟수', '당뇨여부', '말초동맥질환여부', '빈혈여부', '성별', '스테로이드치료', '신부전여부',
       '신장_duplie', '심혈관질환', '암발병여부', '연령_duplie', '우울증여부', '입원기간', '종양진행여부',
       '직업', '체중_duplie', '헤모글로빈수치_duplie', '혈전합병증여부', '환자통증정도_duplie', '흡연여부',
       '통증기간(월)_duplie', '입원일_duplie', '퇴원일_duplie', 'BMI', '연령대'],
      dtype='object')

✔︎ Separate the independent and dependent variables

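# Target: surgery failure
Y = df1['수술실패여부']
# Features: sex, age, BMI, hemoglobin level, pain level, pain duration (months), blood type, occupation, smoking status
X = df1[['성별','연령','BMI','헤모글로빈수치','환자통증정도','통증기간(월)','혈액형','직업','흡연여부']]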

✔︎ Import the required libraries

from sklearn.model_selection import train_test_split
from imblearn.pipeline import make_pipeline
from sklearn.compose   import make_column_transformer
from sklearn.impute    import KNNImputer
from sklearn.impute    import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from imblearn.combine  import SMOTETomek
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

✔︎ Split into training and test sets

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,random_state=1234)

✔︎ Separate the numeric and categorical variable lists

numeric_list = X.describe().columns
category_list = X.describe(include='object').columns

✔︎ Build the pipeline

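# Numeric columns: impute missing values with KNN, then scale to [0, 1]
numeric_pipe  = make_pipeline( KNNImputer(), MinMaxScaler())
# Categorical columns: impute with the most frequent value, then one-hot encode
category_pipe = make_pipeline( SimpleImputer(strategy='most_frequent'),
                                OneHotEncoder())
# Route each column group through its own preprocessing pipeline
prepro_pipe= make_column_transformer((numeric_pipe,numeric_list),
                                     (category_pipe,category_list))
# Baseline pipeline without feature selection (displayed only, not assigned)
make_pipeline(prepro_pipe, SMOTETomek(), DecisionTreeClassifier())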

✔︎ Import the libraries for feature selection

from sklearn.feature_selection import SelectKBest # keeps only the K highest-scoring features
from sklearn.feature_selection import f_classif # ANOVA F value of each feature against the class labels

✔︎ Declare the model pipeline

model_pipe = make_pipeline(prepro_pipe, SMOTETomek(), 
                      SelectKBest(f_classif,k=5), DecisionTreeClassifier())
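What the `SelectKBest(f_classif, k=5)` step does can be seen in isolation on a tiny array. The sketch below uses made-up toy data in which only the first column separates the classes, and keeps k=1 column. Note that in the pipeline above the selector runs after preprocessing, so k counts preprocessed columns (including each one-hot column), not the original variables.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data (hypothetical): column 0 separates the classes, column 1 does not
X_toy = np.array([[1, 5], [2, 3], [1, 4],
                  [9, 4], [10, 5], [9, 3]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Score every column with the ANOVA F value and keep the single best one
selector = SelectKBest(f_classif, k=1).fit(X_toy, y_toy)

kept = selector.get_support(indices=True)   # indices of the columns that survive
reduced = selector.transform(X_toy)         # data restricted to those columns

print(selector.scores_)   # one F score per column
print(kept)               # [0]
print(reduced.shape)      # (6, 1)
```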

✔︎ Set the hyperparameters and train

hyperparameter = {'decisiontreeclassifier__max_depth':range(10,15),
                  'decisiontreeclassifier__min_samples_split':range(10,15)}
grid_model = GridSearchCV(model_pipe, param_grid=hyperparameter,cv=3, 
                           n_jobs=-1, scoring='f1')
grid_model.fit(X_train,Y_train)
best_model = grid_model.best_estimator_
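The keys of the grid follow the `<step name>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name. As a rough sketch of the search size, the snippet below enumerates the same grid with `itertools.product`; it only counts candidates and does not train anything.

```python
from itertools import product

# Same grid as in the post; keys follow the "<step name>__<parameter>" convention
hyperparameter = {'decisiontreeclassifier__max_depth': range(10, 15),
                  'decisiontreeclassifier__min_samples_split': range(10, 15)}
cv = 3

# Every combination of the listed values becomes one candidate setting
candidates = [dict(zip(hyperparameter, values))
              for values in product(*hyperparameter.values())]

n_candidates = len(candidates)
n_cv_fits = n_candidates * cv

print(n_candidates)    # 25 candidates (5 x 5)
print(n_cv_fits)       # 75 cross-validation fits
print(candidates[0])   # first candidate: max_depth=10, min_samples_split=10
```

With the default `refit=True`, GridSearchCV then fits the best candidate once more on the whole training set, which is the model returned as `best_estimator_`.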

✔︎ Function for evaluating the model

def eval_func(model):
    # Compare predictions on the training set and on the held-out test set
    Y_train_pred = model.predict(X_train)
    Y_test_pred  = model.predict(X_test)
    print('Training performance')
    print(classification_report(Y_train, Y_train_pred))
    print('Generalization performance')
    print(classification_report(Y_test, Y_test_pred))

✔︎ Evaluation

eval_func(best_model)

>>> Training performance
              precision    recall  f1-score   support

           0       0.95      0.93      0.94      1331
           1       0.25      0.33      0.28        89

    accuracy                           0.90      1420
   macro avg       0.60      0.63      0.61      1420
weighted avg       0.91      0.90      0.90      1420

Generalization performance
              precision    recall  f1-score   support

           0       0.94      0.90      0.92       448
           1       0.02      0.04      0.03        26

    accuracy                           0.85       474
   macro avg       0.48      0.47      0.47       474
weighted avg       0.89      0.85      0.87       474
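The F1 score is the harmonic mean of precision and recall, and `scoring='f1'` evaluates it for the positive class (label 1). As a quick check, the sketch below recomputes the class-1 F1 from the precision and recall printed in the reports above; since those inputs are already rounded to two decimals, the result is approximate. By these numbers the training F1 for class 1 is about 0.28, short of the 40% target in the task, and it falls to about 0.03 on the test set, so this configuration does not yet meet the stated conditions.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision / recall of class 1 as printed in the reports (already rounded)
train_f1 = f1(0.25, 0.33)
test_f1 = f1(0.02, 0.04)

print(round(train_f1, 2))   # 0.28
print(round(test_f1, 2))    # 0.03
print(train_f1 >= 0.40)     # False: below the 40% requirement
```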

✔︎ Check which columns the feature selector chose

select_num = best_model['selectkbest'].get_support(indices=True)

select_num
>>> array([ 1,  4, 18, 23, 27], dtype=int64)

✔︎ Check the column names produced after feature engineering

pipe_col_list = best_model['columntransformer'].get_feature_names_out()
pipe_col_list

✔︎ Number the processed columns and store them in a dictionary

pipe_col_dict = dict(zip(range(0,len(pipe_col_list)) , pipe_col_list))

pipe_col_dict.get
>>> <function dict.get(key, default=None, /)>

✔︎ Check the selected variables

list(map(pipe_col_dict.get ,  select_num))
>>> ['pipeline-1__연령',
     'pipeline-1__환자통증정도',
     'pipeline-2__직업_사무직',
     'pipeline-2__직업_의료직',
     'pipeline-2__직업_학생']
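The task also asks for the trained model to be saved as model_medical.sav, a step that is not shown above. One common way is the standard `pickle` module; the sketch below is only an assumption about how that could look. It uses a small stand-in classifier and a temporary directory so that it runs on its own; with the real pipeline you would dump `best_model` instead and write to the working directory.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for best_model, fitted on toy data
stand_in = DecisionTreeClassifier(random_state=0).fit([[0], [1], [2], [3]],
                                                      [0, 0, 1, 1])

# A temporary directory keeps the sketch from writing into the working directory
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'model_medical.sav'

    # Serialize the fitted estimator to disk
    with open(model_path, 'wb') as f:
        pickle.dump(stand_in, f)

    # Load it back to confirm the file round-trips
    with open(model_path, 'rb') as f:
        restored = pickle.load(f)

restored_pred = restored.predict([[0], [3]]).tolist()
print(restored_pred)   # [0, 1]
```

Loading a pickled pipeline requires scikit-learn and imbalanced-learn to be importable, ideally in the same versions that were used for training.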
