Permutation Importance

yuns_u · August 29, 2021

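The default feature importance is quick and easy to compute, but it can give inaccurate results for some kinds of features, so it needs to be used with care.
Permutation importance gives a more reliable estimate.

Let's look at three ways to compute feature importance.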

Feature Importance (Mean Decrease Impurity, MDI; the default feature importance)

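The feature importance that sklearn's tree-based classifiers use by default is fast to compute, but its results have to be read with care. For high-cardinality features (features with many unique values) the result may be inaccurate: a feature with many categories gives the trees more ways to split, so the trees can overfit to it and credit it with a large impurity decrease. The importance is then biased toward high-cardinality features rather than toward what generalizes, and it is easy to misread.

For each feature, it is the mean decrease in impurity averaged over all trees.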
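To make "decrease in impurity" concrete, here is a minimal sketch of the quantity for a single split, assuming Gini impurity as the criterion. The helper names `gini` and `impurity_decrease` are made up for this illustration and are not sklearn functions; a forest accumulates values like this per feature and averages them over its trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples reaching the node."""
    n = len(parent)
    weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - weighted_children)


# Hypothetical node: six samples, perfectly separated by the split.
parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0], [1, 1, 1]
decrease = impurity_decrease(parent, left, right, n_total=len(parent))
print(decrease)  # 0.5
```

The feature used for this split would be credited with 0.5. A feature with many unique values has many more candidate split points, so it gets more chances to collect credit like this even when the split only fits noise.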

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Feature importances (MDI) from the fitted random forest inside the pipeline
rf = pipe.named_steps['randomforestclassifier']
importances = pd.Series(rf.feature_importances_, X_train.columns)

# Plot the top n features as a horizontal bar chart
n = 20
plt.figure(figsize=(10, n/2))
plt.title(f'Top {n} features')

importances.sort_values()[-n:].plot.barh();

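Drop-Column Importance

In theory this looks like the best method, but it is very slow because the model has to be refit after dropping each feature. With n features, n+1 fits are needed: one with all features and one for each dropped feature.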

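# OrdinalEncoder is assumed to come from category_encoders (the post does not show its imports)
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

column = 'opinion_seas_risk'

# Fit without `column`
pipe = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1)
)
pipe.fit(X_train.drop(columns=column), y_train)
score_without = pipe.score(X_val.drop(columns=column), y_val)
print(f'Validation accuracy (without {column}): {score_without}')

# Fit again with `column` included
pipe = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1)
)
pipe.fit(X_train, y_train)
score_with = pipe.score(X_val, y_val)
print(f'Validation accuracy (with {column}): {score_with}')

# Drop-column importance: accuracy with the column minus accuracy without it
print(f'Drop-column importance of {column}: {score_with - score_without}')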
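The single-column computation above extends to every feature with a loop, which is where the n+1 fits come from. The sketch below shows that loop on a toy problem using only the standard library. The lookup-table classifier in `fit`, the helper names, and the toy data are all hypothetical stand-ins for the real pipeline, chosen so the example runs on its own.

```python
from collections import Counter, defaultdict


def fit(rows, labels, features):
    """Toy stand-in for pipe.fit: majority label per combination of feature values."""
    votes = defaultdict(Counter)
    for row, y in zip(rows, labels):
        votes[tuple(row[f] for f in features)][y] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    default = Counter(labels).most_common(1)[0][0]
    return lambda row: table.get(tuple(row[f] for f in features), default)


def accuracy(model, rows, labels):
    """Fraction of rows the model labels correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def drop_column_importances(X_tr, y_tr, X_va, y_va, features):
    """Return ({feature: score_with - score_without}, number of fits performed)."""
    baseline = accuracy(fit(X_tr, y_tr, features), X_va, y_va)  # fit 1: all features
    n_fits = 1
    importances = {}
    for f in features:
        kept = [g for g in features if g != f]
        model = fit(X_tr, y_tr, kept)  # one extra fit per dropped feature
        n_fits += 1
        importances[f] = baseline - accuracy(model, X_va, y_va)
    return importances, n_fits


# Hypothetical toy data: 'signal' determines the label, 'noise' is unrelated to it.
X_toy_train = [{'signal': i % 2, 'noise': (i // 2) % 2} for i in range(8)]
y_toy_train = [row['signal'] for row in X_toy_train]
X_toy_val = [{'signal': i % 2, 'noise': (i // 2) % 2} for i in range(4)]
y_toy_val = [row['signal'] for row in X_toy_val]

importances, n_fits = drop_column_importances(
    X_toy_train, y_toy_train, X_toy_val, y_toy_val, ['signal', 'noise'])
print(importances, n_fits)
```

With two features the loop performs three fits. On a real random forest each of those fits is a full training run, which is why this method scales poorly as the number of features grows.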

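Permutation Importance (Mean Decrease Accuracy, MDA)

Permutation importance sits between the default feature importance and drop-column importance in terms of cost and reliability.
It adds random noise only to the feature of interest and measures how much the evaluation metric (accuracy, F1-score, R2, etc.) drops when the model predicts on the modified data.
Drop-column importance requires refitting the model. Permutation importance instead keeps every feature in the validation data, but replaces the values of one feature with noise so that the feature no longer carries its original information and cannot play its usual role, and then measures performance. The simplest way to add that noise is to shuffle (permute) the feature's values across the samples.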
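A small standard-library sketch of why shuffling works as "noise": the set of values in the column stays exactly the same, but the pairing between each value and its row (and therefore its target) is broken. The toy column and target below are hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy column and target: here the target simply equals the feature.
feature_values = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
target = list(feature_values)

# Shuffle a copy of the column with a fixed seed.
shuffled = list(feature_values)
random.Random(7).shuffle(shuffled)

# The marginal distribution of the column is unchanged...
same_distribution = Counter(shuffled) == Counter(feature_values)

# ...but the row-by-row agreement with the target can only stay the same or drop.
matches_before = sum(v == t for v, t in zip(feature_values, target))
matches_after = sum(v == t for v, t in zip(shuffled, target))
print(same_distribution, matches_before, matches_after)
```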

permutation_importance in the eli5 library

Computing permutation importance by hand


import numpy as np

# Pick the feature to permute
feature = 'opinion_seas_risk'
X_val[feature].head()

# Shuffle the feature's values: their order changes, but their distribution does not
X_val_permuted = X_val.copy()
X_val_permuted[feature] = np.random.RandomState(seed=7).permutation(X_val_permuted[feature])

# Score on the permuted data (no refitting needed)
score_permuted = pipe.score(X_val_permuted, y_val)

print(f'Validation accuracy (original "{feature}"): {score_with}')
print(f'Validation accuracy (permuted "{feature}"): {score_permuted}')
print(f'Permutation importance: {score_with - score_permuted}')

# Repeat for another feature (doctor_recc_seasonal)
feature = 'doctor_recc_seasonal'
X_val_permuted = X_val.copy()
X_val_permuted[feature] = np.random.RandomState(seed=7).permutation(X_val_permuted[feature])
score_permuted = pipe.score(X_val_permuted, y_val)

print(f'Validation accuracy (original "{feature}"): {score_with}')
print(f'Validation accuracy (permuted "{feature}"): {score_permuted}')
print(f'Permutation importance: {score_with - score_permuted}')
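The two single-feature computations above follow the same pattern, so they can be wrapped in a loop over all features and repeated with several shuffles to get a mean and a spread. Below is a minimal standard-library sketch of that loop. The function name `permutation_importances`, the fixed `predict` rule, and the toy data are assumptions made for the illustration. It is not the eli5 implementation.

```python
import random
from statistics import mean, stdev


def permutation_importances(predict, rows, labels, features, n_iter=5, seed=0):
    """Return {feature: (mean accuracy drop, std of the drop)} over n_iter shuffles."""
    rng = random.Random(seed)

    def accuracy(data):
        return sum(predict(r) == y for r, y in zip(data, labels)) / len(labels)

    baseline = accuracy(rows)  # score on the untouched validation rows
    result = {}
    for f in features:
        drops = []
        for _ in range(n_iter):
            column = [r[f] for r in rows]
            rng.shuffle(column)  # permute only this feature
            permuted = [{**r, f: v} for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(permuted))  # no refit, only re-score
        result[f] = (mean(drops), stdev(drops))
    return result


# Hypothetical already-fitted model: it predicts the label from 'signal' alone.
def predict(row):
    return row['signal']


# Hypothetical validation data: 'noise' is never used by the model.
rows = [{'signal': i % 2, 'noise': (i * 7) % 3} for i in range(20)]
labels = [r['signal'] for r in rows]

result = permutation_importances(predict, rows, labels, ['signal', 'noise'])
for name, (m, s) in result.items():
    print(f'{name}: {m:.4f} ± {s:.4f}')
```

Shuffling `noise` cannot change any prediction here, so its importance is exactly zero, while shuffling `signal` lowers accuracy. The model is never refit. Only the scoring step is repeated.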

Computing permutation importance with the eli5 library

Building the model

from sklearn.pipeline import Pipeline, make_pipeline

# Group the encoder and imputer into a 'preprocessing' step; it is reused later for the eli5 permutation computation.
pipe = Pipeline([
    ('preprocessing', make_pipeline(OrdinalEncoder(), SimpleImputer())),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1))
])

pipe.fit(X_train, y_train)
print('Validation accuracy: ', pipe.score(X_val, y_val))

Computing the permutation importance

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import eli5
from eli5.sklearn import PermutationImportance

# Define the permuter
permuter = PermutationImportance(
    pipe.named_steps['rf'],  # model
    scoring='accuracy',      # metric
    n_iter=5,                # repeat 5 times with different random seeds
    random_state=2
)

# The permuter works on the preprocessed X_val, because it wraps only the 'rf' step.
X_val_transformed = pipe.named_steps['preprocessing'].transform(X_val)

# Despite the name, this fit does not train the model; it re-scores it on shuffled data.
permuter.fit(X_val_transformed, y_val);

# List the importances
feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values()

# Show the score for each feature
eli5.show_weights(
    permuter,
    top=None,  # top n can be given; None shows all features
    feature_names=feature_names  # must be passed as a list
)

The call above produces the following output.

Weight             Feature
 0.0714 ± 0.0047   doctor_recc_seasonal
 0.0438 ± 0.0034   opinion_seas_vacc_effective
 0.0410 ± 0.0032   opinion_seas_risk
 0.0077 ± 0.0028   agegrp
 0.0048 ± 0.0019   opinion_seas_sick_from_vacc
 0.0031 ± 0.0026   health_worker
 0.0028 ± 0.0020   health_insurance
 0.0012 ± 0.0043   education_comp
-0.0001 ± 0.0025   household_children
-0.0002 ± 0.0028   raceeth4_i
-0.0005 ± 0.0025   inc_pov
-0.0007 ± 0.0016   rent_own_r
-0.0013 ± 0.0026   marital
-0.0014 ± 0.0008   child_under_6_months
-0.0015 ± 0.0029   behaviorals
-0.0015 ± 0.0028   census_msa
-0.0016 ± 0.0011   chronic_med_condition
-0.0018 ± 0.0021   behavioral_touch_face
-0.0021 ± 0.0016   behavioral_outside_home
-0.0022 ± 0.0009   behavioral_avoidance
-0.0024 ± 0.0010   behavioral_antiviral_meds
-0.0025 ± 0.0022   behavioral_large_gatherings
-0.0026 ± 0.0036   n_people_r
-0.0027 ± 0.0021   behavioral_wash_hands
-0.0029 ± 0.0018   state
-0.0029 ± 0.0028   sex_i
-0.0031 ± 0.0015   behavioral_face_mask
-0.0031 ± 0.0025   census_region
-0.0031 ± 0.0031   hhs_region
-0.0035 ± 0.0022   n_adult_r

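❗️ How to read the result above
Mean of Weight (the value before ±): the importance of the feature, i.e. the average drop in accuracy over the n_iter shuffles
Std of Weight (the value after ± is based on it): how stable that importance is across the shuffles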
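As a rough illustration of how the mean and the spread are read together, the sketch below summarizes per-shuffle accuracy drops for two made-up features. The numbers, the feature names, and the "mean minus two standard deviations is above zero" rule are all assumptions for the example, not values or rules taken from eli5.

```python
from statistics import mean, pstdev

# Hypothetical per-iteration accuracy drops (n_iter = 5) for two made-up features.
drops = {
    'feature_a': [0.070, 0.072, 0.068, 0.074, 0.073],
    'feature_b': [0.002, -0.003, 0.001, -0.002, 0.000],
}

summary = {}
for name, values in drops.items():
    m, s = mean(values), pstdev(values)
    # Illustrative rule of thumb: treat the feature as useful only if the
    # importance stays positive after subtracting two standard deviations.
    summary[name] = {'mean': m, 'std': s, 'clearly_positive': m - 2 * s > 0}
    print(f'{name}: {m:.4f} ± {s:.4f}')
```

Here `feature_a` has a large mean and a small spread, so its importance is stable. `feature_b` hovers around zero, and its sign flips between shuffles, which is the pattern seen in the lower rows of the table above.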

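Selecting features by importance (Feature selection)

Dropping the features whose importance is negative (in the code below, anything under a small minimum threshold) has almost no effect on performance and makes model training faster.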


# Exclude features whose importance is at or below the minimum
minimum_importance = 0.001
mask = permuter.feature_importances_ > minimum_importance
features = X_train.columns[mask]
X_train_selected = X_train[features]
X_val_selected = X_val[features]

# Redefine the pipeline
pipe = Pipeline([
    ('preprocessing', make_pipeline(OrdinalEncoder(), SimpleImputer())),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1))
], verbose=1)

pipe.fit(X_train_selected, y_train);

print('Validation accuracy: ', pipe.score(X_val_selected, y_val))
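The masking step above can be checked in isolation. The sketch below applies the same 0.001 threshold to a handful of weights copied from the eli5 output earlier in the post, using a plain dict instead of the NumPy mask so that it runs without any dependencies.

```python
# Importances copied from the eli5 output above (a subset of the features).
weights = {
    'doctor_recc_seasonal': 0.0714,
    'agegrp': 0.0077,
    'education_comp': 0.0012,
    'household_children': -0.0001,
    'n_adult_r': -0.0035,
}
minimum_importance = 0.001

# Keep only the features strictly above the threshold, preserving their order.
selected = [name for name, w in weights.items() if w > minimum_importance]
dropped = [name for name in weights if name not in selected]
print(selected)  # ['doctor_recc_seasonal', 'agegrp', 'education_comp']
print(dropped)   # ['household_children', 'n_adult_r']
```

Note that `education_comp` (0.0012) only just clears the threshold, and its spread in the table (± 0.0043) is larger than its mean, so whether it is kept depends heavily on the threshold chosen.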
