Feature Selection with Permutation Importance, Random Forest, and Confusion Matrix

Ryan · February 5, 2025

Python/Pandas


Finding Important Features

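To improve the performance of data analysis and machine learning models, selecting the most informative features is an essential step. In this study session, we learn how to find the important features in a dataset using Permutation Importance and Random Forest, and how to evaluate the resulting model with a Confusion Matrix.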

🔹 Permutation Importance

📌 What is Permutation Importance?

Permutation Importance measures how much a model's performance changes when the values of a single feature are randomly shuffled.

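1. Evaluate the baseline performance of the trained model on the original features
2. Randomly shuffle the values of one feature, leaving the other columns untouched
3. Feed the modified data to the model and measure the change in performance
4. The larger the drop in performance, the more important the feature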
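The four steps above can be sketched by hand in plain Python. This is a minimal illustration, not how scikit-learn implements it: `manual_permutation_importance`, `accuracy`, and `toy_model` are hypothetical names, and the "model" is a fixed rule standing in for a trained one so that the result is easy to check.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the label."""
    correct = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return correct / len(rows)

def manual_permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each column, one at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)            # step 1: baseline score
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled_col = [row[col] for row in rows]
            rng.shuffle(shuffled_col)                     # step 2: shuffle one column
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled_col)]
            # step 3: score the model on the permuted data
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)        # step 4: larger drop = more important
    return importances

# Toy data: the label depends only on the first column; the second is noise.
data_rng = random.Random(0)
rows = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
labels = [1 if row[0] > 0 else 0 for row in rows]

# Hypothetical stand-in for a trained model: it looks only at the first column.
def toy_model(row):
    return 1 if row[0] > 0 else 0

importances = manual_permutation_importance(toy_model, rows, labels)
print(importances)
```

Because the toy model ignores the second column, shuffling it cannot change any prediction, so its importance is exactly 0, while shuffling the first column costs a large share of the accuracy.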

This method is useful for interpreting a model's feature importance. It is model-agnostic, and it is often used together with tree-based models such as Random Forest and Gradient Boosting.

📌 Permutation Importance Practice (Python Code)

The following code measures feature importance using sklearn.inspection.permutation_importance().

python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

# Generate sample data
np.random.seed(0)
X = pd.DataFrame(np.random.randn(100, 5), columns=['X1', 'X2', 'X3', 'X4', 'X5'])
y = np.random.choice([0, 1], size=100)  # binary classification; labels are drawn independently of X

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Compute Permutation Importance on the held-out test set
perm_importance = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Print the results, sorted from most to least important
feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': perm_importance.importances_mean})
print(feature_importance.sort_values(by='Importance', ascending=False))

📌 Interpreting the Results

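The larger the Importance value, the more important the feature
Features with values that are low or close to 0 are candidates for removal
In this sample the labels y are generated independently of X, so the importances are only noise around 0 and can even be negative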
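Deciding what counts as "close to 0" is easier when the spread across repeats is taken into account. The result of `permutation_importance()` carries `importances_std` alongside `importances_mean`, and one possible heuristic (an assumption here, not a rule from this post) is to keep a feature only when its mean importance stays above zero by a couple of standard deviations. The helper name `select_features` and the numbers below are hypothetical, chosen only to illustrate the rule.

```python
def select_features(names, means, stds, n_std=2.0):
    """Keep features whose mean importance exceeds n_std standard deviations."""
    return [name for name, mean, std in zip(names, means, stds)
            if mean - n_std * std > 0]

# Hypothetical numbers for illustration only, not output from the model above.
names = ['X1', 'X2', 'X3']
means = [0.120, 0.010, -0.005]
stds = [0.020, 0.015, 0.010]

print(select_features(names, means, stds))
```

With real results, `means` and `stds` would come from `perm_importance.importances_mean` and `perm_importance.importances_std`.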

🔹 Feature Importance with Random Forest

📌 What is Random Forest?

Random Forest is an ensemble learning method that makes predictions by combining many decision trees.

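Each tree is trained on a randomly drawn sample of the data
The predictions of the trees are combined by averaging or by majority vote to produce the final prediction
Feature importances are computed automatically during training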
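The first two points can be sketched in a few lines of plain Python. This is a conceptual illustration only, not scikit-learn's internals; `bootstrap_indices` and `majority_vote` are hypothetical helper names.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_indices(10, rng)
print(sample)  # indices may repeat, and some rows may be left out entirely

# Pretend five trees predicted these classes for one test row.
tree_predictions = [1, 0, 1, 1, 0]
print(majority_vote(tree_predictions))
```

Each tree would be trained on the rows selected by its own bootstrap sample, which is what makes the trees differ from one another.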

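With this method, you can analyze which features the model relied on most and use that for feature selection. In scikit-learn this is exposed as the feature_importances_ attribute, which is an impurity-based importance.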
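To give a feel for what "impurity-based" means, the sketch below computes the drop in Gini impurity for a single split, the quantity that tree-based importances accumulate per feature. It is a simplified illustration under the assumption of the Gini criterion; `gini` and `impurity_decrease` are hypothetical helper names, and the real computation also weights each split by how many samples reach the node.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Drop in Gini impurity achieved by splitting parent into left and right."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = [0, 0, 1, 1]
perfect = impurity_decrease(parent, [0, 0], [1, 1])        # classes fully separated
uninformative = impurity_decrease(parent, [0, 1], [0, 1])  # children look like the parent
print(perfect, uninformative)
```

A feature that is repeatedly chosen for splits with a large decrease ends up with a high importance, while a feature that only produces uninformative splits stays near zero.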

📌 Random Forest Practice (Python Code)

python
# Measure feature importance with Random Forest (uses the model trained above)
rf_importance = pd.DataFrame({'Feature': X.columns, 'Importance': model.feature_importances_})
print(rf_importance.sort_values(by='Importance', ascending=False))

📌 Interpreting the Results

Features with high Importance values play an important role in the model
Features with low values are candidates for removal

🔹 Evaluating Model Performance with the Confusion Matrix

📌 What is a Confusion Matrix?

A Confusion Matrix is a table used to evaluate a classification model; it summarizes how the predicted values compare with the actual values.

Predicted / Actual	Actual Positive	Actual Negative
Predicted Positive	TP (True Positive)	FP (False Positive)
Predicted Negative	FN (False Negative)	TN (True Negative)

TP (True Positive): actually Positive, predicted Positive
FP (False Positive): actually Negative, predicted Positive (false alarm)
FN (False Negative): actually Positive, predicted Negative (miss)
TN (True Negative): actually Negative, predicted Negative

From the Confusion Matrix you can compute performance metrics such as Precision, Recall, and F1-score.
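The four cells can be counted directly from a list of actual labels and a list of predictions. The sketch below is a plain-Python illustration of the definitions; `confusion_counts` is a hypothetical helper and the labels are made-up example values.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # actually Positive, predicted Positive
        elif predicted == positive:
            fp += 1   # actually Negative, predicted Positive
        elif actual == positive:
            fn += 1   # actually Positive, predicted Negative
        else:
            tn += 1   # actually Negative, predicted Negative
    return tp, fp, fn, tn

# Made-up labels for illustration only.
y_true_example = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_example = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true_example, y_pred_example)
print(tp, fp, fn, tn)
```

The four counts always add up to the number of samples, which is a quick sanity check when building the matrix by hand.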

📌 Confusion Matrix Practice (Python Code)

python
from sklearn.metrics import confusion_matrix, classification_report

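# Generate predictions
y_pred = model.predict(X_test)

# Print the Confusion Matrix
# Note: scikit-learn puts actual classes on rows and predicted classes on columns,
# so for labels 0/1 the layout is [[TN, FP], [FN, TP]]
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

# Print the Classification Report
print("\nClassification Report:\n", classification_report(y_test, y_pred))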

📌 Interpreting the Results

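Accuracy: (TP + TN) / total number of samples
Precision: TP / (TP + FP) → how many of the positive predictions are correct
Recall: TP / (TP + FN) → how many of the actual positives were found
F1-score: the harmonic mean of Precision and Recall, 2 × Precision × Recall / (Precision + Recall)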
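As a worked example of these formulas, the sketch below computes all four metrics from a set of counts. The helper name `metrics_from_counts` and the counts themselves are hypothetical, and the function assumes the denominators are non-zero.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts: 8 TP, 2 FP, 4 FN, 6 TN (20 samples in total).
acc, prec, rec, f1 = metrics_from_counts(tp=8, fp=2, fn=4, tn=6)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Here Precision (0.8) is noticeably higher than Recall (about 0.667), and the F1-score (about 0.727) falls between the two, closer to the lower value, which is the behaviour of a harmonic mean.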

📌 Conclusion and Summary

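Technique	Purpose	Interpretation
Permutation Importance	Evaluate feature importance	Measures how much model performance changes when a feature is shuffled
Random Forest Feature Importance	Model-based feature selection	Analyzes which features the model considers important
Confusion Matrix	Evaluate classification model performance	Analyzes Precision, Recall, and F1-score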

👉 Use Permutation Importance and Random Forest to select the important features in your data, and use the Confusion Matrix to analyze model performance.
