[TIL] 2025-04-22 (Advanced Project)

yeyeyeyeye · April 22, 2025

Advanced Project: Improving a LendingClub Loan Repayment Prediction Model with XGBoost

✅ 1. Applying a Baseline XGBoost Model

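Trained XGBClassifier with default parameters.
Performance metrics:
- Accuracy: 0.65
- Confusion Matrix: [[33332 15831], [73377 129092]]
- ROC-AUC: 0.7168
Recall for class 0 (failed repayment) is relatively good: 33332 of the 49163 actual class 0 loans are caught (about 0.68), slightly above the class 1 recall of about 0.64.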
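
The accuracy and per-class recall behind these numbers can be recomputed directly from the confusion matrix. A minimal sketch, assuming scikit-learn's layout (rows are true classes, columns are predicted classes); the helper name `summarize` is made up for illustration:

```python
import numpy as np

def summarize(cm):
    """Return accuracy and per-class recall for a confusion matrix.

    Assumes rows are true classes and columns are predicted classes,
    which is the layout sklearn.metrics.confusion_matrix returns.
    """
    cm = np.asarray(cm, dtype=float)
    accuracy = np.trace(cm) / cm.sum()
    recall = np.diag(cm) / cm.sum(axis=1)  # correct predictions / true count per class
    return accuracy, recall

# Confusion matrix of the baseline model reported above
baseline_cm = [[33332, 15831], [73377, 129092]]
accuracy, recall = summarize(baseline_cm)
print(f"accuracy={accuracy:.4f}, recall_0={recall[0]:.4f}, recall_1={recall[1]:.4f}")
```

Running the same helper on a later confusion matrix makes it easy to see whether an accuracy gain came at the cost of recall on one class.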

✅ 2. Handling Class Imbalance: Applying SMOTE

Used imblearn.over_sampling.SMOTE to oversample the minority class.
This equalizes the class ratio, so the model trains on a more balanced dataset.

✅ 3. Hyperparameter Tuning (RandomizedSearchCV)

Parameters used:

{
    'n_estimators': 200,
    'max_depth': 8,
    'learning_rate': 0.05,
    'colsample_bytree': 0.6,
    'subsample': 1.0
}

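Performance after tuning:
- Accuracy: 0.81
- Confusion Matrix: [[3609 45554], [2791 199678]]
- ROC-AUC: 0.7189
Accuracy rose from 0.65 to 0.81, but ROC-AUC barely moved (0.7168 → 0.7189), and class 0 prediction got much weaker: class 0 recall fell from about 0.68 to about 0.07 (3609 of 49163). The model now predicts class 1 for nearly every loan, and always predicting class 1 would already give about 0.80 accuracy on this test set, so the accuracy gain mostly reflects the class imbalance.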

✅ 4. Feature Removal Experiment

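Retrained the model after removing grade and installment.
There was no large difference in performance; the experiment was meant to improve the model's interpretability.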

✅ 5. Model Evaluation Visualization and Interpreting Important Features

Visualized the main features using xgb.feature_importances_.
Top features: grade, term, home_ownership, verification_status, purpose, etc.

6. Code

from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

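# Drop grade and installment (X3 and y are assumed to come from earlier preprocessing steps)
X3_reduced = X3.drop(columns=['grade', 'installment'])

# Split the data, keeping the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X3_reduced, y, test_size=0.2, random_state=42, stratify=y
)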

# Train the model
class_counts = y_train.value_counts()
xgb = XGBClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    # Weight for the positive class: count of class 0 / count of class 1
    scale_pos_weight=class_counts[0] / class_counts[1],
    random_state=42,
    eval_metric='logloss'
)
xgb.fit(X_train, y_train)

# Predict class labels and class 1 probabilities
y_pred = xgb.predict(X_test)
y_pred_proba = xgb.predict_proba(X_test)[:, 1]

# Evaluate
print("📦 Confusion Matrix")
print(confusion_matrix(y_test, y_pred))

print("\n📊 Classification Report")
print(classification_report(y_test, y_pred))

print(f"\n🎯 ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")