[Kaggle] All Lending Club loan

ByungJik_Oh · April 18, 2025



💡 Problem

Build a model that predicts which loans will become delinquent or go bad, based on data about approved loan applicants such as credit score, income, loan amount, purpose, and occupation.

🔥 Model used for prediction: XGBoost Classification (binary classification)


📖 Dataset

The dataset has about 150 features describing the loan applicants.


📒 Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from xgboost import XGBClassifier

from sklearn.metrics import classification_report

📚 Raw Data Loading

df = pd.read_csv('C:\\education\\accepted_2007_to_2018Q4.csv', low_memory=False)
display(df)


📚 Feature Selection

df1 = df[['loan_amnt', 'term', 'int_rate', 'grade', 'emp_length', 'home_ownership', 'annual_inc', 'purpose', 'dti', 'loan_status']]
df1.shape # (2260701, 10)

์ž„์˜๋กœ ๊ฐ€์žฅ ์—ฐ๊ด€์„ฑ ์žˆ์–ด๋ณด์ด๋Š” ์ปฌ๋Ÿผ (์ข…์†๋ณ€์ˆ˜ ํฌํ•จ) 10๊ฐœ๋ฅผ ์„ ํƒํ•˜์˜€๋‹ค.


📚 Missing Value Handling

df2 = df1.dropna(how='any', inplace=False)
df2.shape # (2113644, 10)
df2.info()
# <class 'pandas.core.frame.DataFrame'>
# Index: 2113644 entries, 0 to 2260698
# Data columns (total 10 columns):
#  #   Column          Dtype  
# ---  ------          -----  
#  0   loan_amnt       float64
#  1   term            object 
#  2   int_rate        float64
#  3   grade           object 
#  4   emp_length      object 
#  5   home_ownership  object 
#  6   annual_inc      float64
#  7   purpose         object 
#  8   dti             float64
#  9   loan_status     object 
# dtypes: float64(4), object(6)
# memory usage: 177.4+ MB

๋ฐ์ดํ„ฐ๊ฐ€ 200๋งŒ๊ฐœ ์ด์ƒ ์ถฉ๋ถ„ํžˆ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ „์ฒด ๋ฐ์ดํ„ฐ์— ์ตœ๋Œ€ํ•œ ๊ฐœ์ž…์„ ํ•˜์ง€ ์•Š๋„๋ก ์‚ญ์ œ์ฒ˜๋ฆฌ ํ•˜์˜€๋‹ค.


📚 Data Preprocessing (term)

df2['term_binary'] = np.where(df2['term'] == ' 36 months', 0, 1)
df3 = df2.drop('term', axis=1, inplace=False)
df3.head()


The term column holds only two values, 36 months and 60 months, so I converted it to 0 or 1 for training. Note that the comparison string in the code includes a leading space (' 36 months').
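Because the match depends on that leading space, a slightly more forgiving variant is to strip whitespace before comparing. A minimal sketch on an invented sample:

```python
import numpy as np
import pandas as pd

# Invented sample; the leading space mirrors the ' 36 months' string in the code.
term = pd.Series([' 36 months', ' 60 months', '36 months'])

# Strip whitespace first so the comparison does not depend on it.
term_binary = np.where(term.str.strip() == '36 months', 0, 1)
print(term_binary)
```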


📚 Data Preprocessing (grade)

df3_grade_encoded = pd.get_dummies(df3['grade'], prefix='grade').astype(int)

df4 = pd.concat([df3, df3_grade_encoded], axis=1)
df5 = df4.drop('grade', axis=1, inplace=False)
df5.head()


The grade column holds categorical values A through G. These could be mapped to the numbers 0 to 6, but numbers that merely label categories do not carry a meaningful magnitude, so they should not go through the later normalization step as if they did. For that reason I applied One-Hot Encoding instead.


📚 Data Preprocessing (home_ownership)

df5_home_encoded = pd.get_dummies(df5['home_ownership'], prefix='home_ownership').astype(int)

df6 = pd.concat([df5, df5_home_encoded], axis=1)
df7 = df6.drop('home_ownership', axis=1, inplace=False)
df7.head()


home_ownership was one-hot encoded in the same way.

📚 Data Preprocessing (purpose)

df7_purpose_encoded = pd.get_dummies(df7['purpose'], prefix='purpose').astype(int)

df8 = pd.concat([df7, df7_purpose_encoded], axis=1)
df9 = df8.drop('purpose', axis=1, inplace=False)
df9.head()


purpose was one-hot encoded in the same way.
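The three encode, concat, drop steps above follow the same pattern, so they could probably be collapsed into a single call. The sketch below uses the `columns` and `dtype` parameters of `pd.get_dummies` on an invented toy frame; it is an alternative spelling, not the code that produced the results in this post.

```python
import pandas as pd

# Invented toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A'],
    'home_ownership': ['RENT', 'OWN', 'RENT'],
    'loan_amnt': [1000.0, 2000.0, 3000.0],
})

# Encodes the listed columns, drops the originals, keeps everything else.
encoded = pd.get_dummies(toy, columns=['grade', 'home_ownership'], dtype=int)
print(encoded.columns.tolist())
```

The column name is used as the prefix by default, which matches the `prefix=` arguments used above.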


📚 Data Preprocessing (emp_length)

def convert_year(x):
    if x == '< 1 year':
        return 0
    elif x == '10+ years':
        return 10
    else:
        return int(x.split()[0])

df9['emp_length_year'] = df9['emp_length'].apply(convert_year)

The emp_length column holds the length of employment. To bin it into short, medium, and long, I first converted it to a number and stored the result in the emp_length_year column.

conditions = [(df9['emp_length_year'] <= 1),
            ((df9['emp_length_year'] >= 2) & (df9['emp_length_year'] <= 7)),
            (df9['emp_length_year'] >= 8)]
choices = ['short', 'medium', 'long']

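# Explicit string default: recent NumPy versions reject mixing string choices with the integer default 0.
df9['emp_length_year_binned'] = np.select(conditions, choices, default='unknown')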
df9.head()

df9_emp_length_year_binned_encoded = pd.get_dummies(df9['emp_length_year_binned'], prefix='emp_length_year_binned').astype(int)

df10 = pd.concat([df9, df9_emp_length_year_binned_encoded], axis=1)
df11 = df10.drop(['emp_length','emp_length_year','emp_length_year_binned'], axis=1, inplace=False)
df11


Next, the numeric emp_length_year column was binned: 1 year or less is short, 2 to 7 years is medium, and 8 years or more is long. The result was then one-hot encoded like the other categorical columns.
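The same binning can also be written with `pd.cut`, which takes the bin edges directly. A minimal sketch on invented values; the edges are chosen to reproduce the short / medium / long boundaries described above.

```python
import pandas as pd

# Invented employment lengths in years, covering both ends of every bin.
years = pd.Series([0, 1, 2, 7, 8, 10])

# Bins are right-inclusive by default: (-1, 1], (1, 7], (7, 10]
binned = pd.cut(years, bins=[-1, 1, 7, 10], labels=['short', 'medium', 'long'])
print(binned.tolist())
```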


📚 Data Preprocessing (loan_status)

df11['loan_status_binary'] = np.where(df11['loan_status'].isin(['Current', 'Fully Paid']), 0, 1)
df12 = df11.drop('loan_status', axis=1, inplace=False)
df12

The dependent variable, loan_status, takes the following values:

Charged Off : debt written off
Current : being repaid normally
Default : loan not repaid
Does not meet the credit policy. Status:Charged Off : does not meet the credit policy, debt written off
Does not meet the credit policy. Status:Fully Paid : does not meet the credit policy, fully repaid
Fully Paid : fully repaid
In Grace Period : repayment in grace period
Late (16-30 days) : repayment 16 to 30 days late
Late (31-120 days) : repayment 31 to 120 days late

Of these, Current and Fully Paid were mapped to 0 (normal), and every other status was mapped to 1 (abnormal).

print(np.unique(df12['loan_status_binary'], return_counts=True))
# (array([0, 1]), array([1832108,  281536], dtype=int64))

The classes are also heavily imbalanced, so once all other preprocessing is finished the imbalance has to be addressed with SMOTE.
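To put a number on the imbalance, the class counts printed above can be turned into a share and a ratio. This is plain arithmetic on those two counts:

```python
# Class counts copied from the np.unique output above.
n_normal, n_bad = 1832108, 281536

bad_share = n_bad / (n_normal + n_bad)   # fraction of rows in class 1
neg_pos_ratio = n_normal / n_bad         # class 0 rows per class 1 row

print(f"class 1 share: {bad_share:.1%}")
print(f"negative/positive ratio: {neg_pos_ratio:.2f}")
```

A negative/positive ratio like this is also the value usually suggested for XGBoost's `scale_pos_weight` parameter, which would be an alternative to oversampling. That option is not used in this post.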


📚 Outlier Check

fig = plt.figure()

ax1 = fig.add_subplot(1, 4, 1)
ax2 = fig.add_subplot(1, 4, 2)
ax3 = fig.add_subplot(1, 4, 3)
ax4 = fig.add_subplot(1, 4, 4)

ax1.set_title('loan_amnt')
ax2.set_title('int_rate')
ax3.set_title('annual_inc')
ax4.set_title('dti')

ax1.boxplot(df12['loan_amnt'])
ax2.boxplot(df12['int_rate'])
ax3.boxplot(df12['annual_inc'])
ax4.boxplot(df12['dti'])

plt.tight_layout()

plt.show()

First, I drew boxplots to get an overall view of how the outliers are distributed.

pd.set_option('display.float_format', '{:.2f}'.format)

df12.describe().iloc[:,0:4]

Then I looked at the basic summary statistics to check the values in more detail.


📚 Outlier Handling (dti)

# .copy() so the column assignments below do not act on a view of df12
df13 = df12[(df12['dti'] >= 0) & (df12['dti'] < 100)].copy()

df13.describe().iloc[:,0:4]


dti is a debt-to-income ratio, so it must not be negative. Rows with a negative dti, or with a dti of 100 or more, were excluded.
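The same range filter can be expressed with `Series.between`. A small sketch on invented values; `inclusive='left'` keeps the lower bound and excludes the upper one, matching the `>= 0` and `< 100` conditions above.

```python
import pandas as pd

# Invented dti values around both boundaries.
toy = pd.DataFrame({'dti': [-1.0, 0.0, 35.2, 99.99, 100.0, 999.0]})

# Keeps rows with 0 <= dti < 100.
kept = toy[toy['dti'].between(0, 100, inclusive='left')]
print(kept['dti'].tolist())
```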


📚 Outlier Handling (annual_inc)

tmp = df13['annual_inc'].quantile(0.99) # 275000.0

# clip(): any value above upper is replaced with upper
df13['annual_inc'] = df13['annual_inc'].clip(upper=tmp)

df13.describe().iloc[:,0:4]


annual_inc is annual income, so even a very large value may be real data. In other words, the outliers may carry meaning, so rather than deleting them I clipped them: anything above the 99th percentile (the top 1% cutoff) is replaced with that cutoff value.
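The difference from deleting rows is easy to see on a toy series: clipping keeps every row and only pulls the extreme value down to the cap. The incomes below are invented.

```python
import pandas as pd

# Invented incomes with one extreme value.
income = pd.Series([30_000, 45_000, 60_000, 80_000, 9_000_000], dtype=float)

cap = income.quantile(0.99)         # 99th percentile of the toy data
clipped = income.clip(upper=cap)    # values above cap become cap

print(len(income), len(clipped))    # row count is unchanged
print(clipped.max() == cap)         # the largest value now sits at the cap
```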


📚 Outlier Handling (dti, annual_inc)

df13['annual_inc_log'] = np.log1p(df13['annual_inc'])
df13['dti_log'] = np.log1p(df13['dti'])

df14 = df13.drop(['annual_inc', 'dti'], axis=1, inplace=False)
df14


In addition, annual_inc and dti still contain many outliers relative to the bulk of typical values, so I applied a log transform (log1p) to both columns.
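As a quick illustration of what `np.log1p` does to a wide range of values, here is a sketch on a few invented numbers (275000 is the clipping cap noted in the code comment above). `log1p` computes log(1 + x), so it is defined at 0, and `np.expm1` reverses it.

```python
import numpy as np

# Invented values spanning several orders of magnitude.
values = np.array([0.0, 10.0, 1_000.0, 275_000.0])

logged = np.log1p(values)       # log(1 + x), defined at x = 0
restored = np.expm1(logged)     # inverse transform

print(logged.round(2))
print(np.allclose(restored, values))
```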


📚 Class Imbalance Handling (SMOTE)

x_data = df14.drop('loan_status_binary', axis=1, inplace=False).values
t_data = df14['loan_status_binary'].values

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
x_data_resampled, t_data_resampled = smote.fit_resample(x_data, t_data)

np.unique(t_data_resampled, return_counts=True) # (array([0, 1]), array([1830804, 1830804], dtype=int64))

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ์˜ ๋งˆ์ง€๋ง‰์œผ๋ก , ํ˜„์žฌ ์šฐ๋ฆฌ๊ฐ€ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋Š” ๋ถˆ๊ท ํ˜•์ด ๋งค์šฐ ์‹ฌํ•œ๋ฐ, ์ด๋ฅผ SMOTE ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ Over Sampling ์ž‘์—…์„ ํ•ด์ฃผ์—ˆ๋‹ค.


📚 Data Normalization

scaler = MinMaxScaler()
scaler.fit(x_data_resampled)
x_data_norm = scaler.transform(x_data_resampled)

Since outlier handling is already done at this point, I normalized the data with Min-Max Scaling.
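Min-Max Scaling maps each column to the range 0 to 1 with x' = (x - min) / (max - min). The NumPy sketch below applies that formula by hand to an invented array, which is what `MinMaxScaler` does per column with its default range.

```python
import numpy as np

# Invented feature matrix: two columns on very different scales.
x = np.array([[1000.0, 5.0],
              [2000.0, 10.0],
              [4000.0, 25.0]])

col_min = x.min(axis=0)
col_max = x.max(axis=0)

# Per-column formula: (x - min) / (max - min)
x_norm = (x - col_min) / (col_max - col_min)
print(x_norm)
```

One consequence is that the one-hot columns, which already contain only 0 and 1, pass through unchanged.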


📚 Data Split

x_data_train_norm, x_data_test_norm, t_data_train, t_data_test = \
train_test_split(x_data_norm,
                t_data_resampled,
                test_size=0.2,
                stratify=t_data_resampled)

The data was split into train and test sets for training.
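One thing to keep in mind about the ordering used here: SMOTE and the scaler run before the split, so the test set is also oversampled and contains synthetic rows. A commonly recommended variation is to split first and resample only the training part, so the test set keeps the original class distribution. The sketch below shows that ordering on invented data. To keep it dependency-light it uses `sklearn.utils.resample` (plain random oversampling) as a stand-in at the point where `SMOTE.fit_resample` would go; it is not the pipeline that produced the numbers in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))               # invented features
t = (rng.random(1000) < 0.13).astype(int)    # roughly 13% positives

# 1) Split first; stratify keeps the class ratio in both parts.
x_train, x_test, t_train, t_test = train_test_split(
    x, t, test_size=0.2, stratify=t, random_state=42)

# 2) Oversample the minority class in the training part only.
x_min, x_maj = x_train[t_train == 1], x_train[t_train == 0]
x_min_up = resample(x_min, replace=True, n_samples=len(x_maj), random_state=42)

x_train_bal = np.vstack([x_maj, x_min_up])
t_train_bal = np.concatenate([np.zeros(len(x_maj), dtype=int),
                              np.ones(len(x_min_up), dtype=int)])

print(np.bincount(t_train_bal))   # balanced training set
print(t_test.mean())              # test set keeps the original imbalance
```

Scores measured this way are usually lower than on a resampled test set, but they are closer to what the model would see on real, imbalanced loan data.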


📚 XGBoost Model Implementation and Training

xgbc = XGBClassifier()
xgbc.fit(x_data_train_norm, t_data_train)

📚 Model Evaluation

result = xgbc.predict(x_data_test_norm)
print(classification_report(t_data_test, result))
#               precision    recall  f1-score   support

#            0       0.77      0.84      0.80    366161
#            1       0.82      0.75      0.78    366161

#     accuracy                           0.79    732322
#    macro avg       0.79      0.79      0.79    732322
# weighted avg       0.79      0.79      0.79    732322

The evaluation gives a final F1 score of 0.79 (both macro and weighted average).
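As a sanity check, the F1 values in the report can be recomputed from the precision and recall columns, since F1 is the harmonic mean of the two. The inputs below are the rounded figures from the report, so the result is only accurate to about two decimals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the report above (rounded to 2 places).
f1_class0 = f1(0.77, 0.84)
f1_class1 = f1(0.82, 0.75)
macro_f1 = (f1_class0 + f1_class1) / 2   # unweighted mean over classes

print(round(f1_class0, 2), round(f1_class1, 2), round(macro_f1, 2))
```

Because both classes have the same support in the test set, the macro and weighted averages coincide.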


✨ Results

              precision    recall  f1-score   support

           0       0.77      0.84      0.80    366161
           1       0.82      0.75      0.78    366161

    accuracy                           0.79    732322
   macro avg       0.79      0.79      0.79    732322
weighted avg       0.79      0.79      0.79    732322

For this problem I only downloaded the dataset and trained on it myself, so there is no submission score.


💭 Reflections

The score came out lower than I expected... I think this is because I picked the features arbitrarily and did no hyperparameter tuning on the model. I should add those steps and train again later.


🔗 Problem Source

https://www.kaggle.com/datasets/wordsforthewise/lending-club

