Credit Card User Delinquency Prediction AI Competition

danbibibi · November 18, 2021

AI ML dacon competition

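1. Topic

Develop an algorithm that predicts the degree of payment delinquency of credit card users from their data

2. Background

Credit card companies calculate a credit score from the personal information and data submitted by applicants. They then use this score to predict the likelihood that an applicant will default on debt or fall behind on credit card payments.
Many financial companies now want to build financial services that use artificial intelligence (AI). Develop an AI algorithm that can predict the degree of a user's payment delinquency and uncover insights that can be proposed to the financial industry!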

3. Competition description

Predict the degree of credit card payment delinquency from users' personal information
(Evaluation metric: log_loss)
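
For reference, multiclass log loss is the mean negative log of the probability the model assigns to the true class. Below is a minimal NumPy sketch with made-up labels and probabilities; the helper name `multiclass_log_loss` is only illustrative and is not part of the competition code.

```python
import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability of the true class."""
    y_true = np.asarray(y_true, dtype=int)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    # Renormalise rows so they still sum to 1 after clipping.
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    # Pick the predicted probability of the true class for each row.
    true_class_prob = y_prob[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_prob)))

# Made-up example: three samples, three credit classes (0, 1, 2).
y_true = [0, 2, 1]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.3, 0.6],
          [0.2, 0.5, 0.3]]
score = multiclass_log_loss(y_true, y_prob)
print(round(score, 4))
```

A confident wrong prediction is punished heavily, which is why the metric rewards well-calibrated probabilities rather than hard class labels.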

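4. Data variable description

index
Sex: gender
Annual_income: annual income
income_type: income category ['Commercial associate', 'Working', 'State servant', 'Pensioner', 'Student']
Education: education level ['Higher education', 'Secondary / secondary special', 'Incomplete higher', 'Lower secondary', 'Academic degree']
family_type: marital status ['Married', 'Civil marriage', 'Separated', 'Single / not married', 'Widow']
house_type: housing type ['Municipal apartment', 'House / apartment', 'With parents', 'Co-op apartment', 'Rented apartment', 'Office apartment']
DAYS_BIRTH: date of birth (counted backward from the data collection date (0); that is, -1 means the person was born one day before the collection date)
working_day: employment start date (counted backward from the data collection date (0); that is, -1 means the person started working one day before the collection date)
FLAG_MOBIL: whether the user owns a mobile phone
work_phone: whether the user has a work phone
phone: whether the user has a phone
email: whether the user has an email address
occyp_type: occupation type
begin_month: month the credit card was issued (counted backward from the data collection date (0); that is, -1 means the card was issued one month before the collection date)
car_reality: ownership of a car and real estate [0: owns neither, 1: owns one of the two, 2: owns both]
credit: credit rating based on the user's credit card payment delinquency
=> the lower the value, the more creditworthy the credit card user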
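
Since the date-like fields count backward from the collection date, their values are zero or negative. A small sketch of how they could be turned into more readable quantities, using made-up rows and hypothetical output column names (`age_years`, `years_employed`, `months_since_issue`):

```python
import pandas as pd

# Made-up rows; input column names follow the variable list above.
df = pd.DataFrame({
    "DAYS_BIRTH": [-12005, -21474],
    "working_day": [-4542, -1134],
    "begin_month": [-6, -22],
})

# Values count backward from the collection date (0), so negate them.
df["age_years"] = (-df["DAYS_BIRTH"]) // 365       # approximate, ignores leap days
df["years_employed"] = (-df["working_day"]) / 365
df["months_since_issue"] = -df["begin_month"]
print(df[["age_years", "years_employed", "months_since_issue"]])
```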

5. EDA and data preprocessing

Import the required libraries

import os, random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import VotingClassifier, RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from tensorflow.keras.utils import to_categorical

# Fix the random seeds for reproducibility
SEED = 42  # assumed value; the post does not show how SEED was defined
random.seed(SEED)
np.random.seed(SEED)

Load the data

train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Prediction_of_default_rate/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Prediction_of_default_rate/test.csv')
submission = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Prediction_of_default_rate/sample_submission.csv')

Data preprocessing

train.info()

# Handle missing values
train['occyp_type'] = train['occyp_type'].fillna('Null')
test['occyp_type'] = test['occyp_type'].fillna('Null')

# Binary encoding (female: 0, male: 1)
train['gender'] = train['gender'].replace({'F':0, 'M':1})
test['gender'] = test['gender'].replace({'F':0, 'M':1})

# Drop uninformative variables
train.drop('FLAG_MOBIL', axis=1, inplace=True)
del train['index']

test.drop('FLAG_MOBIL', axis=1, inplace=True)
del test['index']

# one-hot encoding
train = pd.get_dummies(train)
test = pd.get_dummies(test)
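
Calling `pd.get_dummies` on train and test separately can produce different columns if a category shows up in only one of them. One possible safeguard, sketched here on made-up data (the `*_demo` and `*_enc` names are hypothetical), is to reindex the test frame to the training feature columns:

```python
import pandas as pd

train_demo = pd.DataFrame({"income_type": ["Working", "Student", "Pensioner"],
                           "credit": [0, 1, 2]})
test_demo = pd.DataFrame({"income_type": ["Working", "Working"]})

train_enc = pd.get_dummies(train_demo)
test_enc = pd.get_dummies(test_demo)

feature_cols = train_enc.columns.drop("credit")
# Add categories missing from test as all-zero columns, drop unseen extras,
# and keep the column order used for training.
test_enc = test_enc.reindex(columns=feature_cols, fill_value=0)
print(list(test_enc.columns))
```

This matters because the fitted models expect the test features to match the training columns.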

# Scale numeric features (0~1)
# Day/month counts are negative, so divide by the minimum; income is positive, so divide by the maximum
train['DAYS_BIRTH'] = train['DAYS_BIRTH'] / train['DAYS_BIRTH'].min()
test['DAYS_BIRTH'] = test['DAYS_BIRTH'] / test['DAYS_BIRTH'].min()
train['working_day'] = train['working_day'] / train['working_day'].min()
test['working_day'] = test['working_day'] / test['working_day'].min()
train['begin_month'] = train['begin_month'] / train['begin_month'].min()
test['begin_month'] = test['begin_month'] / test['begin_month'].min()

train['Annual_income'] = train['Annual_income'] / train['Annual_income'].max()
test['Annual_income'] = test['Annual_income'] / test['Annual_income'].max()
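
The scaling above divides each frame by its own minimum or maximum, so train and test can end up on slightly different scales. A common alternative, sketched here on made-up values and not what I did in the competition, is to take the statistic from train only and reuse it for test:

```python
import pandas as pd

train_demo = pd.DataFrame({"DAYS_BIRTH": [-20000, -10000, -15000],
                           "Annual_income": [100000.0, 400000.0, 200000.0]})
test_demo = pd.DataFrame({"DAYS_BIRTH": [-5000, -20000],
                          "Annual_income": [200000.0, 100000.0]})

# Statistics come from the training data only.
birth_min = train_demo["DAYS_BIRTH"].min()      # most negative value = oldest applicant
income_max = train_demo["Annual_income"].max()

for frame in (train_demo, test_demo):
    frame["DAYS_BIRTH"] = frame["DAYS_BIRTH"] / birth_min
    frame["Annual_income"] = frame["Annual_income"] / income_max

print(test_demo)
```

With this approach a test value outside the training range can fall slightly outside 0~1, which tree-based models generally tolerate.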

Split the train and test data into features and target

train_x = train.drop('credit', axis=1)
train_y = train[['credit']]
test_x = test
print(train_x.shape, train_y.shape, test_x.shape)

X_train, X_val, y_train, y_val = train_test_split(train_x, train_y,
                                                  stratify=train_y,
                                                  test_size=0.2,
                                                  random_state=SEED)
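
The `stratify` argument keeps the class proportions of `credit` the same in the training and validation folds. A small sketch with made-up, imbalanced labels shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 10% class 0, 20% class 1, 70% class 2.
y = np.array([0] * 10 + [1] * 20 + [2] * 70)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

# Class shares in the validation fold match the full data.
val_share = np.bincount(y_va) / len(y_va)
print(val_share)
```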

6. Training

Random Forest Classifier

model_RF = RandomForestClassifier(n_estimators=500, max_features=16, random_state=SEED)
model_RF.fit(X_train, y_train)
y_pred = model_RF.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), y_pred)}")
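
The score above one-hot encodes the validation labels with `to_categorical` before calling `log_loss`. As far as I know, scikit-learn's `log_loss` also accepts integer class labels directly, which would remove the TensorFlow dependency. A sketch on made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up validation labels and predicted probabilities for classes 0, 1, 2.
y_val_demo = np.array([0, 2, 1, 2])
proba = np.array([[0.6, 0.3, 0.1],
                  [0.1, 0.2, 0.7],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])

# One-hot targets, equivalent to what to_categorical produces.
one_hot = np.eye(3)[y_val_demo]

loss_from_labels = log_loss(y_val_demo, proba, labels=[0, 1, 2])
loss_from_one_hot = log_loss(one_hot, proba)
print(loss_from_labels, loss_from_one_hot)
```

Passing `labels` explicitly guards against a validation fold that happens to miss one class.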

Decision Tree Classifier

model_TREE = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=SEED)
model_TREE.fit(X_train, y_train)
y_pred = model_TREE.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), y_pred)}")

LGBM Classifier

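model_LGBM = LGBMClassifier(n_estimators=10000, num_leaves=50, subsample=0.8, learning_rate=0.01,
                            min_child_samples=60, max_depth=20)
evals = [(X_val, y_val)]
# Note: LightGBM 4.0 removed early_stopping_rounds and verbose from fit();
# newer versions take callbacks=[lightgbm.early_stopping(100), lightgbm.log_evaluation(0)] instead.
model_LGBM.fit(X_train, y_train, early_stopping_rounds=100,
               eval_set=evals, eval_metric='logloss', verbose=False)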
pred_y = model_LGBM.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), pred_y)}")

BaggingClassifier

model_BAG1 = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=600,
    max_samples=0.7,
    max_features=0.6, 
    bootstrap=True,
    n_jobs=-1 
)
model_BAG1.fit(X_train, y_train)
y_pred = model_BAG1.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), y_pred)}")

model_BAG2 = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=550,
    max_samples=0.7,
    max_features=0.6, 
    bootstrap=True,
    n_jobs=-1 
)
model_BAG2.fit(X_train, y_train)
y_pred = model_BAG2.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), y_pred)}")

VotingClassifier

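# VotingClassifier refits a clone of every estimator on X_train,
# so the early stopping used for model_LGBM above does not apply here.
model_VOTING = VotingClassifier(estimators=[('LGBM', model_LGBM),
                                            ('BAGClassifier1', model_BAG1),
                                            ('BAGClassifier2', model_BAG2),
                                            ('RF', model_RF),
                                            ('TREE', model_TREE)],
                                voting='soft')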
model_VOTING.fit(X_train, y_train)
pred_y = model_VOTING.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), pred_y)}")
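
With `voting='soft'` and no weights, the ensemble's class probabilities are the plain average of each member's `predict_proba` output. The sketch below checks this on synthetic data with two small models standing in for the five used above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the competition features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

voting = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
                ("tree", DecisionTreeClassifier(max_depth=2, random_state=0))],
    voting="soft",
)
voting.fit(X, y)  # fits fresh clones of both estimators

# Soft voting = unweighted mean of the members' class probabilities.
manual_mean = np.mean([est.predict_proba(X) for est in voting.estimators_], axis=0)
same = np.allclose(voting.predict_proba(X), manual_mean)
print(same)
```

Because the average treats all members equally, a weak member such as a depth-2 tree pulls the ensemble toward its own probabilities; the `weights` parameter is one way to adjust that.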

7. Test

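pred = model_VOTING.predict_proba(test)
# Fill every column after the first by position
# (.loc with an integer slice fails on string column labels in recent pandas)
submission.iloc[:, 1:] = pred
submission.to_csv('/content/drive/MyDrive/Colab Notebooks/Prediction_of_default_rate/7120501821010008.csv',index=False)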
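
Before writing the file, it can help to check that the prediction array fits the submission layout. A sketch with a made-up stand-in for sample_submission.csv; the column names `index`, `0`, `1`, `2` are my assumption about its layout:

```python
import numpy as np
import pandas as pd

# Made-up stand-in: an index column plus one probability column per class.
submission_demo = pd.DataFrame({"index": [0, 1],
                                "0": [0.0, 0.0], "1": [0.0, 0.0], "2": [0.0, 0.0]})
pred_demo = np.array([[0.1, 0.2, 0.7],
                      [0.5, 0.3, 0.2]])

# Sanity checks: shape matches and each row is a probability distribution.
assert pred_demo.shape == (len(submission_demo), submission_demo.shape[1] - 1)
assert np.allclose(pred_demo.sum(axis=1), 1.0)

submission_demo.iloc[:, 1:] = pred_demo  # positional: every column after the first
print(submission_demo)
```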

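8. Reflections

As this was my first competition I fell short in many ways, but I really enjoyed preprocessing the data and trying out models, so I am 100% willing to take part in more competitions! I hope to study hard and get a better result in the next one :)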

danbibibi

(Blog moved) https://danbibibi.tistory.com
