💠 AIchemist 5th Session | Kaggle Classification Transcription

yellowsubmarine372 · Oct 30, 2023

์บ๊ธ€ ์‚ฐํƒ„๋ฐ๋ฅด ๊ณ ๊ฐ ๋งŒ์กฑ ์˜ˆ์ธก

downloads

  • Santander Customer Satisfaction

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

111๊ฐœ์˜ ํ”ผ์ฒ˜๊ฐ€ floatํ˜•, 260๊ฐœ์˜ ํ”ผ์ฒ˜๊ฐ€ int ํ˜•์œผ๋กœ ๋ชจ๋“  ํ”ผ์ฒ˜๊ฐ€ ์ˆซ์žํ˜•์ด๋ฉฐ, Null ๊ฐ’์€ ์—†๋‹ค. ๋ ˆ์ด๋ธ” Target์—์„œ ๋Œ€๋ถ€๋ถ„์ด ๋งŒ์กฑ์ด๋ฉฐ ๋ถˆ๋งŒ์กฑ์ธ ๊ณ ๊ฐ์€ 4%์— ๋ถˆ๊ณผํ•œ๋‹ค.

min๊ฐ’์˜ -999999์„ ์ตœ๋‹ค๊ฐ’ 2๋กœ ๋ณ€ํ™˜ํ•˜๊ณ  ID ํ”ผ์ฒ˜๋Š” ๋‹จ์ˆœ ์‹๋ณ„์ž์ด๋ฏ€๋กœ ํ”ผ์ฒ˜๋ฅผ ๋“œ๋กญํ•œ๋‹ค.

  • Split off the test set from the training set, and create a validation set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_features, y_labels, test_size=0.2, random_state=0)

train_cnt = y_train.count()
test_cnt = y_test.count()
print('Training set shape: {0}, test set shape: {1}'.format(X_train.shape, X_test.shape))

print('Label value distribution in the training set')
print(y_train.value_counts()/train_cnt)
print('\nLabel value distribution in the test set')
print(y_test.value_counts()/test_cnt)

# Split off a validation set for XGBoost early stopping:
# split X_train and y_train again into training and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

XGBoost ๋ชจ๋ธ ํ•™์Šต๊ณผ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹

  • XGBoost ๋ชจ๋ธ ํ•™์Šต
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# n_estimators is set to 500; random_state is fixed so every run gives the same predictions
xgb_clf = XGBClassifier(n_estimators=500, learning_rate=0.05, random_state=156)

# Set the evaluation metric to auc and early stopping to 100 rounds, then train
xgb_clf.fit(X_tr, y_tr, early_stopping_rounds=100, eval_metric="auc", eval_set=[(X_tr, y_tr), (X_val, y_val)])

xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1])
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
[Output]
ROC AUC: 0.8429
  • Perform hyperparameter tuning with HyperOpt
  1. Create the objective function

Return the mean ROC-AUC from 3-fold cross validation (as -1 * ROC-AUC)

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

# fmin()์—์„œ ํ˜ธ์ถœ์‹œ search_space ๊ฐ’์œผ๋กœ XGBClassifier ๊ต์ฐจ ๊ฒ€์ฆ ํ•™์Šต . ํ›„ -1*roc_auc ํ‰๊ท  ๊ฐ’์„ ๋ฐ˜ํ™˜
# ๋ชฉ์  ํ•จ์ˆ˜
def objective_func(search_space):
    xgb_clf = XGBClassifier(n_estimators=100, max_depth=int(search_space['max_depth']),
                           min_child_weight= int(search_space['min_child_weight']),
                           colsample_bytree = search_space['colsample_bytree'],
                           learning_rate = search_space['learning_rate'])
    
    #3๊ฐœ k-fold ๋ฐฉ์‹์œผ๋กœ ํ‰๊ฐ€๋œ roc_auc ์ง€ํ‘œ๋ฅผ ๋‹ด๋Š” list
    roc_auc_list = []
    
    #3๊ฐœ k-fold ๋ฐฉ์‹ ์ ์šฉ 
    kf = KFold(n_splits=3)
    # Xtrain์„ ๋‹ค์‹œ ํ•™์Šต๊ณผ ๊ฒ€์ฆ์šฉ ๋ฐ์ดํ„ฐ๋กœ ๋ถ€๋‹
    for tr_index, val_index in kf.split(X_train):
        #kf.split(X_train)์œผ๋กœ ์ถ”์ถœ๋œ ํ•™์Šต๊ณผ ๊ฒ€์ฆ index ๊ฐ’์œผ๋กœ ํ•™์Šต๊ณผ ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์„ธํŠธ ๋ถ„๋ฆฌ
        X_tr, y_tr = X_train.iloc[tr_index], y_train.iloc[tr_index]
        X_val, y_val = X_train.iloc[val_index], y_train.iloc[val_index]
        
        #early stopping์€ 30ํšŒ๋กœ ์„ค์ •ํ•˜๊ณ  ์ถ”์ถœ๋œ ํ•™์Šต๊ณผ ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ๋กœ XGBClassifier ํ•™์Šต ์ˆ˜ํ–‰
        xgb_clf.fit(X_tr, y_tr, early_stopping_rounds=30, eval_metric="auc", eval_set=[(X_tr, y_tr), (X_val, y_val)])
        
        #1๋กœ ์˜ˆ์ธกํ•œ ํ™•๋ฅ ๊ฐ’ ์ถ”์ถœ ํ›„ roc auc ๊ณ„์‚ฐํ•˜๊ณ  ํ‰๊ท  roc auc ๊ณ„์‚ฐ์„ ์œ„ํ•ด List์— ๊ฒฐ๊ด๊ฐ’ ๋‹ด์Œ.
        score = roc_auc_score(y_val, xgb_clf.predict_proba(X_val)[:,1])
        roc_auc_list.append(score)
        
    # 3๊ฐœ k-fold๋กœ ๊ณ„์‚ฐ๋œ roc auc ๊ฐ’์˜ ํ‰๊ท ๊ฐ’์„ ๋ฐ˜ํ•œํ•˜๋˜,
    # HyperOpt๋Š” ๋ชฉ์ ํ•จ์ˆ˜์˜ ์ตœ์†Ÿ๊ฐ’์„ ์œ„ํ•œ ์ž…๋ ฅ๊ฐ’์„ ์ฐพ์œผ๋ฏ€๋กœ -1์„ ๊ณฑํ•œ ๋’ค ๋ฐ˜ํ™˜
    return -1*np.mean(roc_auc_list)
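The xgb_search_space passed to fmin() below is never defined in the post; a plausible definition consistent with the keys read inside objective_func (the ranges here are assumptions) would be:

from hyperopt import hp

# Search space keyed to match objective_func; the ranges are illustrative assumptions
xgb_search_space = {'max_depth': hp.quniform('max_depth', 5, 15, 1),
                    'min_child_weight': hp.quniform('min_child_weight', 1, 6, 1),
                    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 0.95),
                    'learning_rate': hp.uniform('learning_rate', 0.01, 0.2)}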
    
from hyperopt import fmin, tpe, Trials

trials = Trials()

# fmin() ํ•จ์ˆ˜๋ฅผ ํ˜ธ์ถœ. max_evals ์ง€์ •๋œ ํšŸ์ˆ˜๋งŒํผ ๋ฐ˜๋ณต ํ›„ ๋ชฉ์ ํ•จ์ˆ˜์˜ ์ตœ์†Ÿ๊ฐ’์„ ๊ฐ€์ง€๋Š” ์ตœ์ € ์ž…๋ ฅ๊ฐ’ ์ถ”์ถœ.
# ์ž…๋ ฅ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์ตœ์ € ์ž…๋ ฅ๊ฐ’์„ ์ถ”์ถœํ•˜๋Š” fmin
best = fmin(fn=objective_func,
           space = xgb_search_space,
           algo = tpe.suggest,
           max_evals = 50, #์ตœ๋Œ€ ๋ฐ˜๋ณต ํšŸ์ˆ˜๋ฅผ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค
           trials = trials, rstate = np.random.default_rng(seed=30))

print('best:', best)
#30๋ถ„ ์†Œ์š”...

๋ชฉ์  ๋ฐ˜ํ™˜ ์ตœ์†Ÿ๊ฐ’์„ ๊ฐ€์ง€๋Š” ์ตœ์ € ์ž…๋ ฅ๊ฐ’ ์œ ์ถ”

  1. ์ฐพ์€ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ XGBClassifier ์žฌํ•™์Šต
# Increase n_estimators to 500, then train and predict with the optimal hyperparameters found
xgb_clf = XGBClassifier(n_estimators=500, learning_rate=round(best['learning_rate'], 5), 
                       max_depth=int(best['max_depth']), # plug the optimal value from the best dict into each hyperparameter
                       min_child_weight=int(best['min_child_weight']),
                       colsample_bytree=round(best['colsample_bytree'], 5)
                       )

# Set the evaluation metric to auc and early stopping to 100 rounds, then train
xgb_clf.fit(X_tr, y_tr, early_stopping_rounds=100,
           eval_metric="auc", eval_set=[(X_tr, y_tr), (X_val, y_val)])

xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1])
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))

LightGBM ๋ชจ๋ธ ํ•™์Šต๊ณผ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹

  • LightGBM ๋ชจ๋ธ ํ•™์Šต
from lightgbm import LGBMClassifier

lgbm_clf = LGBMClassifier(n_estimators = 500)

eval_set = [(X_tr, y_tr), (X_val, y_val)]
lgbm_clf.fit(X_tr, y_tr, early_stopping_rounds=100, eval_metric="auc", eval_set= eval_set)

lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1])
print('ROC AUC:{0:.4f}'.format(lgbm_roc_score))
  • Perform hyperparameter tuning with HyperOpt
from hyperopt import hp

lgbm_search_space = {'num_leaves': hp.quniform('num_leaves', 32, 64, 1),
                    'max_depth': hp.quniform('max_depth', 100, 160, 1),
                    'min_child_samples': hp.quniform('min_child_samples', 60, 100, 1),
                    'subsample': hp.uniform('subsample', 0.7, 1),
                    'learning_rate': hp.uniform('learning_rate', 0.01, 0.2)}
               
def objective_func(search_space):
    lgbm_clf = LGBMClassifier(n_estimators=100,
                             num_leaves=int(search_space['num_leaves']),
                             max_depth=int(search_space['max_depth']),
                             min_child_samples=int(search_space['min_child_samples']),
                             subsample=search_space['subsample'],
                             learning_rate=search_space['learning_rate'])
    # List holding the roc_auc scores from the 3 folds
    roc_auc_list = []
    
    # Apply 3-fold cross validation
    kf = KFold(n_splits=3)
    # Split X_train again into training and validation data
    for tr_index, val_index in kf.split(X_train):
        # Use the train/validation indices from kf.split(X_train) to build the train and validation sets
        X_tr, y_tr = X_train.iloc[tr_index], y_train.iloc[tr_index]
        X_val, y_val = X_train.iloc[val_index], y_train.iloc[val_index]
        
        # Train the LGBMClassifier on the extracted train/validation data with early stopping set to 30 rounds
        lgbm_clf.fit(X_tr, y_tr, early_stopping_rounds=30, eval_metric="auc",
                    eval_set=[(X_tr, y_tr), (X_val, y_val)])
        
        # Take the predicted probability of class 1, compute roc auc, and store it in the list for averaging
        score = roc_auc_score(y_val, lgbm_clf.predict_proba(X_val)[:,1])
        roc_auc_list.append(score)
        
    # Return the mean of the roc_auc values from the 3 folds,
    # multiplied by -1 because HyperOpt searches for inputs that minimize the objective
    return -1*np.mean(roc_auc_list)

(ROC-AUC evaluation with the optimal hyperparameters is omitted; a sketch follows below.)
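A minimal sketch of that omitted step, mirroring the XGBoost retrain above (it assumes fmin() was run over lgbm_search_space in the same way, yielding a best dict; the rounding is an assumption):

lgbm_clf = LGBMClassifier(n_estimators=500,
                          num_leaves=int(best['num_leaves']),
                          max_depth=int(best['max_depth']),
                          min_child_samples=int(best['min_child_samples']),
                          subsample=round(best['subsample'], 5),
                          learning_rate=round(best['learning_rate'], 5))
lgbm_clf.fit(X_tr, y_tr, early_stopping_rounds=100, eval_metric="auc",
             eval_set=[(X_tr, y_tr), (X_val, y_val)])

lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1])
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))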

[Output]
ROC AUC: 0.8446

The result is similar to XGBoost's. Since LightGBM trains considerably faster than XGBoost, training with LightGBM is the better choice here.



์บ๊ธ€ ์‹ ์šฉ์นด๋“œ ์‚ฌ๊ธฐ ๊ฒ€์ถœ

downloads

  • Credit Card Fraud Detection

์–ธ๋” ์ƒ˜ํ”Œ๋ง๊ณผ ์˜ค๋ฒ„ ์ƒ˜ํ”Œ๋ง

์ด์ƒ ๋ฐ์ดํ„ฐ ์ „์ฒด ๋ฐ์ดํ„ฐ์˜ ํŒจํ„ด์—์„œ ๋ฒ—์–ด๋‚œ ์ด์ƒ ๊ฐ’์„ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ

์ด์ƒ ๋ ˆ์ด๋ธ”์„ ๊ฐ€์ง€๋Š” ๋ฐํ‹ฐํ„ฐ ๊ฑด์ˆ˜๋Š” ๋งค์šฐ ์ ๊ธฐ ๋•Œ๋ฌธ์— ์ œ๋Œ€๋กœ ๋‹ค์–‘ํ•œ ์œ ํ˜• ํ•™์Šต์„ ๋ชปํ•˜๊ณ  ์ •์ƒ ๋ ˆ์ด๋ธ”๋กœ ์น˜์šฐ์นœ ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•ด ์ œ๋Œ€๋กœ ๋œ ์ด์ƒ ๋ฐ์ดํ„ฐ ๊ฒ€์ถœ์ด ์–ด๋ ค์›Œ์ง
(ํ‰๊ท ์น˜๋กœ ํ›ˆ๋ จ๋œ ๋ชจ๋ธ์€ ์ด์ƒ ๋ฐ์ดํ„ฐ ๊ฒ€์ถœ์ด ์–ด๋ ต๋‹ค)

์–ธ๋” ์ƒ˜ํ”Œ๋ง ๋งŽ์€ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ ์€ ๋ฐ์ดํ„ฐ ์„ธํŠธ ์ˆ˜์ค€์œผ๋กœ ๊ฐ์†Œ์‹œํ‚ค๋Š” ๋ฐฉ์‹
๊ณผ๋„ํ•˜๊ฒŒ ์ •์ƒ๋ ˆ์ด๋ธ”๋กœ ํ•™์Šต/์˜ˆ์ธกํ•˜๋Š” ๋ถ€์ž‘์šฉ์„ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์ œ๋Œ€๋กœ ๋œ ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜์—†๋Š” ๋ฌธ์ œ๋„ ๋ฐœ์ƒ ๊ฐ€๋Šฅ์„ฑ
์˜ค๋ฒ„ ์ƒ˜ํ”Œ๋ง ์ด์ƒ ๋ฐ์ดํ„ฐ์™€ ๊ฐ™์ด ์ ์€ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ฆ์‹ํ•˜์—ฌ ํ•™์Šต์„ ์œ„ํ•œ ์ถฉ๋ถ„ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ํ™•๋ณดํ•˜๋Š” ๋ฐฉ๋ฒ•. ๊ณผ์ ํ•ฉ ๋˜๊ธฐ ๋•Œ๋ฌธ์— ์›๋ณธ ๋ฐ์ดํ„ฐ์˜ ํ”ผ์ฒ˜ ๊ฐ’๋“ค์„ ์•„์ฃผ ์•ฝ๊ฐ„๋งŒ ๋ณ€๊ฒฝํ•˜์—ฌ ์ฆ์‹.
SMOTE (์˜ค๋ฒ„์ƒ˜ํ”Œ๋ง) k-์ตœ๊ทผ์ ‘ ์ด์›ƒ์—์„œ ์ด์›ƒ๋“ค์˜ ์ฐจ์ด๋ฅผ ์ผ์ • ๊ฐ’์œผ๋กœ ๋งŒ๋“ค์–ด์„œ ๊ธฐ์กด ๋ฐ์ดํ„ฐ์™€ ์•ฝ๊ฐ„ ์ฐจ์ด๊ฐ€ ๋‚˜๋Š” ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๋“ค์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ์‹
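The interpolation SMOTE performs can be sketched in a few lines (an illustration of the idea, not the actual imbalanced-learn implementation):

import numpy as np

# For a minority sample x and one of its k nearest minority-class neighbors,
# SMOTE places a synthetic point at a random position on the segment between them.
x = np.array([1.0, 2.0])
neighbor = np.array([2.0, 3.5])
gap = np.random.rand()                # random scalar in [0, 1)
synthetic = x + gap * (neighbor - x)  # new sample slightly different from x
print(synthetic)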

๋ฐ์ดํ„ฐ ์ผ์ฐจ ๊ฐ€๊ณต ๋ฐ ๋ชจ๋ธ ํ•™์Šต/์˜ˆ์ธก/ํ‰๊ฐ€

  • Split into training and test sets

Extract the test set as 30% of the data using stratified sampling, so that the training and test sets have identical label value distributions.

Stratified Sampling
Plain random sampling may fail to reflect the class ratio of the data, so stratified sampling is recommended.
ex) StratifiedShuffleSplit(), StratifiedKFold(), train_test_split() (via its stratify parameter)

# ์‚ฌ์ „ ๋ฐ์ดํ„ฐ ๊ฐ€๊ณต ํ›„ ํ•™์Šต๊ณผ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ํ•จ์ˆ˜
def get_train_test_dataset(df=None):
    #์ธ์ž๋กœ ์ž…๋ ฅ๋œ DataFrame์˜ ์‚ฌ์ „ ๋ฐ์ดํ„ฐ ๊ฐ€๊ณต์ด ์™„๋ฃŒ๋œ ๋ณต์‚ฌ DataFrame ๋ฐ˜ํ™˜
    df_copy = get_preprocessed_df(df)
    #DataFrame์˜ ๋งจ ๋งˆ์ง€๋ง‰ ์นผ๋Ÿผ์ด ๋ ˆ์ด๋ธ”, ๋‚˜๋จธ์ง€ ํ”ผ์ฒ˜๋“ค
    X_features = df_copy.iloc[:, :-1]
    y_target = df_copy.iloc[:,-1]
    #train_test_split()์œผ๋กœ ํ•™์Šต๊ณผ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ๋ถ„ํ• . stratify=y_target์œผ๋กœ Stratified ๊ธฐ๋ฐ˜ ๋ถ„ํ• 
    X_train, X_test, y_train, y_test = \
    train_test_split(X_features, y_target, test_size =0.3, random_state =0, stratify = y_target)
    #ํ•™์Šต๊ณผ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ ๋ฐ˜ํ™˜
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = get_train_test_dataset(card_df)
  • Train logistic regression and LightGBM

1) Logistic regression

[Output]
Confusion matrix
[[85281    14]
 [   58    90]]
Accuracy: 0.9992, Precision: 0.8654, Recall: 0.6081, F1: 0.7143, AUC: 0.9703

2) LightGBM

[Output]
Confusion matrix
[[85281    14]
 [   58    90]]
Accuracy: 0.9992, Precision: 0.8654, Recall: 0.6081, F1: 0.7143, AUC: 0.9703

LightGBM์ด ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€๋ณด๋‹ค ๋†’์€ ์ˆ˜์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ„.

๋ฐ์ดํ„ฐ ๋ถ„ํฌ๋„ ๋ณ€ํ™˜ ํ›„ ๋ชจ๋ธ ํ•™์Šต/์˜ˆ์ธก/ํ‰๊ฐ€

Linear models such as logistic regression prefer that the values of important features keep a normal-distribution shape.

  • Transform Amount into a standard normal distribution

(Figure: distribution plot of the Amount feature)

from sklearn.preprocessing import StandardScaler
# ์‚ฌ์ดํ‚ท๋Ÿฐ์˜ StandardScaler๋ฅผ ์ด์šฉํ•˜์—ฌ ์ •๊ทœ๋ถ„ํฌ ํ˜•ํƒœ๋กœ Amount ํ”ผ์ฒ˜๊ฐ’ ๋ณ€ํ™˜ํ•˜๋Š” ๋กœ์ง์œผ๋กœ ์ˆ˜์ •. 
def get_preprocessed_df(df=None):
    df_copy = df.copy()
    scaler = StandardScaler()
    # ๋ฐ์ดํ„ฐ ์ •๊ทœํ™”
    amount_n = scaler.fit_transform(df_copy['Amount'].values.reshape(-1, 1))
    # ๋ณ€ํ™˜๋œ Amount๋ฅผ Amount_Scaled๋กœ ํ”ผ์ฒ˜๋ช… ๋ณ€๊ฒฝํ›„ DataFrame๋งจ ์•ž ์ปฌ๋Ÿผ์œผ๋กœ ์ž…๋ ฅ
    df_copy.insert(0, 'Amount_Scaled', amount_n)
    # ๊ธฐ์กด Time, Amount ํ”ผ์ฒ˜ ์‚ญ์ œ
    df_copy.drop(['Time','Amount'], axis=1, inplace=True)
    return df_copy
  • Log transformation with log1p()

๋กœ๊ทธ ๋ณ€ํ™˜์€ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ๋„๊ฐ€ ์‹ฌํ•˜๊ฒŒ ์™œ๊ณก๋˜์–ด ์žˆ์„ ๊ฒฝ์šฐ ์ ์šฉํ•˜๋Š” ์ค‘์š” ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜

def get_preprocessed_df(df=None):
    df_copy = df.copy()
    # ๋„˜ํŒŒ์ด์˜ log1p( )๋ฅผ ์ด์šฉํ•˜์—ฌ Amount๋ฅผ ๋กœ๊ทธ ๋ณ€ํ™˜ 
    amount_n = np.log1p(df_copy['Amount'])
    df_copy.insert(0, 'Amount_Scaled', amount_n)
    df_copy.drop(['Time','Amount'], axis=1, inplace=True)
    return df_copy
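As a quick illustration (not in the original post): log1p computes log(1 + x), so it maps 0 to 0 and compresses the long right tail of Amount, whereas a plain log would produce -inf for zero amounts.

import numpy as np

amounts = np.array([0.0, 1.0, 100.0, 10000.0])
print(np.log1p(amounts))  # [0.     0.6931 4.6151 9.2104] -- the tail is compressed
# np.log(amounts) would emit -inf for the 0.0 entry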

๋ ˆ์ด๋ธ”์ด ๊ทน๋„๋กœ ๋ถˆ๊ท ์ผํ•œ ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€๋Š” ๋ฐ์ดํ„ฐ ๋ณ€ํ™˜ ์‹œ ์•ฝ๊ฐ„์€ ๋ถˆ์•ˆ์ •ํ•œ ์„ฑ๋Šฅ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์คŒ

โœ… ์ด์ƒ์น˜ ๋ฐ์ดํ„ฐ ์ œ๊ฑฐ ํ›„ ๋ชจ๋ธ ํ•™์Šต/์˜ˆ์ธก/ํ‰๊ฐ€

  • IQR (Interquartile Range)

A technique that uses the spread between the quartile values.
The range from Q1 (the 25th percentile) to Q3 (the 75th percentile) is called the IQR.

  1. ์ด์ƒ์˜ ๋ฐ์ดํ„ฐ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ๋ฐ์ดํ„ฐ๋ฅผ ์ด์ƒ์น˜๋กœ ๊ฐ„์ฃผ

  2. The box plot is the chart that visualizes the IQR method (a sketch follows below)
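A box plot makes those bounds directly visible; a one-line sketch (using seaborn is an assumption here, and card_df is the fraud DataFrame used later in this section):

import seaborn as sns
import matplotlib.pyplot as plt

# The box spans Q1..Q3, the whiskers end at 1.5*IQR, and points beyond them are outliers
sns.boxplot(y=card_df['V14'])
plt.show()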

  • IQR ์ ์šฉ
  1. ๋จผ์ € ์–ด๋–ค ํ”ผ์ฒ˜์˜ ์ด์ƒ์น˜ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒ€์ถœํ•  ๊ฒƒ์ธ์ง€ ์„ ํƒ์ด ํ•„์š”
    ๊ฒฐ์ •๊ฐ’๊ณผ ๊ฐ€์žฅ ์ƒ๊ด€์„ฑ์ด ๋†’์€ ํ”ผ์ฒ˜๋“ค์„ ์œ„์ฃผ๋กœ ์ด์ƒ์น˜๋ฅผ ๊ฒ€์ถœํ•˜๋Š” ๊ฒƒ์ด ์ข‹์Œ

  2. V14์— ๋Œ€ํ•œ ์ด์ƒ์น˜๋ฅผ ์ฐพ์•„ ์ œ๊ฑฐ

import numpy as np

def get_outlier(df=None, column=None, weight=1.5):
    # Take only the `column` data for fraud cases, and get the 25th and 75th percentile points with np.percentile.
    fraud = df[df['Class']==1][column]
    quantile_25 = np.percentile(fraud.values, 25)
    quantile_75 = np.percentile(fraud.values, 75)
    # Compute the IQR, then multiply it by 1.5 to get the upper and lower bound points.
    iqr = quantile_75 - quantile_25
    iqr_weight = iqr * weight
    lowest_val = quantile_25 - iqr_weight
    highest_val = quantile_75 + iqr_weight
    # Values greater than the upper bound or smaller than the lower bound are outliers; return their DataFrame index.
    outlier_index = fraud[(fraud < lowest_val) | (fraud > highest_val)].index
    return outlier_index
  1. ์ด์ƒ์น˜๋ฅผ ์ถ”์ถœํ•˜๊ณ  ์‚ญ์ œํ•˜๋Š” ๋กœ์ง ์ถ”๊ฐ€
# Change get_preprocessed_df() so that, after the log transformation, it deletes the outlier rows of the V14 feature.
def get_preprocessed_df(df=None):
    df_copy = df.copy()
    amount_n = np.log1p(df_copy['Amount'])
    df_copy.insert(0, 'Amount_Scaled', amount_n)
    df_copy.drop(['Time','Amount'], axis=1, inplace=True)
    # ์ด์ƒ์น˜ ๋ฐ์ดํ„ฐ ์‚ญ์ œํ•˜๋Š” ๋กœ์ง ์ถ”๊ฐ€
    outlier_index = get_outlier(df=df_copy, column='V14', weight=1.5)
    df_copy.drop(outlier_index, axis=0, inplace=True)
    return df_copy

X_train, X_test, y_train, y_test = get_train_test_dataset(card_df)
print('### Logistic regression prediction performance ###')
get_model_train_eval(lr_clf, ftr_train=X_train, ftr_test=X_test, tgt_train=y_train, tgt_test=y_test)
print('### LightGBM prediction performance ###')
get_model_train_eval(lgbm_clf, ftr_train=X_train, ftr_test=X_test, tgt_train=y_train, tgt_test=y_test)

์ด์ƒ์น˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ œ๊ฑฐํ•œ ๋’ค, ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์™€ LightGBM ๋ชจ๋‘ ์˜ˆ์ธก ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ํ–ฅ์ƒ

Model Training/Prediction/Evaluation after Applying SMOTE Oversampling

Oversampling must be applied to the training dataset only.
If it is applied to the validation or test set, the validation/test is no longer valid.

from imblearn.over_sampling import SMOTE
import pandas as pd

smote = SMOTE(random_state=0)
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)
print('Training feature/label sets before SMOTE: ', X_train.shape, y_train.shape)
print('Training feature/label sets after SMOTE: ', X_train_over.shape, y_train_over.shape)
print('Label value distribution after SMOTE: \n', pd.Series(y_train_over).value_counts())
  • Recall/Precision

Because the model learns from so many Class=1 samples, it applies Class=1 predictions too aggressively on the actual test dataset, so precision drops sharply.

If raising the recall metric is the main goal of the machine learning model, applying SMOTE is a good choice.



Mechanisms of Action (MoA)

downloads

  • Mechanisms of Action

  • zip ํŒŒ์ผ ์—…๋กœ๋“œ ๋ฐ ์••์ถ• ํ•ด์ œ
import zipfile as zf
files = zf.ZipFile("lish-moa.zip", 'r')
files.extractall("MoA")
files.close()
  • Install category_encoders
pip install category_encoders

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

  • Framing as a binary classification problem
import numpy as np
import pandas as pd

SEED = 42
NFOLDS = 5
DATA_DIR = './MoA/'
np.random.seed(SEED)
  • Split into training and test sets
train = pd.read_csv(DATA_DIR + 'train_features.csv')
targets = pd.read_csv(DATA_DIR + 'train_targets_scored.csv')

test = pd.read_csv(DATA_DIR+'test_features.csv')
sub = pd.read_csv(DATA_DIR+'sample_submission.csv')

#drop id col
X = train.iloc[:, 1:].to_numpy()
X_test = test.iloc[:, 1:].to_numpy()
y = targets.iloc[:, 1:].to_numpy()
  • MultiOutputClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from category_encoders import CountEncoder

classifier = MultiOutputClassifier(XGBClassifier(tree_method='gpu_hist'))

clf = Pipeline([('encode', CountEncoder(cols=[0, 2])),
                ('classify', classifier)
               ])
  • Set the XGBClassifier hyperparameters (the classify__estimator__ prefix follows scikit-learn's nested-parameter convention: the Pipeline step name, then the parameter of the estimator wrapped by MultiOutputClassifier)
params = {'classify__estimator__colsample_bytree': 0.6522,
          'classify__estimator__gamma': 3.6975,
          'classify__estimator__learning_rate': 0.0503,
          'classify__estimator__max_delta_step': 2.0706,
          'classify__estimator__max_depth': 10,
          'classify__estimator__min_child_weight': 31.5800,
          'classify__estimator__n_estimators': 166,
          'classify__estimator__subsample': 0.8639
         }

_ = clf.set_params(**params)

Training with MultiOutputClassifier

  • K-fold cross validation
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

oof_preds = np.zeros(y.shape) # OOF: out-of-fold predictions
test_preds = np.zeros((test.shape[0], y.shape[1]))
oof_losses = []
kf = KFold(n_splits=NFOLDS)

# Apply 5-fold cross validation

for fn, (trn_idx, val_idx) in enumerate(kf.split(X, y)):
    print('Starting fold: ', fn)
    # Use the train/validation indices from kf.split to build the train and validation sets
    X_train, X_val = X[trn_idx], X[val_idx]
    y_train, y_val = y[trn_idx], y[val_idx]
    
    # drop rows where cp_type == ctl_vehicle (control baseline)
    ctl_mask = X_train[:, 0] == 'ctl_vehicle'
    X_train = X_train[~ctl_mask, :]
    y_train = y_train[~ctl_mask]

    # Train the MultiOutputClassifier
    clf.fit(X_train, y_train)
    
    val_preds = clf.predict_proba(X_val) # list of per-class prediction arrays
    val_preds = np.array(val_preds)[:,:,1].T # take the positive class

    oof_preds[val_idx] = val_preds
    
    loss = log_loss(np.ravel(y_val), np.ravel(val_preds))
    oof_losses.append(loss)
    preds = clf.predict_proba(X_test)
    preds = np.array(preds)[:,:,1].T # take the positive class
    test_preds += preds / NFOLDS
    
print(oof_losses)
print('Mean OOF loss across folds', np.mean(oof_losses))
print('STD OOF loss across folds', np.std(oof_losses))    
[Output]

Starting fold:  0
Starting fold:  1
Starting fold:  2
Starting fold:  3
Starting fold:  4
[0.0169781773377249, 0.01704491710861325, 0.016865153552168475, 0.01700900926983899, 0.01717882474706338]
Mean OOF loss across folds 0.017015216403081797
STD OOF loss across folds 0.00010156682747757948
  • Print the log loss
# set control train preds to 0
control_mask = train['cp_type']=='ctl_vehicle'
oof_preds[control_mask] = 0

print('OOF log loss: ', log_loss(np.ravel(y), np.ravel(oof_preds)))

Analysis of OOF preds

  • Create submission.csv
# set control test preds to 0
control_mask = test['cp_type'] == 'ctl_vehicle'

test_preds[control_mask] = 0

#create the submission file
sub.iloc[:,1:] = test_preds
sub.to_csv('submission.csv', index=False)
