This post was written with reference to the Zerobase Data School (제로베이스데이터스쿨) course materials.
import pandas as pd

# Kaggle credit-card fraud dataset: Time, V1-V28 (anonymized PCA components), Amount, Class
data_path = './data/creditcard.csv'
raw_data = pd.read_csv(data_path)
raw_data.head()
raw_data.columns
-----------------------------------------------------------------------------
Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 ... V21 V22 V23 V24 V25 \
0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539
1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170
2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642
3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376
4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010
V26 V27 V28 Amount Class
0 -0.189115 0.133558 -0.021053 149.62 0
1 0.125895 -0.008983 0.014724 2.69 0
2 -0.139097 -0.055353 -0.059752 378.66 0
3 -0.221929 0.062723 0.061458 123.50 0
4 0.502292 0.219422 0.215153 69.99 0
[5 rows x 31 columns]
import seaborn as sns
import matplotlib.pyplot as plt
count_plot = sns.countplot(data=raw_data, x='Class')
plt.title('Class Distributions \n (0 : No Fraud || 1 : Fraud)', fontsize=14)
for p in count_plot.patches:
    # annotate each bar with its count, slightly above the bar top
    height = p.get_height()
    count_plot.text(p.get_x() + p.get_width() / 2., height * 1.01, int(height), ha='center', size=9)
plt.show()
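The bar chart makes the extreme imbalance obvious: 284,315 normal transactions versus only 492 frauds, roughly 0.172% of the data.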
from sklearn.model_selection import train_test_split
X = raw_data.iloc[:, 1:-1]  # features (drop Time and Class)
y = raw_data.iloc[:, -1]    # labels (Class)
print('Data Shape : ', X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=13, stratify=y)
------------------------------------
Data Shape : (284807, 29) (284807,)
import numpy as np
print('Fraud Data Ratio of "y_train": ', np.unique(y_train, return_counts=True)[1][1] / len(y_train) * 100, '%')
------------------------------------------------------
Fraud Data Ratio of "y_train": 0.17254870488152324 %
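Because stratify=y was passed to train_test_split, both splits keep the original class ratio. A quick check, using only the variables defined above:

for name, labels in [('full', y), ('train', y_train), ('test', y_test)]:
    print(f'{name}: {labels.mean() * 100:.4f} % fraud')  # all three should be ~0.17 %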
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def get_clf_eval(y_test, pred):
    acc = accuracy_score(y_test, pred)
    pre = precision_score(y_test, pred)
    re = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    auc = roc_auc_score(y_test, pred)  # AUC from hard labels; see the note below
    return acc, pre, re, f1, auc
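Note that roc_auc_score is fed the hard 0/1 predictions here, which typically understates the true AUC; the curve-based number uses the positive-class probability. A probability-aware variant (a minimal sketch; get_clf_eval_proba is a hypothetical name):

def get_clf_eval_proba(y_test, pred, pred_proba):
    # same metrics, but AUC from probabilities (pred_proba = model.predict_proba(X_test)[:, 1])
    acc = accuracy_score(y_test, pred)
    pre = precision_score(y_test, pred)
    re = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    auc = roc_auc_score(y_test, pred_proba)
    return acc, pre, re, f1, auc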
from sklearn.metrics import confusion_matrix

def print_clf_eval(y_test, pred):
    confusion = confusion_matrix(y_test, pred)
    acc, pre, re, f1, auc = get_clf_eval(y_test, pred)
    print('=> confusion matrix')
    print(confusion)
    print('======================')
    print(f'Accuracy: {acc:.4f}, Precision: {pre:.4f}')
    print(f'Recall: {re:.4f}, F1: {f1:.4f}, AUC: {auc:.4f}')
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression(random_state=13, solver='liblinear')
lr_clf.fit(X_train, y_train)
lr_pred = lr_clf.predict(X_test)
print('Logistic Regression')
print_clf_eval(y_test, lr_pred)
---------------------------------------
Logistic Regression
=> confusion matrix
[[85284 11]
[ 60 88]]
======================
Accuracy: 0.9992, Precision: 0.8889
Recall: 0.5946, F1: 0.7126, AUC: 0.7972
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(random_state=13, max_depth=4)
dt_clf.fit(X_train, y_train)
dt_pred = dt_clf.predict(X_test)
print('Decision Tree')
print_clf_eval(y_test, dt_pred)
------------------------------------------
Decision Tree
=> confusion matrix
[[85281 14]
[ 42 106]]
======================
Accuracy: 0.9993, Precision: 0.8833
Recall: 0.7162, F1: 0.7910, AUC: 0.8580
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(random_state=13, n_jobs=6, n_estimators=100)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
print('Random Forest')
print_clf_eval(y_test, rf_pred)
------------------------------------------
Random Forest
=> confusion matrix
[[85290 5]
[ 38 110]]
======================
Accuracy: 0.9995, Precision: 0.9565
Recall: 0.7432, F1: 0.8365, AUC: 0.8716
from lightgbm import LGBMClassifier

lgbm_clf = LGBMClassifier(n_estimators=1000, num_leaves=64, n_jobs=6,
                          boost_from_average=False)  # boost_from_average=False is often advised for highly imbalanced labels
lgbm_clf.fit(X_train, y_train)
lgbm_pred = lgbm_clf.predict(X_test)
print('Light GBM')
print_clf_eval(y_test, lgbm_pred)
--------------------------------------------
Light GBM
=> confusion matrix
[[85289 6]
[ 34 114]]
======================
Accuracy: 0.9995, Precision: 0.9500
Recall: 0.7703, F1: 0.8507, AUC: 0.8851
- From the bank's point of view, high Recall matters
  - it wants to catch every fraudulent transaction
- From the user's point of view, high Precision matters
  - otherwise legitimate users get flagged as suspicious (annoying)
- The best case is scoring well on both, i.e. a high F1-score; the sketch below shows how the decision threshold trades one off against the other
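Where the balance lands is a thresholding decision: predict_proba gives a fraud probability, and lowering the cut-off trades precision for recall. A sketch of inspecting that trade-off with the already-fitted lr_clf (precision_recall_curve is standard scikit-learn):

from sklearn.metrics import precision_recall_curve

proba = lr_clf.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, proba)

plt.figure(figsize=(10, 5))
plt.plot(thresholds, precisions[:-1], label='precision')  # precision rises with the threshold
plt.plot(thresholds, recalls[:-1], label='recall')        # recall falls with the threshold
plt.xlabel('decision threshold'); plt.legend(); plt.grid()
plt.show()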
def get_result(model, X_train, y_train, X_test, y_test):
    # fit one model and return its metric tuple
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return get_clf_eval(y_test, pred)

def get_result_pd(models, model_names, X_train, y_train, X_test, y_test):
    # collect the metrics of several models into one DataFrame
    col_names = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
    tmp = []
    for model in models:
        tmp.append(get_result(model, X_train, y_train, X_test, y_test))
    return pd.DataFrame(tmp, columns=col_names, index=model_names)
import time
models = [lr_clf, dt_clf, rf_clf, lgbm_clf]
model_names = ['LogisticReg.', 'DecisionTree', 'RandomForest', 'LightGBM']
start_time = time.time()
results = get_result_pd(models, model_names, X_train, y_train, X_test, y_test)
print('Fit time : ', time.time() - start_time)
results
----------------------------------------------------------------
Fit time : 107.24186658859253
accuracy precision recall f1 roc_auc
LogisticReg. 0.999169 0.888889 0.594595 0.712551 0.797233
DecisionTree 0.999345 0.883333 0.716216 0.791045 0.858026
RandomForest 0.999497 0.956522 0.743243 0.836502 0.871592
LightGBM 0.999532 0.950000 0.770270 0.850746 0.885100
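All four models score above 99.9% accuracy simply because 99.8% of transactions are normal; with this kind of imbalance, recall and F1 are the numbers to watch, and they separate the models much more clearly.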
plt.figure(figsize=(10, 5))
sns.distplot(raw_data['Amount'])  # distplot is deprecated in recent seaborn; sns.histplot(..., kde=True) is the modern equivalent
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
amount_n = scaler.fit_transform(raw_data['Amount'].values.reshape(-1, 1))

raw_data_copy = raw_data.iloc[:, 1:-2].copy()  # drop Time, Amount, Class; .copy() avoids SettingWithCopyWarning
raw_data_copy['Amount_Scaled'] = amount_n
raw_data_copy.head()
-------------------------------------------------------------------------------
V1 V2 V3 V4 V5 V6 V7 \
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 V10 ... V20 V21 V22 V23 \
0 0.098698 0.363787 0.090794 ... 0.251412 -0.018307 0.277838 -0.110474
1 0.085102 -0.255425 -0.166974 ... -0.069083 -0.225775 -0.638672 0.101288
2 0.247676 -1.514654 0.207643 ... 0.524980 0.247998 0.771679 0.909412
3 0.377436 -1.387024 -0.054952 ... -0.208038 -0.108300 0.005274 -0.190321
4 -0.270533 0.817739 0.753074 ... 0.408542 -0.009431 0.798278 -0.137458
V24 V25 V26 V27 V28 Amount_Scaled
0 0.066928 0.128539 -0.189115 0.133558 -0.021053 0.244964
1 -0.339846 0.167170 0.125895 -0.008983 0.014724 -0.342475
2 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 1.160686
3 -1.175575 0.647376 -0.221929 0.062723 0.061458 0.140534
4 0.141267 -0.206010 0.502292 0.219422 0.215153 -0.073403
[5 rows x 29 columns]
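StandardScaler maps each value to z = (x - mean) / std. The fitted statistics can be read back from the scaler object and checked against row 0 (Amount 149.62 → Amount_Scaled 0.244964):

print('mean:', scaler.mean_[0], 'std:', scaler.scale_[0])
print('row 0:', (raw_data['Amount'].iloc[0] - scaler.mean_[0]) / scaler.scale_[0])  # ~0.244964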
# re-split, now using the feature set with Amount_Scaled
X_train, X_test, y_train, y_test = train_test_split(raw_data_copy, y, test_size=0.3, random_state=13, stratify=y)
models = [lr_clf, dt_clf, rf_clf, lgbm_clf]
model_names = ['LogisticReg.', 'DecisionTree', 'RandomForest', 'LightGBM']
start_time = time.time()
results = get_result_pd(models, model_names, X_train, y_train, X_test, y_test)
print('Fit time : ', time.time() - start_time)
results
-----------------------------------------------------------------
Fit time : 101.21563196182251
accuracy precision recall f1 roc_auc
LogisticReg. 0.999169 0.888889 0.594595 0.712551 0.797233
DecisionTree 0.999345 0.883333 0.716216 0.791045 0.858026
RandomForest 0.999497 0.956522 0.743243 0.836502 0.871592
LightGBM 0.999532 0.950000 0.770270 0.850746 0.885100
from sklearn.metrics import roc_curve

def draw_roc_curve(models, model_names, X_test, y_test):
    plt.figure(figsize=(10, 10))
    for model, name in zip(models, model_names):
        # ROC needs the positive-class probability, not the hard prediction
        pred = model.predict_proba(X_test)[:, 1]
        fpr, tpr, thresholds = roc_curve(y_test, pred)
        plt.plot(fpr, tpr, label=name)
    plt.plot([0, 1], [0, 1], 'k--', label='random guess')
    plt.title('ROC')
    plt.legend()
    plt.xlabel('FPR'); plt.ylabel('TPR')
    plt.grid()
    plt.show()
draw_roc_curve(models, model_names, X_test, y_test)
amount_log = np.log1p(raw_data['Amount'])  # log1p compresses the long right tail of Amount
raw_data_copy['Amount_Scaled'] = amount_log
raw_data_copy.head()
-------------------------------------------------------------------------------
V1 V2 V3 V4 V5 V6 V7 \
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 V10 ... V20 V21 V22 V23 \
0 0.098698 0.363787 0.090794 ... 0.251412 -0.018307 0.277838 -0.110474
1 0.085102 -0.255425 -0.166974 ... -0.069083 -0.225775 -0.638672 0.101288
2 0.247676 -1.514654 0.207643 ... 0.524980 0.247998 0.771679 0.909412
3 0.377436 -1.387024 -0.054952 ... -0.208038 -0.108300 0.005274 -0.190321
4 -0.270533 0.817739 0.753074 ... 0.408542 -0.009431 0.798278 -0.137458
V24 V25 V26 V27 V28 Amount_Scaled
0 0.066928 0.128539 -0.189115 0.133558 -0.021053 5.014760
1 -0.339846 0.167170 0.125895 -0.008983 0.014724 1.305626
2 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 5.939276
3 -1.175575 0.647376 -0.221929 0.062723 0.061458 4.824306
4 0.141267 -0.206010 0.502292 0.219422 0.215153 4.262539
[5 rows x 29 columns]
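As a sanity check, row 0 has Amount 149.62 and log1p(149.62) = ln(150.62) ≈ 5.0148, matching Amount_Scaled above. Unlike standardization, the log transform actually reshapes the heavily right-skewed distribution.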
plt.figure(figsize=(10, 5))
sns.distplot(raw_data_copy['Amount_Scaled'], color='r')
X_train, X_test, y_train, y_test = train_test_split(raw_data_copy, y, test_size=0.3, random_state=13, stratify=y)
start_time = time.time()
results = get_result_pd(models, model_names, X_train, y_train, X_test, y_test)
print('Fit time : ', time.time() - start_time)
results
-----------------------------------------------------------------
Fit time : 94.71821999549866
accuracy precision recall f1 roc_auc
LogisticReg. 0.999157 0.887755 0.587838 0.707317 0.793854
DecisionTree 0.999345 0.883333 0.716216 0.791045 0.858026
RandomForest 0.999497 0.956522 0.743243 0.836502 0.871592
LightGBM 0.999532 0.950000 0.770270 0.850746 0.885100
draw_roc_curve(models, model_names, X_test, y_test)
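Log-scaling barely moves the results: only logistic regression changes slightly, while the three tree-based models are identical to the earlier run, since tree splits are invariant to monotonic transforms of a feature.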
sns.boxplot(data=raw_data[['V13', 'V14', 'V15']])
def get_outlier(df=None, column=None, weight=1.5):
    # IQR rule, computed only over the fraud class (Class == 1)
    fraud = df[df['Class'] == 1][column]
    quantile_25 = np.percentile(fraud.values, 25)
    quantile_75 = np.percentile(fraud.values, 75)
    iqr = quantile_75 - quantile_25
    iqr_weight = iqr * weight
    lowest_val = quantile_25 - iqr_weight
    highest_val = quantile_75 + iqr_weight
    outlier_index = fraud[(fraud < lowest_val) | (fraud > highest_val)].index
    return outlier_index

get_outlier(df=raw_data, column='V14', weight=1.5)
---------------------------------------------------
Int64Index([8296, 8615, 9035, 9252], dtype='int64')
print('Before removal: ', raw_data_copy.shape)
outlier_index = get_outlier(df=raw_data, column='V14', weight=1.5)
raw_data_copy.drop(outlier_index, axis=0, inplace=True)
print('After removal: ', raw_data_copy.shape)
-----------------------------------------
Before removal:  (284807, 29)
After removal:  (284803, 29)
X = raw_data_copy
raw_data.drop(outlier_index, axis=0, inplace=True)  # drop the same rows from raw_data so X and y stay aligned
y = raw_data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=13, stratify=y)
start_time = time.time()
results = get_result_pd(models, model_names, X_train, y_train, X_test, y_test)
print('Fit time : ', time.time() - start_time)
results
----------------------------------------------------------------
Fit time : 122.6107828617096
accuracy precision recall f1 roc_auc
LogisticReg. 0.999286 0.904762 0.650685 0.756972 0.825284
DecisionTree 0.999427 0.870229 0.780822 0.823105 0.890311
RandomForest 0.999497 0.918699 0.773973 0.840149 0.886928
LightGBM 0.999602 0.951613 0.808219 0.874074 0.904074
draw_roc_curve(models, model_names, X_test, y_test)
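Removing just four extreme V14 rows lifts recall for every model; LightGBM, for instance, goes from 0.770 to 0.808 recall and from 0.851 to 0.874 F1.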
What is KNN (K-Nearest Neighbors)?
A method that decides the group of a new data point by looking at which group its K nearest neighboring points belong to.
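This matters here because SMOTE, used below, synthesizes new minority-class samples by interpolating between a minority point and its K nearest minority neighbors. A toy illustration of KNN itself (hypothetical 1-D data, not the fraud dataset):

from sklearn.neighbors import KNeighborsClassifier

# two well-separated groups on a line
X_toy = [[1], [2], [3], [10], [11], [12]]
y_toy = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_toy, y_toy)
print(knn.predict([[4]]))  # [0] -- the 3 nearest points (3, 2, 1) all belong to group 0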
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=13)
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)  # oversample the training split only
print('Before oversampling: ', np.unique(y_train, return_counts=True))
print('After oversampling: ', np.unique(y_train_over, return_counts=True))
---------------------------------------------------------------------------------
Before oversampling:  (array([0, 1], dtype=int64), array([199020, 342], dtype=int64))
After oversampling:  (array([0, 1], dtype=int64), array([199020, 199020], dtype=int64))
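SMOTE is fitted on the training split only; resampling the test set would leak synthetic points into the evaluation. Conceptually, each synthetic sample lies on the line segment between a minority point and one of its nearest minority-class neighbors. A minimal numpy sketch of that idea (smote_like_sample is a hypothetical helper, not the library's internals):

rng = np.random.default_rng(13)

def smote_like_sample(x_i, x_neighbor):
    # x_new = x_i + lam * (x_neighbor - x_i) for a random lam in [0, 1)
    lam = rng.random()
    return x_i + lam * (x_neighbor - x_i)

print(smote_like_sample(np.array([1.0, 2.0]), np.array([3.0, 4.0])))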
start_time = time.time()
results = get_result_pd(models, model_names, X_train_over, y_train_over, X_test, y_test)
print('Fit time : ', time.time() - start_time)
results
-----------------------------------------------------------------
Fit time : 219.28552627563477
accuracy precision recall f1 roc_auc
LogisticReg. 0.975609 0.059545 0.897260 0.111679 0.936502
DecisionTree 0.968984 0.046048 0.869863 0.087466 0.919509
RandomForest 0.999532 0.873239 0.849315 0.861111 0.924552
LightGBM 0.999532 0.873239 0.849315 0.861111 0.924552
draw_roc_curve(models, model_names, X_test, y_test)
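With the oversampled training data, ROC-AUC improves for all four models and recall jumps for every one of them, but precision collapses for logistic regression and the decision tree (a flood of false positives), while RandomForest and LightGBM keep precision near 0.87 and end up with the most balanced F1 scores.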