💠 AIchemist 2nd Session | Evaluation

yellowsubmarine372 · September 25, 2023


Evaluation 📈

A machine learning model's prediction performance can be evaluated in several ways. Evaluation metrics are generally divided into different kinds depending on whether the model does classification or regression.

๋ถ„๋ฅ˜์˜ ์„ฑ๋Šฅ ํ‰๊ฐ€ ์ง€ํ‘œ

🔹 Accuracy
🔹 Confusion Matrix
🔹 Precision
🔹 Recall
🔹 F1 Score
🔹 ROC AUC

01. Accuracy

์ •ํ™•๋„๋Š” ์‹ค์ œ ๋ฐ์ดํ„ฐ์—์„œ ์˜ˆ์ธก ๋ฐ์ดํ„ฐ๊ฐ€ ์–ผ๋งˆ๋‚˜ ๊ฐ™์€์ง€๋ฅผ ํŒ๋‹จํ•˜๋Š” ์ง€ํ‘œ

  • Creating a simple Classifier
import numpy as np
from sklearn.base import BaseEstimator

class MyDummyClassifier(BaseEstimator):
    # fit() learns nothing at all.
    def fit(self, X, y=None):
        pass
    # predict() simply predicts 0 if the Sex feature is 1, and 1 otherwise.
    def predict(self, X):
        pred = np.zeros((X.shape[0], 1))
        for i in range(X.shape[0]):
            if X['Sex'].iloc[i] == 1:
                pred[i] = 0
            else:
                pred[i] = 1
        return pred

ํƒ€์ดํƒ€๋‹‰ ์ƒ์กด์ž ์˜ˆ์ธก

## ์ƒ์„ฑ๋œ MyDummyClassifier๋ฅผ ์ด์šฉํ•ด ํƒ€์ดํƒ€๋‹‰ ์ƒ์กด์ž ์˜ˆ์ธก ์ˆ˜ํ–‰

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

## Null handling function
def fillna(df):
    df['Age'].fillna(df['Age'].mean(), inplace=True)
    df['Cabin'].fillna('N', inplace=True)
    df['Embarked'].fillna('N', inplace=True)
    df['Fare'].fillna(0, inplace=True)
    return df

## Drop features unnecessary for machine learning
def drop_features(df):
    df.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)
    return df

## Perform label encoding
def format_features(df):
    df['Cabin'] = df['Cabin'].str[:1]
    features = ['Cabin', 'Sex', 'Embarked']
    for feature in features:
        le = LabelEncoder()
        le.fit(df[feature])
        df[feature] = le.transform(df[feature])
    return df

## Call the preprocessing functions defined above
def transform_features(df):
    df = fillna(df)
    df = drop_features(df)
    df = format_features(df)
    return df
    
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Reload the original data, preprocess it, and split into train/test sets
titanic_df = pd.read_csv('titanic_train.csv')
y_titanic_df = titanic_df['Survived']
X_titanic_df = titanic_df.drop('Survived', axis = 1)
X_titanic_df = transform_features(X_titanic_df)
# The Survived column is set as the target
X_train, X_test, y_train, y_test = train_test_split(X_titanic_df, y_titanic_df, test_size=0.2, random_state=0)

# Train/predict/evaluate with the Dummy Classifier created above
myclf = MyDummyClassifier()
myclf.fit(X_train, y_train)

mypredictions = myclf.predict(X_test)
print('Dummy Classifier accuracy: {0:.4f}'.format(accuracy_score(y_test, mypredictions)))
[Output] 
Dummy Classifier accuracy: 0.7877

Even an algorithm this simplistic reaches a high accuracy, so a high accuracy score alone should not be trusted.
Moreover, when the class distribution is not uniform, accuracy can come out high for a worthless model, and that is the blind spot of the accuracy metric.
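
To see this blind spot in isolation, here is a minimal sketch using scikit-learn's built-in DummyClassifier on a made-up 90:10 label distribution:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 90% negative, 10% positive
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # the features are irrelevant to this classifier

# Always predicts the most frequent class (here, 0)
clf = DummyClassifier(strategy='most_frequent').fit(X, y)
pred = clf.predict(X)

# Prints 0.9: high accuracy despite never detecting a single positive
print(accuracy_score(y, pred))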

02. Confusion Matrix

By combining the TN, FP, FN, and TP values of the confusion matrix in various ways, we can see in what form a classification model's prediction errors occur.

True/False means whether the predicted value and the actual value are the same or different.
Negative/Positive means whether the predicted value is negative (0) or positive (1).

  • TN (True Negative)
    The predicted and actual class values are both Negative
  • FP (False Positive)
    The predicted value is Positive, the actual value is Negative
  • FN (False Negative)
    The predicted value is Negative, the actual value is Positive
  • TP (True Positive)
    The predicted and actual class values are both Positive
  • ์ •ํ™•๋„

์ •ํ™•๋„๋Š” ์˜ˆ์ธก๊ฐ’๊ณผ ์‹ค์ œ ๊ฐ’์ด ์–ผ๋งˆ๋‚˜ ๋™์ผํ•œ๊ฐ€์— ๋Œ€ํ•œ ๋น„์œจ๋งŒ์œผ๋กœ ๊ฒฐ์ •


์ค‘์ ์ ์œผ๋กœ ์ฐพ์•„์•ผ ํ•˜๋Š” ๋งค์šฐ ์ ์€ ์ˆ˜์˜ ๊ฒฐ๊ด๊ฐ’์— Positive๋ฅผ ์„ค์ •, ๊ทธ๋ ‡์ง€ ์•Š์€ ๊ฒฝ์šฐ Negative๋ฅผ ์„ค์ •
Positive ๋ฐ์ดํ„ฐ ๊ฑด์ˆ˜๊ฐ€ ์ž‘๊ธฐ ๋•Œ๋ฌธ์— Negative๋กœ ์˜ˆ์ธก ์ •ํ™•๋„๊ฐ€ ๋†’์•„์ง€๋Š” ๊ฒฝํ–ฅ์ด ๋ฐœ์ƒ โ†’ ๋น„๋Œ€์นญํ•œ ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ Positive์— ๋Œ€ํ•œ ์˜ˆ์ธก ์ •ํ™•๋„๋ฅผ ํŒ๋‹จํ•˜์ง€ ๋ชปํ•œ ์ฑ„ Negative์— ๋Œ€ํ•œ ์˜ˆ์ธก ์ •ํ™•๋„๋งŒ์œผ๋กœ๋„ ๋ถ„๋ฅ˜์˜ ์ •ํ™•๋„๊ฐ€ ๋งค์šฐ ๋†’๊ฒŒ ๋‚˜ํƒ€๋‚จ
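
For reference, scikit-learn's confusion_matrix() returns these four counts as a 2x2 array; e.g., for the dummy predictions from section 01:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, mypredictions))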

03. Precision and Recall

Preferred on imbalanced data sets (they overcome the limitation of accuracy described above).
Evaluation metrics that put more focus on prediction performance for the Positive data set.

  • Precision

์˜ˆ์ธก์„ Positive๋กœ ํ•œ ๋Œ€์ƒ ์ค‘์— ์˜ˆ์ธก๊ณผ ์‹ค์ œ ๊ฐ’์ด Positive๋กœ ์ผ์น˜ํ•œ ๋ฐ์ดํ„ฐ์˜ ๋น„์œจ
์‹ค์ œ Negative ์Œ์„ฑ์ธ ๋ฐ์ดํ„ฐ ์˜ˆ์ธก์„ Positive ์–‘์„ฑ์œผ๋กœ ์ž˜๋ชป ํŒ๋‹จํ•˜๊ฒŒ ๋˜๋ฉด ์—…๋ฌด์ƒ ํฐ ์˜ํ–ฅ์ด ๋ฐœ์ƒํ•˜๋Š” ๊ฒฝ์šฐ(์•„์˜ˆ ์˜ค๋ฅ˜ ์ฐจ๋‹จํ•˜๋ฉด ์•ˆ๋˜๋Š” ๊ฒฝ์šฐ)
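In formula form: Precision = TP / (FP + TP).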

e.g., a spam-mail detection model

  • Recall

The ratio of samples whose predicted and actual values are both Positive among all samples that are actually Positive.
Recall is the important metric when wrongly judging an actual Positive sample as Negative causes major business impact (cases where missing one is unacceptable).
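In formula form: Recall = TP / (FN + TP).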

e.g., a cancer diagnosis model

  • get_clf_eval()

Evaluates confusion_matrix, accuracy, precision, and recall in a single call

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

def get_clf_eval(y_test, pred):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    print('Confusion Matrix')
    print(confusion)
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}'.format(accuracy, precision, recall))
  • Evaluating with LogisticRegression

Run as binary classification (solver='liblinear')

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Reload the original data, preprocess it, and split into train/test sets
titanic_df = pd.read_csv('titanic_train.csv')
y_titanic_df = titanic_df['Survived']
X_titanic_df = titanic_df.drop('Survived', axis=1)
X_titanic_df = transform_features(X_titanic_df)

X_train, X_test, y_train, y_test = train_test_split(X_titanic_df, y_titanic_df, test_size = 0.20, random_state=11)

lr_clf = LogisticRegression(solver='liblinear')

lr_clf.fit(X_train, y_train)

pred = lr_clf.predict(X_test)
get_clf_eval(y_test, pred)
[Output]
Confusion Matrix
[[108  10]
 [ 14  47]]
Accuracy: 0.8659, Precision: 0.8246, Recall: 0.7705
  • ์ •๋ฐ€๋„/์žฌํ˜„์œจ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„

Precision and recall can be adjusted by tuning the threshold.
Because precision and recall are complementary metrics, forcibly raising one of them lowers the other, which is why this is called a trade-off.

In binary classification the threshold is usually set to an even 50%.

  • predict_proba()

Similar to the predict() method, but its return value is the predicted probabilities rather than class values

  • Printing the predicted probabilities and the predicted classes together
pred_proba = lr_clf.predict_proba(X_test)
pred = lr_clf.predict(X_test)
print('pred_proba() result shape: {0}'.format(pred_proba.shape))
print('First 3 samples of the pred_proba array \n:', pred_proba[:3])

# Concatenate the prediction probability array and the predicted class array to inspect both at a glance
pred_proba_result = np.concatenate([pred_proba, pred.reshape(-1, 1)], axis=1)
print('The larger of the two class probabilities is predicted as the class value \n', pred_proba_result[:3])
[Output]

pred_proba() result shape: (179, 2)
First 3 samples of the pred_proba array 
: [[0.44935225 0.55064775]
 [0.86335511 0.13664489]
 [0.86429643 0.13570357]]
The larger of the two class probabilities is predicted as the class value 
 [[0.44935225 0.55064775 1.        ]
 [0.86335511 0.13664489 0.        ]
 [0.86429643 0.13570357 0.        ]]

2๊ฐœ์˜ ์นผ๋Ÿผ ์ค‘ ๋” ํฐ ํ™•๋ฅ  ๊ฐ’์œผ๋กœ predict() ๋ฉ”์„œ๋“œ๊ฐ€ ์ตœ์ข… ์˜ˆ์ธก

  • Varying the threshold
from sklearn.preprocessing import Binarizer

# The threshold setting for the Binarizer, i.e. the classification decision threshold.
custom_threshold = 0.5

# Extract only the second column of the predict_proba() return value, i.e. the Positive-class column, and apply the Binarizer
pred_proba_1 = pred_proba[:,1].reshape(-1, 1)

binarizer = Binarizer(threshold = custom_threshold).fit(pred_proba_1)
custom_predict = binarizer.transform(pred_proba_1)

get_clf_eval(y_test, custom_predict)

# Set the Binarizer threshold to 0.4, i.e. lower the classification decision threshold from 0.5 to 0.4
custom_threshold = 0.4
pred_proba_1 = pred_proba[:,1].reshape(-1, 1)
binarizer = Binarizer(threshold=custom_threshold).fit(pred_proba_1)
custom_predict = binarizer.transform(pred_proba_1)  # the binarized predictions at the new threshold

get_clf_eval(y_test, custom_predict)
[Output]
Confusion Matrix
[[97 21]
 [11 50]]
Accuracy: 0.8212, Precision: 0.7042, Recall: 0.8197

Lowering the threshold raised recall and lowered precision.
The threshold is the probability cutoff that decides a Positive prediction: the lower it is set, the more samples get predicted as Positive.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.metrics import precision_recall_curve
%matplotlib inline

def precision_recall_curve_plot(y_test, pred_proba_c1):
    # Extract the threshold ndarray and the precision/recall ndarrays for those thresholds.
    precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba_c1)
    
    # Plot precision and recall against the threshold values on the X axis; draw precision as a dashed line
    plt.figure(figsize=(8, 6))
    threshold_boundary = thresholds.shape[0]
    plt.plot(thresholds, precisions[0:threshold_boundary], linestyle ='--', label='precision')
    plt.plot(thresholds, recalls[0:threshold_boundary], label='recall')
    
    # Change the X-axis (threshold) scale to 0.1 units
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1), 2))
    
    # Set the x/y axis labels, legend, and grid
    plt.xlabel('Threshold value')
    plt.ylabel('Precision and Recall value')
    plt.legend()
    plt.grid()
    plt.show()
    
precision_recall_curve_plot(y_test, lr_clf.predict_proba(X_test)[:,1])

์ž„๊ณ„๊ฐ’์ด ๋‚ฎ์„ ์ˆ˜๋ก ๋งŽ์€ ์ˆ˜์˜ ์–‘์„ฑ ์˜ˆ์ธก์œผ๋กœ ์ธํ•ด ์žฌํ˜„์œจ์ด ๊ทน๋„๋กœ ๋†’

  • ์ •๋ฐ€๋„์™€ ์žฌํ˜„์œจ์˜ ๋งน์ 

The two scores should be applied at a level where they can complement each other. Neither should be used merely as a means of pushing a single performance number higher.

04. F1 Score

The F1 score is a metric that combines precision and recall.
It takes a relatively high value when the model is not skewed toward either precision or recall.

from sklearn.metrics import f1_score
f1 = f1_score(y_test, pred)
print('F1 Score: {0:.4f}'.format(f1))

# Now add the F1 score to the output as well

def get_clf_eval(y_test, pred):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    # Add the F1 score
    f1 = f1_score(y_test, pred)
    print('Confusion Matrix')
    print(confusion)
    # Add the F1 score to the printout
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, F1:{3:.4f}'.format(accuracy, precision, recall, f1))
    
thresholds = [0.4, 0.45, 0.50, 0.55, 0.60]
pred_proba = lr_clf.predict_proba(X_test)
get_eval_by_threshold(y_test, pred_proba[:,1].reshape(-1, 1), thresholds)
[Output]
Threshold: 0.4
Confusion Matrix
[[97 21]
 [11 50]]
Accuracy: 0.8212, Precision: 0.7042, Recall: 0.8197, F1:0.7576
Threshold: 0.45
Confusion Matrix
[[105  13]
 [ 13  48]]
Accuracy: 0.8547, Precision: 0.7869, Recall: 0.7869, F1:0.7869
Threshold: 0.5
Confusion Matrix
[[108  10]
 [ 14  47]]
Accuracy: 0.8659, Precision: 0.8246, Recall: 0.7705, F1:0.7966
Threshold: 0.55
Confusion Matrix
[[111   7]
 [ 16  45]]
Accuracy: 0.8715, Precision: 0.8654, Recall: 0.7377, F1:0.7965
Threshold: 0.6
Confusion Matrix
[[113   5]
 [ 17  44]]
Accuracy: 0.8771, Precision: 0.8980, Recall: 0.7213, F1:0.8000

05. ROC Curve and AUC

The ROC curve and the AUC score based on it are important metrics for measuring binary classification performance.
The ROC curve shows how the TPR (True Positive Rate) changes as the FPR (False Positive Rate) changes.

Sensitivity (TPR) indicates how accurately actual Positive samples are predicted
Specificity (TNR) indicates how accurately actual Negative samples are predicted

The ROC curve puts FPR on the X axis and TPR on the Y axis

FPR = 0 → achieved by setting the threshold to 1 (nothing is predicted Positive)
FPR = 1 → achieved by setting the threshold to 0 (everything is predicted Positive)

from sklearn.metrics import roc_curve

def roc_curve_plot(y_test, pred_proba_c1):
    # Get the FPR and TPR values for each threshold
    fprs, tprs, thresholds = roc_curve(y_test, pred_proba_c1)
    # Draw the ROC curve
    plt.plot(fprs, tprs, label='ROC')
    # Draw the diagonal line through the middle
    plt.plot([0,1], [0,1], 'k--', label='Random')
    
    # Change the X-axis (FPR) scale to 0.1 units, set the X/Y axis names, etc.
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1),2))
    plt.xlim(0,1)
    plt.ylim(0,1)
    plt.xlabel('FPR(1 - Specificity)')
    plt.ylabel('TPR(Recall)')
    plt.legend()
    
roc_curve_plot(y_test, pred_proba[:,1])

๊ฐ€์šด๋ฐ ์ง์„ ์—์„œ ๋ฉ€์–ด์ง€๊ณ  ์™ผ์ชฝ ์ƒ๋‹จ ๋ชจ์„œ๋ฆฌ ์ชฝ์œผ๋กœ ๊ฐ€ํŒŒ๋ฅด๊ฒŒ ์ด๋™ํ•  ์ˆ˜๋ก ์ง์‚ฌ๊ฐํ˜•์— ๊ฐ€๊นŒ์šด ๊ณก์„ ์ด ๋˜์–ด ๋ฉด์ ์ด 1์— ๊ฐ€๊นŒ์›Œ์ง€๋Š” ์ข‹์€ ROC AUC ์„ฑ๋Šฅ ์ˆ˜์น˜๋ฅผ ์–ป๊ฒŒ ๋จ

06. Pima Indians Diabetes Prediction Hands-on
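
The loading step is not shown in this note; presumably the Pima Indians Diabetes CSV has already been read into diabetes_data, something like the following (the file name is an assumption):

import pandas as pd

# Hypothetical file name; the note does not show the actual path
diabetes_data = pd.read_csv('diabetes.csv')
print(diabetes_data['Outcome'].value_counts())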

  • ์ •ํ™•๋„ ์žฌํ˜„์œจ ๋ณด์ •
# Extract the feature data set X and the label data set y
# The last column, Outcome, holds the label; extract it with column position -1
X = diabetes_data.iloc[:, :-1]
y = diabetes_data.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=156, stratify=y)
# stratify=y keeps the class ratio of y identical in the train and test splits

# Train, predict, and evaluate with logistic regression
lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
pred_proba = lr_clf.predict_proba(X_test)[:,1]

get_clf_eval(y_test, pred, pred_proba)
[Output]
Confusion Matrix
[[87 13]
 [22 32]]
Accuracy: 0.7727, Precision: 0.7111, Recall: 0.5926, F1:0.6465, AUC:0.8083

Since 65% of the data is Negative, accuracy looks acceptable while recall (0.5926) is poor; because failing to detect actual diabetes cases (False Negatives) is the costly error here, recall is the metric that needs to be raised.

  • ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

Many features contain zeros, and a glucose reading of 0 is not a valid value.
→ Correction needed

#0๊ฐ’์„ ๊ฒ€์‚ฌํ•  ํ”ผ์ฒ˜๋ช… ๋ฆฌ์ŠคํŠธ
zero_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

#์ „์ฒด ๋ฐ์ดํ„ฐ ๊ฑด์ˆ˜
total_count = diabetes_data['Glucose'].count()

# ํ”ผ์ฒ˜๋ณ„๋กœ ๋ฐ˜๋ณตํ•˜๋ฉด์„œ ๋ฐ์ดํ„ฐ ๊ฐ’์ด 0์ธ ๋ฐ์ดํ„ฐ ๊ฑด์ˆ˜๋ฅผ ์ถ”์ถœํ•˜๊ณ , ํผ์„ผํŠธ ๊ณ„์‚ฐ
for feature in zero_features:
    zero_count = diabetes_data[diabetes_data[feature]==0][feature].count()
    print('{0} 0 ๊ฑด์ˆ˜๋Š” {1}, ํผ์„ผํŠธ๋Š” {2: 2f} %'.format(feature, zero_count, 100*zero_count/total_count))
    
   
[Output]

Glucose zero count: 5, percentage:  0.651042 %
BloodPressure zero count: 35, percentage:  4.557292 %
SkinThickness zero count: 227, percentage:  29.557292 %
Insulin zero count: 374, percentage:  48.697917 %
BMI zero count: 11, percentage:  1.432292 %

0 ๊ฐ’์„ ๊ฐ€์ง„ feature๊ฐ€ ๋งŽ๊ธฐ ๋•Œ๋ฌธ์— 0์„ ํ‰๊ท ๊ฐ’์œผ๋กœ ๋Œ€์ฒด

# For the individual features stored in the zero_features list, replace 0 values with each feature's mean
mean_zero_features = diabetes_data[zero_features].mean()
diabetes_data[zero_features] = diabetes_data[zero_features].replace(0, mean_zero_features)
  • ๋ชจ๋ธ ํ›ˆ๋ จ

Apply scaling to the numeric data
Split into train/test data sets
Check the evaluation metrics with logistic regression

X = diabetes_data.iloc[:, :-1]
y = diabetes_data.iloc[:, -1]

# Apply scaling to the whole feature data set at once with the StandardScaler class
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=156, stratify=y)

# Train, predict, and evaluate with logistic regression
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
pred_proba = lr_clf.predict_proba(X_test)[:,1]

get_clf_eval(y_test, pred, pred_proba)
  • Finding the threshold
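The numbered summary below was presumably produced by sweeping candidate thresholds with the get_eval_by_threshold() helper from section 04; a sketch of the call, with the threshold list inferred from the results:

thresholds = [0.3, 0.33, 0.36, 0.39, 0.42, 0.45, 0.48, 0.50]
get_eval_by_threshold(y_test, lr_clf.predict_proba(X_test)[:, 1].reshape(-1, 1), thresholds)
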
1. Threshold: 0.3
Accuracy: 0.7143, Precision: 0.5658, Recall: 0.7963

2. Threshold: 0.33
Accuracy: 0.7403, Precision: 0.6000, Recall: 0.7778

3. Threshold: 0.36
Accuracy: 0.7468, Precision: 0.6190, Recall: 0.7222

4. Threshold: 0.39
Accuracy: 0.7532, Precision: 0.6333, Recall: 0.7037

5. Threshold: 0.42
Accuracy: 0.7792, Precision: 0.6923, Recall: 0.6667

6. Threshold: 0.45
Accuracy: 0.7857, Precision: 0.7059, Recall: 0.6667

7. Threshold: 0.48
Accuracy: 0.7987, Precision: 0.7447, Recall: 0.6481

8. Threshold: 0.5
Accuracy: 0.7987, Precision: 0.7674, Recall: 0.6111

A threshold of 0.48 is a good choice: it maintains the overall evaluation metrics while slightly improving recall.


  • Deriving the predicted class values with the threshold set to 0.48
#์ž„๊ณ—๊ฐ’์„ 0.48๋กœ ์„ค์ •ํ•œ Binarizer ์ƒ์„ฑ 
binarizer = Binarizer(threshold=0.48)

#์œ„์—์„œ ๊ตฌํ•œ lr_clf์˜ predict_proba() ์˜ˆ์ธก ํ™•๋ฅ  array์—์„œ 1์— ํ•ด๋‹นํ•˜๋Š” ์นผ๋Ÿผ๊ฐ’์„ Binarizer ๋ณ€ํ™˜
pred_th_048 = binarizer.fit_transform(pred_proba[:,1].reshape(-1,1))

get_clf_eval(y_test, pred_th_048, pred_proba[:,1])
[Output]
Confusion Matrix
[[88 12]
 [19 35]]
Accuracy: 0.7987, Precision: 0.7447, Recall: 0.6481, F1:0.6931, AUC:0.8433