[DS] Classification

rkqhwkrn · December 30, 2022


Classification

Classification Model

Decision tree
Random forest
Naive Bayes

Classification Metrics

Confusion Matrix
Accuracy
Precision, Recall, F1-score
ROC-AUC

Setting

pandas default settings

import pandas as pd
import numpy as np
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print('pandas default set : pd.options.display.max_rows', pd.options.display.max_rows)
print("pandas default set : pd.get_option('display.max_rows')", pd.get_option('display.max_rows'), end='\n\n')

print('pandas default set : pd.options.display.max_columns', pd.options.display.max_columns)
print("pandas default set : pd.get_option('display.max_columns')", pd.get_option('display.max_columns'))

Google Colab setup

from google.colab import drive
drive.mount('/content/drive')

path = '/content/drive/MyDrive/'

data = pd.read_csv(path + '[file location]/[file name].[file extension]')
data.head()  # preview the DataFrame
data.info()  # column names, non-null counts, dtypes
display(data.describe(), data.isna().sum())  # summary statistics (count, mean, std, ...) and NaN count per column

Checking missing values

data.isna().sum()[data.isna().sum()>0]

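Building meta data

What is meta data?
Data about data
Used to describe the data
Used to find data quickly

Creating meta data

meta_data = pd.DataFrame(index=data.columns)
meta_data['Dtype'] = data.dtypes
meta_data['NaN'] = data.isnull().sum()
# numeric_only=True: recent pandas versions may raise on object columns otherwise; non-numeric rows stay NaN
meta_data['mean'] = data.mean(numeric_only=True)
meta_data['max'] = data.max(numeric_only=True)
meta_data['min'] = data.min(numeric_only=True)
meta_data['unique'] = [data[i].unique() for i in data.columns]  # which distinct values each column holds
meta_data['nunique'] = [data[i].nunique() for i in data.columns]  # number of distinct values per column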

Exploring individual columns

np.sort(data['YrSold'].unique())  # sorted unique values of the YrSold column
data['YrSold'].max()
data['YrSold'].min()

Handling missing values (drop)

data.shape  # (rows, columns) before the drop
na_feat = ['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']  # features to drop
data.drop(na_feat, inplace=True, axis=1)
data.shape  # (rows, columns) after the drop

Comparing missing values before and after the drop

data.isna().sum()[data.isna().sum()>0]  # columns that contain missing values, with their counts

Data preprocessing

Selecting the target feature

y = data['SalePrice']

data.drop(['Id','SalePrice'],axis=1,inplace=True)

for i in data.columns[data.isna().sum()>0]:
    if data[i].dtype == 'object':
        data[i] = data[i].fillna('None')  # object columns: fill missing values with 'None'
    else:
        data[i] = data[i].fillna(-1)  # int/float columns: fill missing values with -1

print('Total nan :', data.isna().sum().sum())  # check that no NaN remains

Scaling

MinMaxScaler: rescales each feature to the range [0, 1]
$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$

numerical_f = list(data.columns[data.dtypes != 'object'])  # numeric (int/float) features
numerical_f.remove('YrSold')  # exclude YrSold from the features to scale
print(numerical_f)

data['YrSold'] = data['YrSold'].astype('object')  # treat YrSold as a categorical feature

Using scikit-learn's MinMaxScaler

from sklearn.preprocessing import MinMaxScaler  
scaler = MinMaxScaler()  
  
data[numerical_f] = scaler.fit_transform(data[numerical_f])
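
To make the min-max formula concrete, here is a minimal pure-Python sketch of the same rescaling on one column. The function name `min_max_scale` and the sample values are made up for illustration, and mapping a constant column to 0.0 is an assumption of this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: mapping everything to 0.0 is an assumption of this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical values, not taken from the dataset.
lot_area = [50.0, 100.0, 150.0, 250.0]
scaled = min_max_scale(lot_area)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, so a single outlier can squeeze the remaining values into a narrow band.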


Labeling

Encoding categorical variables as numbers
Why? Most machine learning models cannot handle categorical (string) values directly

categorical_f = list(data.columns[data.dtypes == 'object'])  # categorical features

Using scikit-learn's LabelEncoder

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

for i in categorical_f:
    data[i] = encoder.fit_transform(data[i])  # the encoder is refit for each column
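
As a rough illustration of what label encoding does, the sketch below maps each distinct value to an integer in sorted order, which is also how LabelEncoder orders its classes. The helper `label_encode` and the sample column are hypothetical.

```python
def label_encode(values):
    """Map each distinct value to an integer code, using sorted order."""
    classes = sorted(set(values))
    code_of = {c: i for i, c in enumerate(classes)}
    return [code_of[v] for v in values], classes

# Hypothetical column values for illustration.
garage_type = ['Attchd', 'None', 'Detchd', 'Attchd']
codes, classes = label_encode(garage_type)
print(classes)  # ['Attchd', 'Detchd', 'None']
print(codes)    # [0, 2, 1, 0]
```

The integer codes imply an order that the categories may not have. Tree-based models usually tolerate this, but it is worth keeping in mind for other model types.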

Target feature labeling

The target feature can also be binned into intervals.

encoded_y = pd.cut(y,5,labels=[1,2,3,4,5])
encoded_y.value_counts()
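
Passing an integer to pd.cut splits the value range into that many equal-width intervals. The sketch below approximates the idea in plain Python. The helper `equal_width_bins` and the prices are hypothetical, and edge handling is simplified: pandas uses right-closed intervals, so a value lying exactly on an edge can land in the neighboring bin.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin label 1..n_bins using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: a single bin is an assumption of this sketch.
        return [1 for _ in values]
    labels = []
    for v in values:
        # Index of the interval containing v; the maximum goes into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(idx + 1)
    return labels

# Hypothetical sale prices for illustration.
prices = [100, 150, 290, 310, 500]
binned = equal_width_bins(prices, 5)
print(binned)  # [1, 1, 3, 3, 5]
```

With a skewed target such as prices, equal-width bins can leave some classes nearly empty, which is what the value counts make visible.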

Splitting the dataset

Using scikit-learn's train_test_split

from sklearn.model_selection import train_test_split

print(data.shape, encoded_y.shape)  # shapes of the preprocessed DataFrame and the target feature

x_train, x_test, y_train, y_test = train_test_split(data, encoded_y, test_size=0.2, random_state=0)  # random_state: seed for the random shuffle

print('train data: ', x_train.shape, y_train.shape)
print('test data: ', x_test.shape, y_test.shape)
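
The role of the seed can be sketched without scikit-learn: shuffle the row indices with a fixed seed, then cut off the test share. This is only an illustration of the idea, not how train_test_split is implemented. The helper `simple_train_test_split` is hypothetical.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then split off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))  # stand-in for 10 data rows
train, test = simple_train_test_split(rows, test_size=0.2, seed=0)
print(len(train), len(test))  # 8 2

# The same seed reproduces the same split.
print(simple_train_test_split(rows, seed=0) == (train, test))  # True
```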

Classification

Decision tree

Handles both classification and regression
A model that splits the data by asking a sequence of questions
Chooses the question that best separates the data, scored by entropy, gain ratio, gini, etc.
Using scikit-learn's DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=0)  # defaults: max_depth=None, criterion='gini', random_state=None

dt.fit(x_train, y_train)

dt_pred = dt.predict(x_test)  # predicted classes
dt_pred

dt.predict_proba(x_test)  # probability for each class
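
The impurity scores mentioned in the notes above can be computed by hand. The sketch below implements gini impurity and entropy and compares two candidate questions by the size-weighted impurity of the resulting children. The function names and labels are made up for illustration.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_impurity(left, right, impurity=gini):
    """Impurity after a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

# Hypothetical class labels at one node.
parent = [1, 1, 1, 2, 2, 2]
print(gini(parent))     # 0.5
print(entropy(parent))  # 1.0

# A question that separates the classes perfectly leaves no impurity.
perfect = weighted_impurity([1, 1, 1], [2, 2, 2])
print(perfect)  # 0.0
# A question that barely helps leaves most of the impurity in place.
weak = weighted_impurity([1, 1, 2], [1, 2, 2])
print(weak)
```

The tree prefers the question with the lower weighted impurity, so the first split would be chosen over the second.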

Decision tree visualization

Using scikit-learn's export_graphviz together with graphviz

from sklearn.tree import export_graphviz
import graphviz
import pydotplus

export_graphviz(dt, out_file='tree.dot',
                feature_names=list(x_train.columns),
                class_names=[str(c) for c in dt.classes_],  # one name per class, in ascending order
                max_depth=3,  # maximum depth to draw
                precision=3,  # number of decimal places
                filled=True,  # color nodes by class
                rounded=True,  # rounded boxes
               )

with open('tree.dot') as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
# resize the graph
pydot_graph = pydotplus.graph_from_dot_data(dot_graph)
pydot_graph.set_size('"16,12"')
pydot_graph.write_png('resized_tree.png')
gvz_graph = graphviz.Source(pydot_graph.to_string())
gvz_graph

Random forest

A single decision tree is prone to overfitting
Random forest addresses this with ensembling
It builds many decision trees and combines their votes into the final prediction
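
The voting idea can be sketched in a few lines: each tree predicts a class per sample and the most frequent class wins. The predictions below are hypothetical. Note that, according to the scikit-learn documentation, RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes, so this is a simplified picture.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return, per sample, the class predicted by the most trees."""
    n_samples = len(tree_predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in tree_predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Hypothetical predictions of three trees for four samples.
tree_predictions = [
    [1, 2, 2, 3],  # tree A
    [1, 2, 3, 3],  # tree B
    [2, 2, 3, 1],  # tree C
]
forest_pred = majority_vote(tree_predictions)
print(forest_pred)  # [1, 2, 3, 3]
```

Each tree is wrong on a different sample here, and the vote cancels those individual mistakes, which is the intuition behind the reduced overfitting.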

Using scikit-learn's RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state = 0)

rf.fit(x_train, y_train)
rf_pred = rf.predict(x_test)
rf_pred

rf.predict_proba(x_test)

Naive Bayes

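Assumes the features x are independent of each other (given the class) -> correlated features can be decorrelated first, e.g. with PCA (eigenvectors)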
Gaussian Naive Bayes
Multinomial Naive Bayes
Bernoulli Naive Bayes
Categorical Naive Bayes
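
To show where the independence assumption enters, here is a minimal Gaussian naive Bayes sketch in plain Python: per class it estimates a prior plus a mean and variance for every feature, then simply adds the per-feature log densities. The function names, the toy data, and the 1e-9 variance floor are assumptions of this sketch and not scikit-learn's implementation.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small constant keeps the variance positive (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    def log_score(prior, means, variances):
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # Independence assumption: per-feature log densities simply add up.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_score(*model[label]))

# Hypothetical two-feature data for illustration.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y = ['low', 'low', 'low', 'high', 'high', 'high']
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.1]))  # low
print(predict_gaussian_nb(model, [3.1, 0.5]))  # high
```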

Using scikit-learn's naive_bayes module

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

gnb.fit(x_train, y_train)
gnb_pred = gnb.predict(x_test)
gnb_pred

gnb.predict_proba(x_test)

Save Trained-model

Saving the trained model

import joblib

joblib.dump(gnb, './model_temp.pkl')  # save

loaded_model = joblib.load('./model_temp.pkl')  # load

loaded_model.predict(x_test) == gnb_pred  # element-wise check that the loaded model reproduces the predictions

Confusion Matrix

Binary classification

from sklearn.metrics import confusion_matrix

## (Example 1) binary case
b_y_true = [0,1,0,1]
b_y_pred = [1,1,1,0]

confusion_matrix(b_y_true, b_y_pred)

tn, fp, fn, tp = confusion_matrix(b_y_true, b_y_pred).ravel()
(tn, fp, fn, tp)

## (Example 2) multi-class case
m_y_true = [0,2,1,2]
m_y_pred = [0,1,2,2]

confusion_matrix(m_y_true, m_y_pred)
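
The layout of the matrix is easier to read once it is counted by hand: entry [i][j] is the number of samples whose true class is i and whose predicted class is j, the same convention scikit-learn uses. The helper `confusion_counts` is a hypothetical name and the labels are the ones from the two examples above.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Entry [i][j] counts samples with true class i that were predicted as j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

b_matrix = confusion_counts([0, 1, 0, 1], [1, 1, 1, 0], n_classes=2)
print(b_matrix)  # [[0, 2], [1, 1]] -> tn=0, fp=2, fn=1, tp=1

m_matrix = confusion_counts([0, 2, 1, 2], [0, 1, 2, 2], n_classes=3)
print(m_matrix)  # [[1, 0, 0], [0, 0, 1], [0, 1, 1]]
```

Correct predictions sit on the diagonal, so everything off the diagonal is an error of a specific kind.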

Measuring accuracy

Accuracy

Accuracy for binary classification

$\frac{TP+TN}{TP+TN+FP+FN}$

from sklearn.metrics import accuracy_score

# (Example 1) binary case
accuracy_score(b_y_true, b_y_pred)  # accuracy score
(tp + tn) / (tp + tn + fp + fn)  # same value from (TP + TN) / (TP + TN + FP + FN)

# (Example 2) multi-class case
accuracy_score(m_y_true, m_y_pred)  # accuracy score
(np.array(m_y_true) == np.array(m_y_pred)).sum() / len(m_y_true)  # share of correct predictions
np.trace(confusion_matrix(m_y_true, m_y_pred)) / len(m_y_true)  # np.trace: sum of the diagonal

Precision

The share of predicted positives that are actually positive

$\frac{TP}{TP+FP}$

from sklearn.metrics import precision_score

# (Example 1) binary case
precision_score(b_y_true, b_y_pred)  # precision score
tp / (tp + fp)  # same value from TP / (TP + FP)

Recall

The share of actual positives that are predicted as positive

$\frac{TP}{TP+FN}$

from sklearn.metrics import recall_score

# (Example 1) binary case
recall_score(b_y_true, b_y_pred)  # recall score
tp / (tp + fn)  # same value from TP / (TP + FN)

F1 score

In general, higher recall comes with lower precision
Conversely, higher precision comes with lower recall
The F1 score is the harmonic mean of precision and recall

$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

from sklearn.metrics import f1_score

# (Example 1) binary case
f1_score(b_y_true, b_y_pred)  # F1 score
precision = tp / (tp + fp)
recall = tp / (tp + fn)
2 * precision * recall / (precision + recall)  # same value from the formula

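ROC-AUC

Addresses shortcomings of threshold-based metrics in binary classification
ROC curve: receiver operating characteristic curve
AUC: area under the ROC curve
Compensates for the weakness of accuracy when the class distribution is imbalanced
The larger the AUC, the better (AUC = 1: every positive is ranked above every negative / AUC = 0.5: no better than random guessing)
TPR (True Positive Rate)
- $TPR = \frac{TP}{TP+FN}$
FPR (False Positive Rate)
- $FPR = \frac{FP}{FP+TN}$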
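
A small made-up example shows why accuracy alone can mislead on imbalanced classes, and why TPR and FPR are tracked separately. The labels below are hypothetical: 95 negatives, 5 positives, and a trivial "model" that always predicts the negative class.

```python
# Hypothetical imbalanced labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the negative class.
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print(accuracy)  # 0.95
print(tpr, fpr)  # 0.0 0.0
```

The accuracy looks high even though not a single positive is found, while the TPR of 0 exposes the problem immediately.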

from sklearn.metrics import auc, roc_curve

auc_true = np.array([0,0,1,1])
auc_pred = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thres = roc_curve(auc_true, auc_pred)

auc(fpr, tpr)  # auc score
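
The same AUC can be reproduced by hand, which makes the roles of TPR and FPR visible. The sketch below lowers the threshold score by score, records one (FPR, TPR) point per threshold, and sums trapezoids under the resulting curve. The function names are hypothetical and the labels and scores are the ones used above.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) points obtained by lowering the threshold score by score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under the piecewise-linear curve through the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

points = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(points)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
manual_auc = trapezoid_area(points)
print(manual_auc)  # 0.75
```

The curve dips because one negative (score 0.4) is ranked above one positive (score 0.35), and that single misordered pair out of four is why the area is 0.75 rather than 1.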
