Classification Analysis

jh_k · February 6, 2023

Classification


Classification analysis

A classification algorithm is evaluated by how closely its predicted classes match the actual classes.

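Accuracy

The proportion of predictions that match the actual labels.
Depending on how the data is composed (for example, when one class heavily outnumbers the other), accuracy can give a distorted picture of a machine learning model's performance.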
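To see how class composition can distort accuracy, here is a minimal sketch in plain Python. The labels are made up for illustration, and the `accuracy` helper is written here just to show the arithmetic (scikit-learn's `accuracy_score` computes the same quantity).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the actual labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return matches / len(y_true)


# Made-up imbalanced data: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred_majority = [0] * 100

acc = accuracy(y_true, y_pred_majority)
print(acc)  # 0.95, even though not a single positive was found
```

The score looks high, yet the model is useless for finding positives, which is why the metrics below are usually checked alongside accuracy.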

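Confusion matrix

Shows how many prediction errors a binary classifier makes and which types of error they are.
A 2x2 (four-quadrant) matrix that shows how the actual label classes map to the predicted label classes.
- TN : predicted Negative(0), actual is also Negative(0)
- FP : predicted Positive(1), actual is Negative(0)
- FN : predicted Negative(0), actual is Positive(1)
- TP : predicted Positive(1), actual is also Positive(1)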
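The four cells above can be counted directly from two label lists. The following is a small sketch with made-up labels; `confusion_counts` is a hypothetical helper, not part of any library. The printed layout puts actual classes in rows and predicted classes in columns, which is also how scikit-learn's `confusion_matrix` arranges a binary 0/1 problem.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels encoded as 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp


# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)

# Rows are actual classes, columns are predicted classes.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)  # [[4, 1], [2, 3]]
```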

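Precision

The proportion of cases predicted as Positive that are actually Positive: TP / (TP + FP).
An evaluation metric for measuring the quality of Positive predictions more precisely.
Also called the positive predictive value.
Precision matters more when wrongly predicting an actually Negative case as Positive (a false positive) has a serious impact on the business.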
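As a rough illustration, precision can be computed by looking only at the cases the model predicted as Positive. The labels are made up, and returning 0.0 when nothing was predicted Positive is a convention chosen for this sketch (scikit-learn's `precision_score` handles that case through its `zero_division` parameter).

```python
def precision(y_true, y_pred):
    """TP / (TP + FP): of everything predicted Positive, how much really is Positive."""
    # Keep the actual labels of the cases that were predicted Positive (1).
    actual_for_predicted_pos = [a for a, p in zip(y_true, y_pred) if p == 1]
    if not actual_for_predicted_pos:
        return 0.0  # assumption for this sketch: no Positive predictions -> 0.0
    true_positives = sum(1 for a in actual_for_predicted_pos if a == 1)
    return true_positives / len(actual_for_predicted_pos)


# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]

prec = precision(y_true, y_pred)
print(prec)  # 0.75 -> 3 of the 4 Positive predictions were correct
```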

Recall

The proportion of actual Positives that were predicted as Positive: TP / (TP + FN).
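Recall can be sketched the same way, this time looking only at the cases that are actually Positive. The labels are made up, and returning 0.0 when there are no actual Positives is a convention chosen for this sketch (scikit-learn's `recall_score` is the library equivalent).

```python
def recall(y_true, y_pred):
    """TP / (TP + FN): of everything actually Positive, how much was found."""
    # Keep the predictions for the cases that are actually Positive (1).
    predicted_for_actual_pos = [p for a, p in zip(y_true, y_pred) if a == 1]
    if not predicted_for_actual_pos:
        return 0.0  # assumption for this sketch: no actual Positives -> 0.0
    true_positives = sum(1 for p in predicted_for_actual_pos if p == 1)
    return true_positives / len(predicted_for_actual_pos)


# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]

rec = recall(y_true, y_pred)
print(rec)  # 0.6 -> 3 of the 5 actual Positives were found
```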

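F1 score

A classification metric that combines precision and recall.
It is the harmonic mean of the two: F1 = 2 × (precision × recall) / (precision + recall).
It takes a relatively high value when precision and recall are well balanced rather than skewed toward one side.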
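A short sketch of why the harmonic mean rewards balance: the precision and recall values below are made up, and `f1_score_from` is a hypothetical helper written only for this comparison (scikit-learn's `f1_score` computes F1 from labels directly).

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Balanced case: precision and recall are close to each other.
f1_balanced = f1_score_from(0.75, 0.6)

# Skewed case: perfect precision, very low recall.
f1_skewed = f1_score_from(1.0, 0.1)
arithmetic_mean_skewed = (1.0 + 0.1) / 2

print(round(f1_balanced, 4))             # 0.6667
print(round(f1_skewed, 4))               # 0.1818
print(round(arithmetic_mean_skewed, 4))  # 0.55
```

In the skewed case the plain average still looks moderate, while F1 drops sharply because one of the two components is poor.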

ROC curve and AUC score

The ROC curve plots how the TPR (True Positive Rate) changes as the FPR (False Positive Rate) changes.
The AUC score is the area under the ROC curve and is used as a classification performance metric.
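The idea can be sketched in plain Python by lowering the decision threshold step by step and recording (FPR, TPR) at each step, then summing the area under those points with the trapezoidal rule. The scores and labels are made up, and `roc_points` and `auc_trapezoid` are hypothetical helpers for this sketch (scikit-learn offers `roc_curve` and `roc_auc_score` for real use).

```python
def roc_points(y_true, scores):
    """Return ROC points (FPR, TPR), starting at (0, 0) and ending at (1, 1)."""
    n_pos = sum(1 for label in y_true if label == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC needs at least one positive and one negative label")

    # Sort by score, highest first, so each step lowers the threshold.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and predicted scores (e.g. predicted probabilities of class 1).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc_value = auc_trapezoid(points)

print(points)     # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc_value)  # 0.75
```

An AUC of 0.5 corresponds to the diagonal of a random classifier, and 1.0 to a classifier that ranks every positive above every negative.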

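## Inspect the data
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris() # load the iris dataset
iris_dt = iris.data # iris.data is a NumPy array holding only the independent variables (features)
iris_label = iris.target # iris.target holds the dependent variable (label) values as a NumPy array

## Build a DataFrame of the features and append the label as the 'Species' column
df = pd.DataFrame(data=iris_dt, columns=iris.feature_names)
df['Species'] = iris_label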

## Split the data (80% train, 20% test, stratified by label)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    iris_dt, 
    iris_label, 
    test_size=0.2, 
    random_state=0, 
    stratify=iris_label
)

## Model training
## Run the classification analysis with decision trees
## Create three decision tree models with max_depth set to 5, 3, and 1
from sklearn.tree import DecisionTreeClassifier
dtree_clf_5 = DecisionTreeClassifier(max_depth=5, random_state=100)
dtree_clf_3 = DecisionTreeClassifier(max_depth=3, random_state=100)
dtree_clf_1 = DecisionTreeClassifier(max_depth=1, random_state=100)

## Tree depth 5: run 10-fold cross-validation
from sklearn.model_selection import cross_val_score
import numpy as np
scores_1 = cross_val_score(dtree_clf_5, x_train, y_train, scoring='accuracy', cv=10)
print('Cross-validation accuracy per fold: ', np.round(scores_1, 3))
print('Mean validation accuracy: ', np.round(np.mean(scores_1), 4))

## Tree depth 3: run 10-fold cross-validation
scores_2 = cross_val_score(dtree_clf_3, x_train, y_train, scoring='accuracy', cv=10)
print('Cross-validation accuracy per fold: ', np.round(scores_2, 3))
print('Mean validation accuracy: ', np.round(np.mean(scores_2), 4))

## Tree depth 1: run 10-fold cross-validation
scores_3 = cross_val_score(dtree_clf_1, x_train, y_train, scoring='accuracy', cv=10)
print('Cross-validation accuracy per fold: ', np.round(scores_3, 3))
print('Mean validation accuracy: ', np.round(np.mean(scores_3), 4))

## In cross-validation, the depth-5 tree had the best accuracy
## Fit that model on the training data, predict the test data, and evaluate the predictions
dtree_clf_5.fit(x_train, y_train)
pred = dtree_clf_5.predict(x_test)
from sklearn.metrics import accuracy_score
print('Decision tree (after cross-validation) test accuracy: {0:.5f}'.format(accuracy_score(y_test, pred)))
