Classifying the Wine Dataset with Machine Learning

김승환 · July 15, 2021



from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline, Pipeline
import matplotlib.pyplot as plt
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline

Load the wine dataset and check what it contains through its keys.

# Inspect the contents via the keys
wine = load_wine()
print(wine.keys())

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

The wine data has 13 variables (features) and 178 samples (rows) in total.

# Store the features
wine_data = wine.data
print(wine_data.shape)

(178, 13)

The wine data has loaded correctly as numeric values.

# Check the values in the first row (index 0) of the feature data
wine_data[0]

array([1.423e+01, 1.710e+00, 2.430e+00, 1.560e+01, 1.270e+02, 2.800e+00,
3.060e+00, 2.800e-01, 2.290e+00, 5.640e+00, 1.040e+00, 3.920e+00,
1.065e+03])
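The thirteen raw numbers are easier to read next to their names. A minimal sketch, assuming only the `feature_names` key listed in the output above; `first_row` is just an illustrative variable name, not something from the original notebook:

```python
from sklearn.datasets import load_wine

wine = load_wine()

# Pair each feature name with its value in the first row (index 0).
first_row = dict(zip(wine.feature_names, wine.data[0]))
for name, value in first_row.items():
    print(f"{name:>30}: {value}")
```

Printed this way, it is easy to see that the features live on very different scales (for example `proline` versus `hue`).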

Since the data has 178 rows, the labels also have 178 entries.

# Store the labels
wine_label = wine.target
print(wine_label.shape)

(178,)

# Check what the labels look like
wine_label[0:20]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

There are three labels. We can think of them as three kinds of wine.

# Check the label classes
wine.target_names

array(['class_0', 'class_1', 'class_2'], dtype='<U7')
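The first 20 labels shown earlier are all 0 because the rows are ordered by class, so that slice says nothing about how many samples each class has. A small sketch that counts them, using `np.bincount` (my choice of helper, not something the post uses):

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Count how many samples carry each label (0, 1, 2).
counts = np.bincount(wine.target)
for name, count in zip(wine.target_names, counts):
    print(name, count)
```

The three classes hold 59, 71, and 48 samples, so they are of roughly similar size.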

To fit the models, we first split the data into a training set and a test set.

# Split into training data and test data
X_train, X_test, y_train, y_test = train_test_split(wine_data, 
                                                    wine_label, 
                                                    test_size=0.2, 
                                                    random_state=1)
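To see what the split produces, here is a self-contained sketch that repeats the same call and prints the resulting shapes. It reloads the data so that it runs on its own:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=1
)

# 20% of the 178 rows is held out for testing; the rest is used for training.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With `test_size=0.2`, 36 rows go to the test set and 142 to the training set. Passing `stratify=wine.target` would additionally keep the class proportions the same in both parts; the post does not do this, so treat it as an optional variation.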

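Before fitting any models: which evaluation metric suits the wine data?

The wine task is about telling wine types apart, so I think accuracy is the appropriate metric. With classes 0, 1, and 2, a good model is simply one that predicts 0 as 0, 1 as 1, and 2 as 2. In a model that classifies cancer as positive or negative, calling a positive case negative is an error that must be avoided, but the wine data has no such situation. Predicting 0 as 1 is just a misclassification, and no kind of error needs special treatment.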
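To make the metric concrete, the sketch below computes accuracy, recall, and precision by hand from a 3x3 confusion matrix. The matrix values are illustrative, and the layout (rows are true classes, columns are predicted classes) follows scikit-learn's `confusion_matrix` convention:

```python
import numpy as np

# Illustrative confusion matrix: rows are true classes, columns are predictions.
cm = np.array([[14, 0, 0],
               [1, 12, 0],
               [0, 3, 6]])

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: diagonal over row sums.
recall = np.diag(cm) / cm.sum(axis=1)

# Per-class precision: diagonal over column sums.
precision = np.diag(cm) / cm.sum(axis=0)

print(accuracy)
print(recall.round(2), precision.round(2))
```

Accuracy treats every off-diagonal cell the same way, which matches the argument above that no particular kind of mistake is worse than another here.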

Decision Tree

# Fit the model
decision_tree = DecisionTreeClassifier(random_state=32)
decision_tree.fit(X_train, y_train)

# Predict on the test set
y_pred = decision_tree.predict(X_test)

# Compute the accuracy
decision_accuracy = accuracy_score(y_test, y_pred)
print('Decision tree accuracy : ', decision_accuracy)

# Check the classification report
decision_report = classification_report(y_test, y_pred)
print(decision_report)

# Check the confusion matrix
decision_matrix = confusion_matrix(y_test, y_pred)
print(decision_matrix)

Decision tree accuracy :  0.8888888888888888
              precision    recall  f1-score   support

           0       0.93      1.00      0.97        14
           1       0.80      0.92      0.86        13
           2       1.00      0.67      0.80         9

   micro avg       0.89      0.89      0.89        36
   macro avg       0.91      0.86      0.87        36
weighted avg       0.90      0.89      0.89        36

[[14  0  0]
 [ 1 12  0]
 [ 0  3  6]]

Random Forest

# Fit the model
random_forest = RandomForestClassifier(random_state=32)
random_forest.fit(X_train, y_train)

# Predict on the test set
y_pred = random_forest.predict(X_test)

# Compute the accuracy
random_accuracy = accuracy_score(y_test, y_pred)
print('Random forest accuracy : ', random_accuracy)

# Check the classification report
random_report = classification_report(y_test, y_pred)
print(random_report)

# Check the confusion matrix
random_matrix = confusion_matrix(y_test, y_pred)
print(random_matrix)

Random forest accuracy :  0.9722222222222222
              precision    recall  f1-score   support

           0       0.93      1.00      0.97        14
           1       1.00      0.92      0.96        13
           2       1.00      1.00      1.00         9

   micro avg       0.97      0.97      0.97        36
   macro avg       0.98      0.97      0.98        36
weighted avg       0.97      0.97      0.97        36

[[14  0  0]
 [ 1 12  0]
 [ 0  0  9]]

SVM

# Fit the model
svm_model = svm.SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_model.predict(X_test)

# Compute the accuracy
svm_accuracy = accuracy_score(y_test, y_pred)
print('SVM accuracy : ', svm_accuracy)

# Check the classification report
svm_report = classification_report(y_test, y_pred)
print(svm_report)

# Check the confusion matrix
svm_matrix = confusion_matrix(y_test, y_pred)
print(svm_matrix)

SVM accuracy :  0.9444444444444444
              precision    recall  f1-score   support

           0       0.93      1.00      0.97        14
           1       0.92      0.92      0.92        13
           2       1.00      0.89      0.94         9

   micro avg       0.94      0.94      0.94        36
   macro avg       0.95      0.94      0.94        36
weighted avg       0.95      0.94      0.94        36

[[14  0  0]
 [ 1 12  0]
 [ 0  1  8]]

SGD

# Fit the model
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)

# Predict on the test set
y_pred = sgd_model.predict(X_test)

# Compute the accuracy
sgd_accuracy = accuracy_score(y_test, y_pred)
print('SGD accuracy : ', sgd_accuracy)

# Check the classification report
sgd_report = classification_report(y_test, y_pred)
print(sgd_report)

# Check the confusion matrix
sgd_matrix = confusion_matrix(y_test, y_pred)
print(sgd_matrix)

SGD accuracy :  0.5277777777777778
              precision    recall  f1-score   support

           0       0.48      1.00      0.65        14
           1       0.71      0.38      0.50        13
           2       0.00      0.00      0.00         9

   micro avg       0.53      0.53      0.53        36
   macro avg       0.40      0.46      0.38        36
weighted avg       0.45      0.53      0.43        36

[[14  0  0]
 [ 8  5  0]
 [ 7  2  0]]
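SGD scores far below the other models here. One plausible cause, which the post itself does not test, is feature scale: the wine features range from values near 1 to values in the thousands, and gradient-based linear models are sensitive to that. The sketch below is my own assumption about a possible fix. It puts a `StandardScaler` in front of `SGDClassifier` with `make_pipeline`, and the fixed `random_state` is only there to make the run repeatable:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=1
)

# Standardize each feature to zero mean and unit variance, then fit SGD.
# The scaler is fitted on the training data only, inside the pipeline.
scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier(random_state=32))
scaled_sgd.fit(X_train, y_train)

scaled_sgd_accuracy = accuracy_score(y_test, scaled_sgd.predict(X_test))
print('Scaled SGD accuracy : ', scaled_sgd_accuracy)
```

If scaling is indeed the issue, the accuracy printed here should land much closer to the other models than 0.53 does. The exact number depends on the scikit-learn version.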

Logistic Regression

# Fit the model
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

# Predict on the test set
y_pred = logistic_model.predict(X_test)

# Compute the accuracy
logistic_accuracy = accuracy_score(y_test, y_pred)
print('LogisticRegression accuracy : ', logistic_accuracy)

# Check the classification report
logistic_report = classification_report(y_test, y_pred)
print(logistic_report)

# Check the confusion matrix
logistic_matrix = confusion_matrix(y_test, y_pred)
print(logistic_matrix)

LogisticRegression accuracy :  0.9444444444444444
              precision    recall  f1-score   support

           0       1.00      0.93      0.96        14
           1       0.87      1.00      0.93        13
           2       1.00      0.89      0.94         9

   micro avg       0.94      0.94      0.94        36
   macro avg       0.96      0.94      0.94        36
weighted avg       0.95      0.94      0.95        36

[[13  1  0]
 [ 0 13  0]
 [ 0  1  8]]

print('Decision tree accuracy : ', decision_accuracy)
print('Random forest accuracy : ', random_accuracy)
print('SVM accuracy : ', svm_accuracy)
print('SGD accuracy : ', sgd_accuracy)
print('LogisticRegression accuracy : ', logistic_accuracy)

Decision tree accuracy :  0.8888888888888888
Random forest accuracy :  0.9722222222222222
SVM accuracy :  0.9444444444444444
SGD accuracy :  0.5277777777777778
LogisticRegression accuracy :  0.9444444444444444
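Each of the five sections repeats the same fit, predict, and score steps, so the whole comparison can also be written as one loop. A rough sketch under the same split and model settings as above; the `models` and `accuracies` dictionaries are names I made up for this example:

```python
from sklearn import svm
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=1
)

# Same five models and settings as in the sections above.
models = {
    'Decision tree': DecisionTreeClassifier(random_state=32),
    'Random forest': RandomForestClassifier(random_state=32),
    'SVM': svm.SVC(kernel='linear'),
    'SGD': SGDClassifier(),
    'LogisticRegression': LogisticRegression(),
}

# Fit each model and record its accuracy on the held-out test set.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from best to worst.
for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f'{name} accuracy : {acc:.4f}')
```

The numbers may differ from the ones listed above depending on the scikit-learn version, and SGD without a fixed `random_state` can change from run to run.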

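On this classification problem, random forest performs best, with an accuracy of about 0.97. Building many decision trees and letting them classify together seems to work well on the wine data. Most of the other models also do well, with SGD (about 0.53) as the one exception. Maybe that is because the wine data is simply well separated and easy to analyze? Haha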
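All of the scores above come from a single test set of 36 samples, so one or two predictions can shift the ranking. As a hedged sanity check that goes beyond what the post does, the sketch below runs 5-fold cross-validation for the random forest with `cross_val_score`; the choice of 5 folds is my own assumption:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

wine = load_wine()

# 5-fold cross-validation: every sample is used for testing exactly once.
forest = RandomForestClassifier(random_state=32)
scores = cross_val_score(forest, wine.data, wine.target, cv=5)

print(scores)
print('Mean accuracy : ', scores.mean())
```

If the fold scores stay close to each other, that supports the guess that the wine data is easy to separate rather than the split being lucky.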