Handwritten Digit Classification with Machine Learning

김승환 · July 15, 2021

import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Jupyter magic: render plots inline in the notebook
%matplotlib inline

First, inspect the loaded handwritten digits dataset through its keys.

# Inspect the contents via the keys
digits = load_digits()
print(digits.keys())

dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

The digits data has 64 variables (features) and 1,797 samples (rows) in total.

# Store the features
digits_data = digits.data
print(digits_data.shape)

(1797, 64)

The image data has been loaded as numeric pixel values.

# Check the values of sample 0 in the feature data
digits_data[0]

array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10.,
15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4.,
12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8.,
0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5.,
10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.])

The image was loaded as numbers, so we use matplotlib to view it as a picture.

# Display the feature vector as an 8x8 image
plt.imshow(digits.data[0].reshape(8, 8), cmap='gray')
plt.axis('off')
plt.show()
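
As a quick check on the reshape above, the dataset also ships the same pixels already arranged as 8x8 arrays under the `images` key (one of the keys printed earlier). A minimal sketch, assuming only `load_digits` from scikit-learn, that confirms a row of `data` is the flattened form of the matching entry in `images`:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# data holds one flattened 64-value row per image; images holds 8x8 arrays
first_flat = digits.data[0]
first_grid = digits.images[0]

# Reshaping the flat row should reproduce the 8x8 grid exactly
same_pixels = np.array_equal(first_flat.reshape(8, 8), first_grid)
print(first_grid.shape, same_pixels)
```

So `plt.imshow(digits.images[0], cmap='gray')` should draw the same picture without the explicit reshape.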

Since the data has 1,797 rows, the labels also have 1,797 entries.

# Store the labels
digits_label = digits.target
print(digits_label.shape)

(1797,)

# Look at the first few labels
digits_label[0:20]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

The labels take values from 0 to 9.

# Check the set of label classes
digits.target_names

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Split the data into train and test sets before fitting the models.

# Split into train data and test data
X_train, X_test, y_train, y_test = train_test_split(digits_data, 
                                                    digits_label, 
                                                    test_size=0.2, 
                                                    random_state=1)
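
To make the effect of `test_size=0.2` concrete, here is a small self-contained sketch that repeats the same split and prints the resulting shapes. It uses only the calls already shown in this post.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Same split as above: 20% of the 1797 samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)  # (1437, 64) (360, 64)
print(y_train.shape, y_test.shape)  # (1437,) (360,)
```

The 360 test samples are the `support` total that appears at the bottom of every classification report below.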

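Before fitting the models, which evaluation metric should we focus on for this image data?

For this data, accuracy is enough. The task is to assign each image to one of the classes 0 to 9, and what matters is how often the model predicts the correct class.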
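
Accuracy is a reasonable headline metric mainly when the classes are roughly balanced, which is an assumption worth checking rather than taking for granted. A minimal sketch that counts how many samples each digit has (the `ratio` variable is just an illustrative name):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Count the number of samples for each label 0..9
counts = np.bincount(digits.target)
for digit, count in enumerate(counts):
    print(digit, count)

# Ratio between the largest and smallest class; close to 1 means balanced
ratio = counts.max() / counts.min()
print('max/min class ratio:', round(ratio, 3))
```

If the ratio were far from 1, per-class precision and recall would deserve more attention than overall accuracy.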

Decision Tree

# Fit the model
decision_tree = DecisionTreeClassifier(random_state=32)
decision_tree.fit(X_train, y_train)

# Predict on the test set
y_pred = decision_tree.predict(X_test)

# Check the classification report
decision_report = classification_report(y_test, y_pred)
print(decision_report)

# Check the confusion matrix
decision_matrix = confusion_matrix(y_test, y_pred)
print(decision_matrix)

# Check the accuracy
decision_accuracy = accuracy_score(y_test, y_pred)
print('Decision tree accuracy : ', decision_accuracy)

              precision    recall  f1-score   support

           0       1.00      0.93      0.96        43
           1       0.93      0.77      0.84        35
           2       0.94      0.86      0.90        36
           3       0.80      0.78      0.79        41
           4       0.77      0.89      0.83        38
           5       0.83      0.97      0.89        30
           6       0.92      0.97      0.95        37
           7       0.89      0.86      0.88        37
           8       0.78      0.86      0.82        29
           9       0.75      0.71      0.73        34

   micro avg       0.86      0.86      0.86       360
   macro avg       0.86      0.86      0.86       360
weighted avg       0.87      0.86      0.86       360

[[40  0  0  0  1  1  0  0  1  0]
 [ 0 27  1  1  1  1  0  0  2  2]
 [ 0  0 31  1  0  0  3  0  1  0]
 [ 0  0  1 32  0  0  0  2  2  4]
 [ 0  1  0  0 34  1  0  2  0  0]
 [ 0  0  0  0  0 29  0  0  0  1]
 [ 0  0  0  1  0  0 36  0  0  0]
 [ 0  0  0  1  3  0  0 32  0  1]
 [ 0  0  0  2  1  1  0  0 25  0]
 [ 0  1  0  2  4  2  0  0  1 24]]
Decision tree accuracy :  0.8611111111111112
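
The accuracy and the per-class recall in the report can both be recovered from the confusion matrix alone. The sketch below hard-codes the decision tree matrix printed above (rows are true digits, columns are predicted digits) and recomputes them with NumPy.

```python
import numpy as np

# Decision tree confusion matrix copied from the output above
# (rows = true digit, columns = predicted digit)
cm = np.array([
    [40,  0,  0,  0,  1,  1,  0,  0,  1,  0],
    [ 0, 27,  1,  1,  1,  1,  0,  0,  2,  2],
    [ 0,  0, 31,  1,  0,  0,  3,  0,  1,  0],
    [ 0,  0,  1, 32,  0,  0,  0,  2,  2,  4],
    [ 0,  1,  0,  0, 34,  1,  0,  2,  0,  0],
    [ 0,  0,  0,  0,  0, 29,  0,  0,  0,  1],
    [ 0,  0,  0,  1,  0,  0, 36,  0,  0,  0],
    [ 0,  0,  0,  1,  3,  0,  0, 32,  0,  1],
    [ 0,  0,  0,  2,  1,  1,  0,  0, 25,  0],
    [ 0,  1,  0,  2,  4,  2,  0,  0,  1, 24],
])

# Accuracy = correctly classified (diagonal) / all test samples
accuracy = np.trace(cm) / cm.sum()

# Recall per class = diagonal / row sums; precision = diagonal / column sums
recall = np.diag(cm) / cm.sum(axis=1)
precision = np.diag(cm) / cm.sum(axis=0)

print(round(accuracy, 4))  # 0.8611
print(np.round(recall, 2))
print(np.round(precision, 2))
```

The diagonal sums to 310 out of 360 test samples, which is where the 0.8611 accuracy comes from.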

Random Forest

# Fit the model
random_forest = RandomForestClassifier(random_state=32)
random_forest.fit(X_train, y_train)

# Predict on the test set
y_pred = random_forest.predict(X_test)

# Check the classification report
random_report = classification_report(y_test, y_pred)
print(random_report)

# Check the confusion matrix
random_matrix = confusion_matrix(y_test, y_pred)
print(random_matrix)

# Check the accuracy
random_accuracy = accuracy_score(y_test, y_pred)
print('Random forest accuracy : ', random_accuracy)

              precision    recall  f1-score   support

           0       0.98      0.98      0.98        43
           1       0.89      0.97      0.93        35
           2       0.92      0.94      0.93        36
           3       0.95      0.95      0.95        41
           4       0.97      0.97      0.97        38
           5       0.90      0.93      0.92        30
           6       1.00      1.00      1.00        37
           7       0.90      0.97      0.94        37
           8       0.96      0.76      0.85        29
           9       0.91      0.85      0.88        34

   micro avg       0.94      0.94      0.94       360
   macro avg       0.94      0.93      0.93       360
weighted avg       0.94      0.94      0.94       360

[[42  0  0  0  1  0  0  0  0  0]
 [ 0 34  1  0  0  0  0  0  0  0]
 [ 0  0 34  1  0  0  0  0  0  1]
 [ 0  1  0 39  0  1  0  0  0  0]
 [ 0  0  0  0 37  0  0  1  0  0]
 [ 0  0  0  0  0 28  0  0  1  1]
 [ 0  0  0  0  0  0 37  0  0  0]
 [ 0  0  0  0  0  0  0 36  0  1]
 [ 1  2  2  0  0  2  0  0 22  0]
 [ 0  1  0  1  0  0  0  3  0 29]]
Random forest accuracy :  0.9388888888888889

SVM

# Fit the model
svm_model = svm.SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_model.predict(X_test)

# Check the classification report
svm_report = classification_report(y_test, y_pred)
print(svm_report)

# Check the confusion matrix
svm_matrix = confusion_matrix(y_test, y_pred)
print(svm_matrix)

# Check the accuracy
svm_accuracy = accuracy_score(y_test, y_pred)
print('SVM accuracy : ', svm_accuracy)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        43
           1       1.00      1.00      1.00        35
           2       1.00      1.00      1.00        36
           3       1.00      1.00      1.00        41
           4       1.00      1.00      1.00        38
           5       0.94      1.00      0.97        30
           6       1.00      1.00      1.00        37
           7       1.00      0.97      0.99        37
           8       1.00      0.97      0.98        29
           9       0.97      0.97      0.97        34

   micro avg       0.99      0.99      0.99       360
   macro avg       0.99      0.99      0.99       360
weighted avg       0.99      0.99      0.99       360

[[43  0  0  0  0  0  0  0  0  0]
 [ 0 35  0  0  0  0  0  0  0  0]
 [ 0  0 36  0  0  0  0  0  0  0]
 [ 0  0  0 41  0  0  0  0  0  0]
 [ 0  0  0  0 38  0  0  0  0  0]
 [ 0  0  0  0  0 30  0  0  0  0]
 [ 0  0  0  0  0  0 37  0  0  0]
 [ 0  0  0  0  0  0  0 36  0  1]
 [ 0  0  0  0  0  1  0  0 28  0]
 [ 0  0  0  0  0  1  0  0  0 33]]
SVM accuracy :  0.9916666666666667

SGD


# Fit the model
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)

# Predict on the test set
y_pred = sgd_model.predict(X_test)

# Check the classification report
sgd_report = classification_report(y_test, y_pred)
print(sgd_report)

# Check the confusion matrix
sgd_matrix = confusion_matrix(y_test, y_pred)
print(sgd_matrix)

# Check the accuracy
sgd_accuracy = accuracy_score(y_test, y_pred)
print('SGD accuracy : ', sgd_accuracy)

              precision    recall  f1-score   support

           0       0.98      0.98      0.98        43
           1       0.97      0.97      0.97        35
           2       1.00      0.94      0.97        36
           3       1.00      0.95      0.97        41
           4       0.93      1.00      0.96        38
           5       0.91      1.00      0.95        30
           6       1.00      1.00      1.00        37
           7       0.97      0.92      0.94        37
           8       0.88      0.97      0.92        29
           9       0.97      0.88      0.92        34

   micro avg       0.96      0.96      0.96       360
   macro avg       0.96      0.96      0.96       360
weighted avg       0.96      0.96      0.96       360

[[42  0  0  0  1  0  0  0  0  0]
 [ 0 34  0  0  1  0  0  0  0  0]
 [ 0  1 34  0  0  0  0  1  0  0]
 [ 0  0  0 39  0  1  0  0  1  0]
 [ 0  0  0  0 38  0  0  0  0  0]
 [ 0  0  0  0  0 30  0  0  0  0]
 [ 0  0  0  0  0  0 37  0  0  0]
 [ 0  0  0  0  1  0  0 34  1  1]
 [ 0  0  0  0  0  1  0  0 28  0]
 [ 1  0  0  0  0  1  0  0  2 30]]
SGD accuracy :  0.9611111111111111

Logistic Regression

# Fit the model
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

# Predict on the test set
y_pred = logistic_model.predict(X_test)

# Check the classification report
logistic_report = classification_report(y_test, y_pred)
print(logistic_report)

# Check the confusion matrix
logistic_matrix = confusion_matrix(y_test, y_pred)
print(logistic_matrix)

# Check the accuracy
logistic_accuracy = accuracy_score(y_test, y_pred)
print('LogisticRegression accuracy : ', logistic_accuracy)

              precision    recall  f1-score   support

           0       1.00      0.98      0.99        43
           1       1.00      0.97      0.99        35
           2       1.00      0.97      0.99        36
           3       0.95      0.95      0.95        41
           4       0.97      1.00      0.99        38
           5       0.94      0.97      0.95        30
           6       1.00      1.00      1.00        37
           7       1.00      0.97      0.99        37
           8       0.90      0.93      0.92        29
           9       0.91      0.94      0.93        34

   micro avg       0.97      0.97      0.97       360
   macro avg       0.97      0.97      0.97       360
weighted avg       0.97      0.97      0.97       360

[[42  0  0  0  1  0  0  0  0  0]
 [ 0 34  0  0  0  0  0  0  1  0]
 [ 0  0 35  1  0  0  0  0  0  0]
 [ 0  0  0 39  0  0  0  0  1  1]
 [ 0  0  0  0 38  0  0  0  0  0]
 [ 0  0  0  1  0 29  0  0  0  0]
 [ 0  0  0  0  0  0 37  0  0  0]
 [ 0  0  0  0  0  0  0 36  0  1]
 [ 0  0  0  0  0  1  0  0 27  1]
 [ 0  0  0  0  0  1  0  0  1 32]]
LogisticRegression accuracy :  0.9694444444444444
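
Depending on the scikit-learn version, `LogisticRegression()` with default settings may emit a convergence warning on raw pixel values like these. The following is a sketch of one common remedy, not the setup used for the results above: it standardizes the features and raises `max_iter`. The pipeline and the value `max_iter=1000` are assumptions chosen for illustration, and the resulting accuracy may differ from the number reported in this post.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1
)

# Assumption: scaling the pixels and allowing more iterations helps the solver converge
scaled_logistic = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
scaled_logistic.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_test, scaled_logistic.predict(X_test))
print('Scaled LogisticRegression accuracy : ', scaled_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into preprocessing.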

print('Decision tree accuracy : ', decision_accuracy)
print('Random forest accuracy : ', random_accuracy)
print('SVM accuracy : ', svm_accuracy)
print('SGD accuracy : ', sgd_accuracy)
print('LogisticRegression accuracy : ', logistic_accuracy)

Decision tree accuracy :  0.8611111111111112
Random forest accuracy :  0.9388888888888889
SVM accuracy :  0.9916666666666667
SGD accuracy :  0.9611111111111111
LogisticRegression accuracy :  0.9694444444444444
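
The five train/predict/score cells above follow the same pattern, so the comparison can also be written as one loop. This is a compact sketch rather than the code that produced the numbers in this post: the dictionary keys are illustrative names, and the `random_state` on `SGDClassifier` and the `max_iter` on `LogisticRegression` are additions assumed here for reproducibility and convergence. Exact accuracies may differ by scikit-learn version.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1
)

models = {
    'decision tree': DecisionTreeClassifier(random_state=32),
    'random forest': RandomForestClassifier(random_state=32),
    'SVM': svm.SVC(kernel='linear'),
    'SGD': SGDClassifier(random_state=32),  # seed is an assumption added here
    'logistic regression': LogisticRegression(max_iter=5000),  # max_iter is an assumption
}

# Fit every model on the same split and record its test accuracy
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Print from best to worst
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {acc:.4f}')
```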

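Comparing the accuracies, SVM performed best. Random forest is often one of the strongest models on general classification problems, so this is a little surprising; perhaps, given the nature of this image data, the SVM's decision boundaries happen to separate the classes especially well. As for why random forest did worse than logistic regression here: the 32 passed to RandomForestClassifier is random_state, which only fixes the random seed and does not set the number of decision trees. The number of trees is controlled by n_estimators, and increasing it would likely give better performance.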
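
In scikit-learn, the number of trees in a random forest is set by `n_estimators`, while `random_state` only seeds the randomness. A minimal sketch that keeps the seed fixed at 32 and varies the tree count instead; the values 10, 100, and 300 are arbitrary choices for illustration, and the accuracies will depend on the scikit-learn version.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1
)

tree_counts = {}
for n_trees in (10, 100, 300):
    # Same seed every time; only the number of trees changes
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=32)
    forest.fit(X_train, y_train)
    acc = accuracy_score(y_test, forest.predict(X_test))

    # estimators_ holds the fitted trees, so its length is the real tree count
    tree_counts[n_trees] = len(forest.estimators_)
    print(n_trees, len(forest.estimators_), round(acc, 4))
```

The length of `estimators_` follows `n_estimators` and is unaffected by the seed.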


# Fit the model
random_forest = RandomForestClassifier(random_state=55)  # changed value
random_forest.fit(X_train, y_train)

# Predict on the test set
y_pred = random_forest.predict(X_test)

# Check the classification report
random_report = classification_report(y_test, y_pred)
print(random_report)

# Check the confusion matrix
random_matrix = confusion_matrix(y_test, y_pred)
print(random_matrix)

# Check the accuracy
random_accuracy = accuracy_score(y_test, y_pred)
print('Random forest accuracy : ', random_accuracy)

              precision    recall  f1-score   support

           0       1.00      0.95      0.98        43
           1       0.95      1.00      0.97        35
           2       1.00      0.97      0.99        36
           3       0.98      0.98      0.98        41
           4       0.95      0.97      0.96        38
           5       1.00      0.93      0.97        30
           6       1.00      1.00      1.00        37
           7       0.97      0.97      0.97        37
           8       0.96      0.93      0.95        29
           9       0.89      0.97      0.93        34

   micro avg       0.97      0.97      0.97       360
   macro avg       0.97      0.97      0.97       360
weighted avg       0.97      0.97      0.97       360

[[41  0  0  0  2  0  0  0  0  0]
 [ 0 35  0  0  0  0  0  0  0  0]
 [ 0  0 35  0  0  0  0  0  1  0]
 [ 0  0  0 40  0  0  0  0  0  1]
 [ 0  0  0  0 37  0  0  1  0  0]
 [ 0  1  0  0  0 28  0  0  0  1]
 [ 0  0  0  0  0  0 37  0  0  0]
 [ 0  0  0  0  0  0  0 36  0  1]
 [ 0  1  0  0  0  0  0  0 27  1]
 [ 0  0  0  1  0  0  0  0  0 33]]
Random forest accuracy :  0.9694444444444444

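Changing the random_state value above gave a higher accuracy than before (0.9694 versus 0.9389). Since random_state only changes the random seed, this difference shows how much the result can vary from run to run on a 360-sample test set rather than a real improvement to the model.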
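
To judge how much of the difference comes from the seed alone, one option is to repeat the fit over several seeds and look at the spread. A rough sketch under that assumption; the list of seeds is arbitrary (32 and 55 are included because they appear above), and the numbers will vary with the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1
)

seeds = [0, 1, 2, 32, 55]  # arbitrary seeds for illustration
scores = []
for seed in seeds:
    # Only the seed changes; every other setting stays at its default
    forest = RandomForestClassifier(random_state=seed)
    forest.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, forest.predict(X_test)))

for seed, score in zip(seeds, scores):
    print(seed, round(score, 4))
print('mean:', round(np.mean(scores), 4), 'std:', round(np.std(scores), 4))
```

A gap between two seeds that is about the same size as this spread is not strong evidence that one setting is better than the other.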

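In fact, classical machine learning models generally do not perform well on image classification. They do reasonably well here only because the handwritten digits are grayscale images that contain nothing but a single simple digit. Once the images vary in color, angle, and digit size, the models used above would very likely perform much worse. Either way, on this simple handwritten digit classification problem, logistic regression, random forest, and SVM all perform well!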