[Kaggle] Digit Recognizer

ByungJik_Oh · April 16, 2025



💡 Problem

Implement a model that predicts the digits 0 through 9 from the value of each pixel in an image.

🔥 Model used for prediction: DNN (Deep Neural Network) - multiclass classification


📖 Dataset

Each sample is a 28 x 28 (784-pixel) image of a handwritten digit, with the 2D pixel data stored as a 1D row.
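As a small illustration of that layout, here is a sketch of how a 28 x 28 image maps to a 784-value row and back. The image is made up, and the row-major ordering (pixel at row r, column c lands at index r * 28 + c) is my understanding of the Kaggle data description rather than something stated in this post.

```python
import numpy as np

# A made-up 28 x 28 "image": every pixel in row i has value i,
# so rows are easy to tell apart after flattening.
image_2d = np.repeat(np.arange(28), 28).reshape(28, 28)

# Row-major flattening gives the 784-value layout of one CSV row (assumed ordering).
row_1d = image_2d.reshape(-1)

# Pixel (r, c) of the image lands at flat index r * 28 + c.
r, c = 3, 5
flat_index = r * 28 + c

# Reshaping back recovers the original 2D image.
restored = row_1d.reshape(28, 28)
```

This is the same `reshape(28, 28)` call the visualization code uses to turn one row back into an image.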


📒 Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import EarlyStopping

from sklearn.metrics import classification_report

📚 Raw Data Loading

# Raw Data Loading
df = pd.read_csv('/content/drive/MyDrive/KDT/kaggle/Digit Recognizer/01. train.csv')
display(df.head())


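📚 Data Visualization

# Pixel columns only (drop the label); each row is one flattened 28 x 28 image
pixels = df.drop('label', axis=1).values

fig = plt.figure()

axes = []
for i in range(10):
    axes.append(fig.add_subplot(2, 5, i + 1))
    axes[i].imshow(pixels[i].reshape(28, 28), cmap='gray_r')

plt.tight_layout()
plt.show()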

To see what the data looks like, I plotted the first 10 samples as images. This dataset has also already been cleaned of missing values, outliers, and duplicate rows, so there is no separate preprocessing to do.


📚 Normalization

x_data = df.drop('label', axis=1, inplace=False).values
t_data = df['label'].values.reshape(-1, 1)

scaler = MinMaxScaler()
scaler.fit(x_data)
x_data_norm = scaler.transform(x_data)

The pixel values range from 0 to 255, so I applied Min-Max Scaling to bring them into the 0 to 1 range before training the model.
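For reference, the following is a minimal NumPy sketch of what Min-Max scaling computes. It is not scikit-learn's implementation; the function name `min_max_scale` and the toy array are made up, and mapping constant columns to 0 mirrors how I understand `MinMaxScaler` to behave.

```python
import numpy as np

def min_max_scale(x):
    """Scale each column of x to [0, 1] using that column's own min and max."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # A constant column has range 0; dividing by 1 instead maps it to 0 rather than NaN.
    col_range[col_range == 0] = 1.0
    return (x - col_min) / col_range

# Toy data: three samples, three "pixel" columns.
toy = np.array([[0, 0, 10],
                [128, 0, 20],
                [255, 0, 30]])
scaled = min_max_scale(toy)
```

Because the scaling is per column, a pixel whose values only span 10 to 30 is still stretched to the full 0 to 1 range, which is not the same as dividing every pixel by 255.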


📚 Data Split

x_data_train_norm, x_data_test_norm, t_data_train, t_data_test = \
train_test_split(x_data_norm, t_data,
                 test_size=0.2, stratify=t_data)

λͺ¨λΈ ν•™μŠ΅ ν›„ λͺ¨λΈ 검증을 μœ„ν•΄ ν•™μŠ΅λ°μ΄ν„°μ™€ ν…ŒμŠ€νŠΈ 데이터λ₯Ό λ‚˜λˆ„μ–΄μ£Όμ—ˆλ‹€.


📚 DNN Model Implementation

model = Sequential()
model.add(Flatten(input_shape=(784,)))
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=128, activation='relu'))
model.add(Dense(units=10, activation='softmax'))
model.compile(optimizer=SGD(learning_rate=1e-1),
              loss = 'sparse_categorical_crossentropy',
              metrics=['acc'])
es_callback = EarlyStopping(monitor='val_loss', patience=5,
                            restore_best_weights=True, verbose=1)
                            
model.fit(x_data_train_norm, t_data_train,
          epochs=100, verbose=1,
          validation_split=0.3, batch_size=100,
          callbacks=[es_callback])

Because this is a multiclass classification model, the output layer uses the 'softmax' activation function. The target variable (digits 0 through 9) would normally need One-Hot Encoding, but using 'sparse_categorical_crossentropy' as the loss function lets the model accept the integer labels directly. With that, I implemented a very simple DNN model.
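The relationship between integer labels and one-hot targets can be sketched in NumPy as follows. This is an illustration of the math, not Keras's implementation; the function names and the 3-class toy scores are made up.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_cross_entropy(probs, labels):
    """Mean of -log(probability assigned to the true class), using integer labels."""
    rows = np.arange(len(labels))
    return -np.log(probs[rows, labels]).mean()

def one_hot_cross_entropy(probs, one_hot):
    """The same loss written against one-hot targets."""
    return -(one_hot * np.log(probs)).sum(axis=1).mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])   # made-up scores for 3 classes
labels = np.array([0, 2])              # integer class labels
probs = softmax(logits)                # each row sums to 1
one_hot = np.eye(3)[labels]            # one-hot version of the same labels

loss_sparse = sparse_cross_entropy(probs, labels)
loss_one_hot = one_hot_cross_entropy(probs, one_hot)
```

Both losses come out identical, because the one-hot vector only selects the log-probability of the true class. The sparse form simply skips building that vector.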


📚 Model Evaluation

print(classification_report(t_data_test, np.argmax(model.predict(x_data_test_norm), axis=1)))
#               precision    recall  f1-score   support

#            0       0.97      0.98      0.98       827
#            1       0.97      0.99      0.98       937
#            2       0.96      0.96      0.96       835
#            3       0.94      0.96      0.95       870
#            4       0.98      0.94      0.96       814
#            5       0.95      0.95      0.95       759
#            6       0.98      0.97      0.98       827
#            7       0.97      0.96      0.97       880
#            8       0.95      0.94      0.95       813
#            9       0.93      0.95      0.94       838

#     accuracy                           0.96      8400
#    macro avg       0.96      0.96      0.96      8400
# weighted avg       0.96      0.96      0.96      8400

λͺ¨λΈ 평가 κ²°κ³Ό F1 Scoreκ°€ 0.96이 좜λ ₯된 것을 확인할 수 μžˆμ—ˆλ‹€.


📚 Test Set Processing and Submission Data

# Test Data Loading
test = pd.read_csv('/content/drive/MyDrive/KDT/kaggle/Digit Recognizer/02. test.csv')

# Test Data Preprocessing
test_data_norm = scaler.transform(test.values)

# Predict
test_result = model.predict(test_data_norm)
test_result = np.argmax(test_result,axis=1)

# Submission Data Loading
submission = pd.read_csv('/content/drive/MyDrive/KDT/kaggle/Digit Recognizer/03. sample_submission.csv')

# Fill in the predicted labels and export
submission['Label'] = test_result

submission.to_csv('Digit_Recognition_DNN.csv', index=False)

I loaded the test data, applied the same preprocessing as for the training data, inserted the predictions into the submission file, and submitted it.
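The step from softmax outputs to submission rows can be sketched without the real model as follows. The probabilities are made up, and the `ImageId` column starting at 1 is an assumption about the sample submission file; only the `Label` column appears in the code above.

```python
import csv
import io
import numpy as np

# Made-up softmax outputs for 3 images over the 10 digit classes.
probs = np.full((3, 10), 0.01)
probs[0, 7] = 0.91
probs[1, 2] = 0.91
probs[2, 0] = 0.91

# The predicted digit is the index of the largest probability in each row.
labels = np.argmax(probs, axis=1)

# Write ImageId,Label rows to an in-memory CSV; ImageId is assumed to start at 1.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ImageId", "Label"])
for image_id, label in enumerate(labels, start=1):
    writer.writerow([image_id, int(label)])

csv_text = buffer.getvalue()
```

Loading the sample submission and overwriting its `Label` column, as done above, avoids having to build the ID column by hand.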


✨ Result


💭 Afterword

This dataset is very well cleaned, so I was able to get a fairly high score without any separate preprocessing. Building on this, I would like to try more difficult image processing problems.


🔗 Problem Source

https://www.kaggle.com/competitions/digit-recognizer

