MNIST
![](https://velog.velcdn.com/images/tim0902/post/9b29fa54-55e1-409e-9d37-b0efea04ad40/image.png)
![](https://velog.velcdn.com/images/tim0902/post/93e21a1e-89b2-43ef-b1ad-0e2fcff10d24/image.png)
![](https://velog.velcdn.com/images/tim0902/post/f5187920-b757-4a6b-98a0-e0591a4cf58d/image.png)
![](https://velog.velcdn.com/images/tim0902/post/9ec72afe-639b-4ea5-9986-b20ff6bcb5f3/image.png)
using PCA and kNN
Reading the data
import pandas as pd
df_train = pd.read_csv('../ds_study/data/mnist_train.csv')
df_test = pd.read_csv('../ds_study/data/mnist_test.csv')
df_train.shape, df_test.shape
![](https://velog.velcdn.com/images/tim0902/post/a78ca8e4-e924-4cd9-912a-f95cb2e568a9/image.png)
Shape of the train data
df_train.head()
![](https://velog.velcdn.com/images/tim0902/post/4fc76006-a051-48be-94cb-907b162d0c66/image.png)
import numpy as np
np.sqrt(784)
![](https://velog.velcdn.com/images/tim0902/post/12fa3390-fe80-47a5-9e66-4bb0fab32624/image.png)
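The square root shows that the 784 pixel columns of each row form a 28×28 image. A minimal sketch of that reshaping, using a made-up `row` array rather than real MNIST data:

```python
import numpy as np

# Hypothetical flattened image: 784 pixel values, like one CSV row without its label
row = np.arange(784)

side = int(np.sqrt(row.size))      # 28
image = row.reshape(side, side)    # row-major: the first 28 values become the first image row

print(image.shape)
```

Because the reshape is row-major, pixel 28 of the flat row lands at the start of the second image row.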
Shape of the test data
df_test
![](https://velog.velcdn.com/images/tim0902/post/99cd533a-3604-44d0-a30e-6ff3336d8a7d/image.png)
Preparing the data
import numpy as np
X_train = np.array(df_train.iloc[:,1:])
y_train = np.array(df_train['label'])
X_test = np.array(df_test.iloc[:,1:])
y_test = np.array(df_test['label'])
X_train.shape, y_train.shape, X_test.shape, y_test.shape
![](https://velog.velcdn.com/images/tim0902/post/8378e6c2-ee97-4342-bb49-08bbda4b3d7e/image.png)
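Slicing with `iloc[:, 1:]` assumes the label is the first column. As a possible alternative, the label can be dropped by name so the split does not depend on column order. The tiny frame and the `pixel0`/`pixel1` column names below are hypothetical:

```python
import pandas as pd

# Hypothetical 3-row frame with the same layout: 'label' first, then pixel columns
df = pd.DataFrame({
    'label': [5, 0, 4],
    'pixel0': [0, 0, 0],
    'pixel1': [12, 0, 255],
})

X = df.drop(columns='label').to_numpy()   # every column except the label
y = df['label'].to_numpy()                # the label column only

print(X.shape, y.shape)
```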
Checking the data
import random
samples = random.choices(population=range(0,60000), k=16)
samples
![](https://velog.velcdn.com/images/tim0902/post/be496987-2270-4b4f-878f-60b9c152c78f/image.png)
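Note that `random.choices` draws with replacement, so the same index can appear twice. If 16 distinct images are wanted, `random.sample` is one option. A small sketch (the seed is only there to make the run repeatable):

```python
import random

random.seed(13)  # fixed seed only so the sketch is repeatable

n_rows = 60000   # number of training rows, as in the range above

with_replacement = random.choices(range(n_rows), k=16)   # duplicates are possible
unique_idx = random.sample(range(n_rows), k=16)          # duplicates are impossible

print(len(set(unique_idx)))
```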
Plotting 16 random samples
import matplotlib.pyplot as plt
plt.figure(figsize=(14,12))
for idx, n in enumerate(samples):
    plt.subplot(4, 4, idx+1)
    plt.imshow(X_train[n].reshape(28,28), cmap='Greys', interpolation='nearest')
    plt.title(y_train[n])
plt.show()
![](https://velog.velcdn.com/images/tim0902/post/870c45bf-0636-4a7f-b6a5-a4d2a71de2a9/image.png)
fit
from sklearn.neighbors import KNeighborsClassifier
import time
start_time = time.time()
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print('Fit time :', time.time() - start_time)
![](https://velog.velcdn.com/images/tim0902/post/20c33579-9a48-4b8f-baa0-fa457a5aadfa/image.png)
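The same start/stop timing pattern is repeated for fit and predict. One way to avoid repeating it is a small context manager. The `timer` helper below is a hypothetical name, and it uses `time.perf_counter`, which is intended for measuring elapsed intervals:

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Print how long the enclosed block took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f'{label} : {time.perf_counter() - start:.4f} s')

# Stand-in workload; in the post this would wrap clf.fit(...) or clf.predict(...)
with timer('Sum time'):
    total = sum(range(1_000_000))
```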
Predicting on the test data
from sklearn.metrics import accuracy_score
start_time = time.time()
pred = clf.predict(X_test)
print('Predict time :', time.time() - start_time)
print(accuracy_score(y_test, pred))
![](https://velog.velcdn.com/images/tim0902/post/79f489a6-c898-4908-845b-626eed635346/image.png)
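kNN is a lazy learner: fitting mostly stores the training data, and the distance computations happen at prediction time. The brute-force sketch below illustrates the idea on toy data. It is not scikit-learn's implementation, which may use other search structures, and `knn_predict` is a hypothetical helper:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Brute-force kNN: majority vote among the k nearest training points."""
    preds = []
    for q in X_query:
        # Euclidean distance from the query to every training row
        dists = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        votes = np.bincount(y_train[nearest])
        preds.append(votes.argmax())   # ties go to the smallest label
    return np.array(preds)

# Toy data: two well-separated clusters
X_toy = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(X_toy, y_toy, np.array([[0.5, 0.5], [5.5, 5.5]]), k=3)
print(toy_pred)
```

Every query is compared against every training row, so the cost grows with both the number of training samples and the number of features, which is the motivation for reducing 784 dimensions.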
Reducing the dimensionality with PCA
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
pipe = Pipeline([
    ('pca', PCA()),
    ('clf', KNeighborsClassifier())
])
parameters = {
    'pca__n_components' : [2, 5, 10],
    'clf__n_neighbors' : [5, 10, 15]
}
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)
grid = GridSearchCV(pipe, parameters, cv=kf, n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)
![](https://velog.velcdn.com/images/tim0902/post/c55c8fb3-94f9-4878-9299-e3b3d59172c0/image.png)
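To make the `pca__n_components` parameter more concrete, here is a rough NumPy sketch of what PCA does: center the data, find the principal directions with an SVD, and keep the first few. The data is synthetic, and this is an illustration of the technique rather than scikit-learn's exact code path:

```python
import numpy as np

rng = np.random.default_rng(13)
# Synthetic data: 200 samples, 5 features, most variance in the first two directions
X_toy = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

X_centered = X_toy - X_toy.mean(axis=0)                  # PCA works on centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_ratio = S**2 / np.sum(S**2)                    # share of variance per component
n_components = 2
X_reduced = X_centered @ Vt[:n_components].T             # project onto the top components

print(X_reduced.shape)
print(explained_ratio.cumsum())
```

The cumulative ratio is one way to judge how many components are worth keeping before running a grid search.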
best score
grid.best_score_
![](https://velog.velcdn.com/images/tim0902/post/631e49d1-8c91-4f8f-bf70-256c47eb2ff2/image.png)
grid.best_params_
![](https://velog.velcdn.com/images/tim0902/post/a17e9b61-0c5e-436f-8792-967004ca12ce/image.png)
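Beyond the single best combination, `cv_results_` holds the score of every combination tried. The sketch below shows one way to view it as a table. It assumes scikit-learn's bundled 8×8 `load_digits` data as a small stand-in for the MNIST CSV, and the `*_small` names are hypothetical:

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Small stand-in dataset (8x8 digits bundled with scikit-learn), not the MNIST CSV
X_small, y_small = load_digits(return_X_y=True)

pipe_small = Pipeline([('pca', PCA()), ('clf', KNeighborsClassifier())])
params_small = {'pca__n_components': [2, 10], 'clf__n_neighbors': [5, 10]}
grid_small = GridSearchCV(pipe_small, params_small, cv=3)
grid_small.fit(X_small, y_small)

# One row per parameter combination, best-ranked first
cv_table = pd.DataFrame(grid_small.cv_results_)[
    ['param_pca__n_components', 'param_clf__n_neighbors',
     'mean_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(cv_table)
```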
With only this much effort, we already get about 93% accuracy.
pred = grid.best_estimator_.predict(X_test)
print(accuracy_score(y_test, pred))
![](https://velog.velcdn.com/images/tim0902/post/dd78ceb1-34c2-4f7c-8ff2-6b44ef257139/image.png)
Checking the results
def results(y_pred, y_test):
    from sklearn.metrics import classification_report, confusion_matrix
    print(classification_report(y_test, y_pred))

results(grid.predict(X_train), y_train)
![](https://velog.velcdn.com/images/tim0902/post/def0f2a1-fb56-4432-92a8-77a784156d4f/image.png)
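`confusion_matrix` is imported above but not called. As a sketch of what it would add, here it is on made-up labels: rows are the true classes and columns are the predicted classes, so off-diagonal cells show which digits get confused with which.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_toy = [0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 1, 2, 2, 2, 1]

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
```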
If you want to check a digit again
n = 700
plt.imshow(X_test[n].reshape(28,28), cmap='Greys', interpolation='nearest')
plt.show()
print('Answer is :', grid.best_estimator_.predict(X_test[n].reshape(1, 784)))
print('Real Label is :', y_test[n])
![](https://velog.velcdn.com/images/tim0902/post/42c992e9-70a8-43a4-91cb-189550213f16/image.png)
Checking the misclassified data
preds = grid.best_estimator_.predict(X_test)
preds
![](https://velog.velcdn.com/images/tim0902/post/3b1f5267-6d88-4e83-9370-c7de2a7a3801/image.png)
y_test
![](https://velog.velcdn.com/images/tim0902/post/7ecec785-9aad-4135-8844-ee56bfbf1ec1/image.png)
Picking out the misclassified samples
wrong_results = X_test[y_test != preds]
samples = random.choices(population=range(0,wrong_results.shape[0]), k=16)
plt.figure(figsize=(14,12))
for idx, n in enumerate(samples):
    plt.subplot(4, 4, idx +1)
    plt.imshow(wrong_results[n].reshape(28,28), cmap='Greys')
    pred_digit = grid.best_estimator_.predict(wrong_results[n].reshape(1,784))
    plt.title(str(pred_digit))
plt.show()
![](https://velog.velcdn.com/images/tim0902/post/59619ea3-40ee-4a4e-a7bd-34f00316806c/image.png)
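The titles above show only the predicted digit. Applying the same boolean mask to the labels and the predictions makes it possible to show the actual digit next to each wrong prediction. A minimal sketch with made-up arrays:

```python
import numpy as np

# Made-up labels and predictions for illustration only
y_true_toy = np.array([7, 2, 1, 0, 4])
y_pred_toy = np.array([7, 2, 7, 0, 9])

wrong_mask = y_true_toy != y_pred_toy     # True where the prediction is wrong
wrong_idx = np.flatnonzero(wrong_mask)    # positions of the mistakes

for i in wrong_idx:
    print(f'index {i}: predicted {y_pred_toy[i]}, actual {y_true_toy[i]}')
```

Indexing the existing predictions this way also avoids calling `predict` again for each plotted image.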