KNN - K Nearest Neighbors

YU NA Joe·2022년 4월 12일

KNN - K Nearest Neighbors (최근접이웃)

  • Supervised classification algorithm
  • It classifies a data point based on how its neighbors are classified.
  • https://hleecaster.com/ml-knn-concept/ (good, easy explanation)

How do we choose the factor k?

  • KNN is based on feature similarity. Choosing the right value of k is a process called parameter tuning.
  • There are generally two ways:
    1) Sqrt(n), where n is the total number of data points.
    2) An odd value of k is used to avoid ties between two classes.
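The two rules above can be combined into one small helper (`choose_k` is a hypothetical name for illustration, not from the original): take the square root of n, then nudge the result to an odd number so a two-class vote cannot tie.

```python
import math

def choose_k(n):
    """Sketch of the sqrt(n) rule: round sqrt(n) and force it odd
    so a majority vote between two classes can't end in a tie."""
    k = round(math.sqrt(n))
    if k % 2 == 0:
        k += 1  # even k could tie 50/50 between two classes
    return k

print(choose_k(231))  # sqrt(231) ≈ 15.2 -> 15 (already odd)
```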


When do we use the KNN algorithm?

  • Data is labeled (supervised classification)
  • Data is noise-free
  • Dataset is small (because KNN is a lazy learner, i.e., it doesn't learn a discriminative function from the training set)

How does KNN Algorithm work?

(step-by-step figures omitted)
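The figures walk through the steps: measure the distance from the query point to every training point, take the k nearest, and let them vote. A minimal from-scratch sketch of the same idea (`knn_predict` and the toy data are illustrative, not from the original):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)  # count the neighbors' labels
    return votes.most_common(1)[0][0]             # return the majority class

# toy 2-D data: class 0 clusters near the origin, class 1 near (5, 5)
X = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.5, 0.5]), k=3))  # -> 0
print(knn_predict(X, y, np.array([5.5, 5.5]), k=3))  # -> 1
```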

## predict whether a person will have diabetes or not
import pandas as pd 
import numpy as np 

data = pd.read_csv("C:\\LECTRUE\\dataSet\\diabetes.csv")
data
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
... ... ... ... ... ... ... ... ... ...
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0

768 rows × 9 columns

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
data.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
data.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

Columns that affect the outcome should not contain 0, so here we will simply replace zeros with the column mean.

How?!

Since the data has no np.nan values, go 0 -> np.nan -> mean.

# replace zeros with the column mean
zero_not_accepted = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

for col in zero_not_accepted:
    data[col] = data[col].replace(0, np.nan)                 # 0 -> np.nan
    data[col] = data[col].replace(np.nan, data[col].mean())  # np.nan -> column mean
data
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148.0 72.0 35.00000 155.548223 33.6 0.627 50 1
1 1 85.0 66.0 29.00000 155.548223 26.6 0.351 31 0
2 8 183.0 64.0 29.15342 155.548223 23.3 0.672 32 1
3 1 89.0 66.0 23.00000 94.000000 28.1 0.167 21 0
4 0 137.0 40.0 35.00000 168.000000 43.1 2.288 33 1
... ... ... ... ... ... ... ... ... ...
763 10 101.0 76.0 48.00000 180.000000 32.9 0.171 63 0
764 2 122.0 70.0 27.00000 155.548223 36.8 0.340 27 0
765 5 121.0 72.0 23.00000 112.000000 26.2 0.245 30 0
766 1 126.0 60.0 29.15342 155.548223 30.1 0.349 47 1
767 1 93.0 70.0 31.00000 155.548223 30.4 0.315 23 0

768 rows × 9 columns

Split the data into training and test sets

training = data.iloc[:, :-1]  # feature columns (X)
training
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 6 148.0 72.0 35.00000 155.548223 33.6 0.627 50
1 1 85.0 66.0 29.00000 155.548223 26.6 0.351 31
2 8 183.0 64.0 29.15342 155.548223 23.3 0.672 32
3 1 89.0 66.0 23.00000 94.000000 28.1 0.167 21
4 0 137.0 40.0 35.00000 168.000000 43.1 2.288 33
... ... ... ... ... ... ... ... ...
763 10 101.0 76.0 48.00000 180.000000 32.9 0.171 63
764 2 122.0 70.0 27.00000 155.548223 36.8 0.340 27
765 5 121.0 72.0 23.00000 112.000000 26.2 0.245 30
766 1 126.0 60.0 29.15342 155.548223 30.1 0.349 47
767 1 93.0 70.0 31.00000 155.548223 30.4 0.315 23

768 rows × 8 columns

test = data.iloc[:, -1]  # target column (y): Outcome
test
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(training, test,  test_size =0.3, random_state=10)

Feature Scaling

  • A technique to standardize the independent features in the data to a fixed range
  1. Standardization

  2. Normalization
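The two techniques differ in what "fixed range" means; a quick numeric sketch on a made-up Glucose-like column (the values are illustrative):

```python
import numpy as np

x = np.array([85.0, 148.0, 183.0, 89.0, 137.0])  # made-up feature values

# 1. Standardization: z = (x - mean) / std  -> result has mean ≈ 0, std ≈ 1
z = (x - x.mean()) / x.std()

# 2. Normalization (min-max): (x - min) / (max - min) -> result lies in [0, 1]
m = (x - x.min()) / (x.max() - x.min())

print(z.mean(), z.std())  # mean ≈ 0, std ≈ 1
print(m.min(), m.max())   # 0.0 and 1.0
```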

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# What is the difference between fit_transform and transform?
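To answer the question in the comment: `fit_transform` learns the scaling statistics (mean and std) from the training data and applies them, while `transform` reuses those already-learned statistics on new data, so the test set never influences the scaler. A hand-rolled sketch of the same distinction, using toy numbers rather than the diabetes data:

```python
import numpy as np

X_train = np.array([[1.0], [3.0], [5.0]])  # toy training column
X_test = np.array([[3.0], [7.0]])          # toy test column

# fit_transform = "fit" (learn mean/std from the TRAINING data) + "transform"
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu) / sigma

# transform = reuse the statistics learned on X_train; the test set must NOT
# contribute its own mean/std, or test information would leak into training
X_test_s = (X_test - mu) / sigma

print(X_test_s[0, 0])  # (3 - 3) / sigma = 0.0
```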

Train the model (KNN) on the training set

from sklearn.neighbors import KNeighborsClassifier
# instantiating the object

knn = KNeighborsClassifier()
# Hyperparameter tuning

How do we choose n_neighbors?

  • There are generally two ways:
    1) Sqrt(n), where n is the total number of data points.
    2) An odd value of k is used to avoid ties between two classes.

Let's use option 1.

import math 
math.sqrt(len(data))
27.712812921102035
math.sqrt(len(y_test))
15.198684153570664
# use n_neighbors = 15
knn = KNeighborsClassifier(n_neighbors = 15)
# What is p?
#  - p = 1: manhattan_distance
#  - p = 2: euclidean_distance

# use Euclidean distance
knn = KNeighborsClassifier(n_neighbors=15, p=2)
# train the model on the data

knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=15)
# use the trained model to make predictions
pred = knn.predict(X_test)
pred
array([1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)
y_test
568    0
620    0
456    0
197    1
714    0
      ..
345    0
408    1
304    0
686    0
202    0
Name: Outcome, Length: 231, dtype: int64
# Evaluate the model

from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)
print("{0:.3f}".format(accuracy_score(y_test, pred)))
0.706
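`accuracy_score` is simply the fraction of predictions that match the true labels; the same number can be computed by hand (toy arrays below, not the actual `y_test`/`pred`):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])  # toy ground-truth labels
y_pred = np.array([0, 1, 0, 0, 1])  # toy predictions (one mistake)

accuracy = (y_true == y_pred).mean()  # fraction of correct predictions
print(accuracy)  # -> 0.8
```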
Source: https://www.youtube.com/watch?v=4HKqjENq9OU&t=1283s
