KNN - K Nearest Neighbors

YU NA Joe·2022년 4월 12일

KNN - K Nearest Neighbors (최근접이웃)

  • Supervised classification algorithm
  • It classifies a data point based on how its neighbors are classified.
  • https://hleecaster.com/ml-knn-concept/ (good, easy explanation)

How do we choose the factor k?

  • KNN is based on feature similarity. Choosing the right value of k is a process called parameter tuning.
  • There are generally two ways:
    1) Sqrt(n), where n is the total number of data points.
    2) An odd value of k is used to avoid ties between two classes.
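The two rules above can be combined into one small helper (`choose_k` is a hypothetical name for illustration, not from the original): take the square root of n, then nudge the result to an odd number so a two-class vote cannot tie.

```python
import math

def choose_k(n):
    """Sketch of the sqrt(n) rule: round sqrt(n) and force it odd
    so a majority vote between two classes can't end in a tie."""
    k = round(math.sqrt(n))
    if k % 2 == 0:
        k += 1  # even k could tie 50/50 between two classes
    return k

print(choose_k(231))  # sqrt(231) ≈ 15.2 -> 15 (already odd)
```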


When do we use the KNN algorithm?

  • Data is labeled (supervised classification)
  • Data is noise-free
  • Dataset is small (because KNN is a lazy learner, i.e., it doesn't learn a discriminative function from the training set)

How does KNN Algorithm work?

(step-by-step figures omitted)
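The figures walk through the steps: measure the distance from the query point to every training point, take the k nearest, and let them vote. A minimal from-scratch sketch of the same idea (`knn_predict` and the toy data are illustrative, not from the original):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)  # count the neighbors' labels
    return votes.most_common(1)[0][0]             # return the majority class

# toy 2-D data: class 0 clusters near the origin, class 1 near (5, 5)
X = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.5, 0.5]), k=3))  # -> 0
print(knn_predict(X, y, np.array([5.5, 5.5]), k=3))  # -> 1
```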

## predict whether a person will have diabetes or not
import pandas as pd 
import numpy as np 

data = pd.read_csv("C:\\LECTRUE\\dataSet\\diabetes.csv")
data
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
... ... ... ... ... ... ... ... ... ...
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0

768 rows × 9 columns

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
data.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
data.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

Columns that affect the outcome should not contain 0, so here we will simply replace zeros with the column mean.

How?!

Since the data has no np.nan values, go 0 -> np.nan -> mean.

# replace zeros with the column mean
zero_not_accepted = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

for col in zero_not_accepted:
    data[col] = data[col].replace(0, np.nan)                 # 0 -> np.nan
    data[col] = data[col].replace(np.nan, data[col].mean())  # np.nan -> column mean
data
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148.0 72.0 35.00000 155.548223 33.6 0.627 50 1
1 1 85.0 66.0 29.00000 155.548223 26.6 0.351 31 0
2 8 183.0 64.0 29.15342 155.548223 23.3 0.672 32 1
3 1 89.0 66.0 23.00000 94.000000 28.1 0.167 21 0
4 0 137.0 40.0 35.00000 168.000000 43.1 2.288 33 1
... ... ... ... ... ... ... ... ... ...
763 10 101.0 76.0 48.00000 180.000000 32.9 0.171 63 0
764 2 122.0 70.0 27.00000 155.548223 36.8 0.340 27 0
765 5 121.0 72.0 23.00000 112.000000 26.2 0.245 30 0
766 1 126.0 60.0 29.15342 155.548223 30.1 0.349 47 1
767 1 93.0 70.0 31.00000 155.548223 30.4 0.315 23 0

768 rows × 9 columns

Split the data into training and test sets

training = data.iloc[:, :-1]  # feature columns (X)
training
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 6 148.0 72.0 35.00000 155.548223 33.6 0.627 50
1 1 85.0 66.0 29.00000 155.548223 26.6 0.351 31
2 8 183.0 64.0 29.15342 155.548223 23.3 0.672 32
3 1 89.0 66.0 23.00000 94.000000 28.1 0.167 21
4 0 137.0 40.0 35.00000 168.000000 43.1 2.288 33
... ... ... ... ... ... ... ... ...
763 10 101.0 76.0 48.00000 180.000000 32.9 0.171 63
764 2 122.0 70.0 27.00000 155.548223 36.8 0.340 27
765 5 121.0 72.0 23.00000 112.000000 26.2 0.245 30
766 1 126.0 60.0 29.15342 155.548223 30.1 0.349 47
767 1 93.0 70.0 31.00000 155.548223 30.4 0.315 23

768 rows × 8 columns

test = data.iloc[:, -1]  # target column (y): Outcome
test
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(training, test,  test_size =0.3, random_state=10)

Feature Scaling

  • A technique to standardize the independent features in the data to a fixed range
  1. Standardization

  2. Normalization
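The two techniques differ in what "fixed range" means; a quick numeric sketch on a made-up Glucose-like column (the values are illustrative):

```python
import numpy as np

x = np.array([85.0, 148.0, 183.0, 89.0, 137.0])  # made-up feature values

# 1. Standardization: z = (x - mean) / std  -> result has mean ≈ 0, std ≈ 1
z = (x - x.mean()) / x.std()

# 2. Normalization (min-max): (x - min) / (max - min) -> result lies in [0, 1]
m = (x - x.min()) / (x.max() - x.min())

print(z.mean(), z.std())  # mean ≈ 0, std ≈ 1
print(m.min(), m.max())   # 0.0 and 1.0
```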

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# What is the difference between fit_transform and transform?
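To answer the question in the comment: `fit_transform` learns the scaling statistics (mean and std) from the training data and applies them, while `transform` reuses those already-learned statistics on new data, so the test set never influences the scaler. A hand-rolled sketch of the same distinction, using toy numbers rather than the diabetes data:

```python
import numpy as np

X_train = np.array([[1.0], [3.0], [5.0]])  # toy training column
X_test = np.array([[3.0], [7.0]])          # toy test column

# fit_transform = "fit" (learn mean/std from the TRAINING data) + "transform"
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu) / sigma

# transform = reuse the statistics learned on X_train; the test set must NOT
# contribute its own mean/std, or test information would leak into training
X_test_s = (X_test - mu) / sigma

print(X_test_s[0, 0])  # (3 - 3) / sigma = 0.0
```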

Train the model (KNN) on the training set

from sklearn.neighbors import KNeighborsClassifier
# instantiating the object

knn = KNeighborsClassifier()
# Hyperparameter tuning

How do we choose n_neighbors?

  • There are generally two ways:
    1) Sqrt(n), where n is the total number of data points.
    2) An odd value of k is used to avoid ties between two classes.

Let's use option 1.

import math 
math.sqrt(len(data))
27.712812921102035
math.sqrt(len(y_test))
15.198684153570664
# use n_neighbors = 15
knn = KNeighborsClassifier(n_neighbors = 15)
# What is p?
#  - p = 1: manhattan_distance
#  - p = 2: euclidean_distance

# use Euclidean distance
knn = KNeighborsClassifier(n_neighbors=15, p=2)
# train the model on the data

knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=15)
# use the trained model to make predictions
pred = knn.predict(X_test)
pred
array([1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)
y_test
568    0
620    0
456    0
197    1
714    0
      ..
345    0
408    1
304    0
686    0
202    0
Name: Outcome, Length: 231, dtype: int64
# Evaluate the model

from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)
print("{0:.3f}".format(accuracy_score(y_test, pred)))
0.706
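`accuracy_score` is simply the fraction of predictions that match the true labels; the same number can be computed by hand (toy arrays below, not the actual `y_test`/`pred`):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])  # toy ground-truth labels
y_pred = np.array([0, 1, 0, 0, 1])  # toy predictions (one mistake)

accuracy = (y_true == y_pred).mean()  # fraction of correct predictions
print(accuracy)  # -> 0.8
```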
Source: https://www.youtube.com/watch?v=4HKqjENq9OU&t=1283s
