The K-Nearest Neighbors (KNN) classifier is a type of instance-based learning, or lazy learning, where the function is only approximated locally, and all computation is deferred until function evaluation. It is one of the simplest of all machine learning algorithms, primarily used for classification, and operates on the principle that similar things exist in close proximity.
In KNN, the classification of an observation is determined by a plurality vote of its neighbors, with the observation being assigned to the class most common among its $k$ nearest neighbors, as measured by a distance metric (e.g., Euclidean distance). $k$ is a positive integer, typically small. If $k = 1$, the object is simply assigned to the class of its single nearest neighbor.
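The vote itself is just a frequency count over the neighbor labels. A minimal sketch of that step, with made-up neighbor labels (not luma code):

```python
from collections import Counter

# labels of the k = 5 nearest neighbors (made-up values for illustration)
neighbor_labels = [1, 0, 1, 1, 2]

# plurality vote: the most common label among the neighbors wins
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # -> 1
```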
The choice of distance metric can significantly influence the performance of KNN. Common distance metrics include:
- Euclidean distance: $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}$
- Manhattan distance: $d(\mathbf{x}, \mathbf{y}) = \sum_i |x_i - y_i|$
- Minkowski distance: $d(\mathbf{x}, \mathbf{y}) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$, which generalizes the two above
- Chebyshev distance: $d(\mathbf{x}, \mathbf{y}) = \max_i |x_i - y_i|$
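To make the differences concrete, the snippet below computes each metric for one made-up pair of points with plain NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # ~3.606
manhattan = np.sum(np.abs(x - y))                  # 5.0
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)  # ~3.271 (reduces to the two above for p = 2, 1)
chebyshev = np.max(np.abs(x - y))                  # 3.0
```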
Given a dataset $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$ containing $n$ samples with their corresponding labels, the task is to classify a new sample $\mathbf{x}$. The distance between $\mathbf{x}$ and each sample in $D$ is calculated using a chosen distance metric. The $k$ samples in $D$ with the smallest distance to $\mathbf{x}$, denoted $N_k(\mathbf{x})$, are identified, and the frequency of each class $c$ within these $k$ samples is counted:

$$\mathrm{count}(c) = \sum_{(\mathbf{x}_i,\, y_i) \in N_k(\mathbf{x})} \mathbb{1}(y_i = c)$$

The predicted class $\hat{y}$ for $\mathbf{x}$ is then given by:

$$\hat{y} = \underset{c}{\arg\max}\; \mathrm{count}(c)$$
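The whole procedure maps directly onto a few lines of NumPy. The helper below (`knn_predict`) and its toy data are purely illustrative and not part of luma:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    # illustrative helper, not part of luma
    # distance from x_new to every training sample (Euclidean)
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # N_k(x): indices of the k samples with the smallest distance
    nearest = np.argsort(dists)[:k]
    # count(c): frequency of each class among the k neighbors
    classes, counts = np.unique(y_train[nearest], return_counts=True)
    # y_hat = argmax_c count(c)
    return classes[np.argmax(counts)]

# toy data: two well-separated clusters
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.0]), k=3))  # -> 0
```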
n_neighbors : int, default = 5
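A minimal standalone usage of this parameter, assuming KNNClassifier follows the same fit/predict pattern used in the longer example below:

```python
from luma.classifier.neighbors import KNNClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

knn = KNNClassifier(n_neighbors=5)  # n_neighbors: how many neighbors vote on each prediction
knn.fit(X, y)
print(knn.predict(X[:5]))           # class predictions for the first five samples
```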
Test with the wine dataset, reducing dimensionality with RFE using GaussianNaiveBayes as the base estimator:
from luma.classifier.naive_bayes import GaussianNaiveBayes
from luma.classifier.neighbors import KNNClassifier
from luma.preprocessing.scaler import StandardScaler
from luma.reduction.selection import RFE
from luma.model_selection.split import TrainTestSplit
from luma.model_selection.search import GridSearchCV
from luma.visual.evaluation import DecisionRegion, ConfusionMatrix
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
import numpy as np
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = TrainTestSplit(X, y,
                                                   test_size=0.2,
                                                   random_state=42).get
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)  # reuse the training statistics; do not refit on the test split
rfe = RFE(estimator=GaussianNaiveBayes(),
          n_features=2,
          step_size=1,
          cv=5,
          random_state=42,
          verbose=True)
rfe.fit(X_train_std, y_train)
X_train_rfe = rfe.transform(X_train_std)
X_test_rfe = rfe.transform(X_test_std)
param_grid = {'n_neighbors': range(2, 10)}
grid = GridSearchCV(estimator=KNNClassifier(),
                    param_grid=param_grid,
                    cv=5,
                    refit=True,
                    random_state=42)
grid.fit(X_train_rfe, y_train)
knn_best = grid.best_model
X_concat = np.concatenate((X_train_rfe, X_test_rfe))
y_concat = np.concatenate((y_train, y_test))
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
dec = DecisionRegion(knn_best, X_concat, y_concat)
dec.plot(ax=ax1)
conf = ConfusionMatrix(y_concat, knn_best.predict(X_concat))
conf.plot(ax=ax2, show=True)
# RFE best score: 0.94408766
# RFE final features: (9, 12)
# Best params: {'n_neighbors': 5}
# Best score: 0.9299539170506913
KNN is widely used in a variety of applications, such as recommendation systems, pattern and image recognition, anomaly detection, and the imputation of missing values.
- Cover, Thomas, and Peter Hart. "Nearest Neighbor Pattern Classification." IEEE Transactions on Information Theory 13.1 (1967): 21-27.
- Dasarathy, Belur V., ed. "Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques." IEEE Computer Society Press, 1991.
- Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. "The Elements of Statistical Learning." Vol. 1. No. 10. New York: Springer Series in Statistics, 2001.