Supervised Learning with scikit-learn



Classification


Supervised Learning

  • Machine Learning: the art and science of giving computers the ability to learn to make decisions from data w/o being explicitly programmed
    • (ex) learning to predict whether an email is spam, clustering Wikipedia into different categories, etc.
  • Unsupervised learning: uses unlabeled data
    • uncovering hidden patterns from unlabeled data
    • (ex) clustering: grouping customers into distinct categories
  • Reinforcement learning
    • software agents interact with an environment
      • learn how to optimize their behavior
      • given a system of rewards & punishments
      • draws inspiration from behavioral psychology
  • Supervised learning: uses labeled data
    • predictor variables/features and a target variable
    • aim: predict the target variable, given the predictor variables
      • classification: target variable consists of categories
      • regression: target variable is continuous
    • applications
      • automate time-consuming or expensive manual tasks
      • make predictions about the future
      • need labeled data

Exploratory data analysis

The Iris dataset in scikit-learn

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
iris = datasets.load_iris()

type(iris) #sklearn.datasets.base.Bunch
print(iris.keys()) 
type(iris.data), type(iris.target)
iris.data.shape
iris.target_names
x = iris.data
y = iris.target
df = pd.DataFrame(x, columns=iris.feature_names)

Visual EDA

_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8],
                               s=150, marker='D')
  • c: color (data points in the figure will be colored by this value)
  • figsize: size of figure

Result

(figure: scatter matrix of the four Iris features, points colored by species)

  • diagonal: histograms of the features corresponding to row & column
  • off-diagonal: scatter plots of the column feature vs. row feature colored by target variable

The Classification Challenge

K-Nearest Neighbors (KNN)

  • predicts the label of a data point by looking at the 'k' closest labeled data points
    • the data points vote on what label the unlabeled point should have

Scikit-learn fit and predict

  • all ML models implemented as Python classes
    • they implement the algorithms for learning & predicting
    • store the information learned from the data
  • training a model on the data = 'fitting' a model to the data
    • .fit() method
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris['data'], iris['target'])
  • data should be a NumPy array or pandas DataFrame
  • features should be continuous values (not categories)
  • there should be no missing values in the data
  • each column is a feature & each row is a data point
  • in the .fit() method, the first argument should be the features & the second the target variable
x_new = np.array([[5.6, 2.5, 3.1, 2.6], [5.7, 2.6, 3.1, 2.6],
                  [1.5, 2.7, 4.1, 2.7]])
prediction = knn.predict(x_new)
  • prediction: array of 3 labels, one prediction for each observation in x_new

Measuring model performance

  • accuracy is a commonly used metric in classification
    • accuracy = fraction of correct predictions

Procedure

  1. split data into training & test set
  2. fit/train the classifier on the training set
  3. make predictions on test set
  4. compare predictions with the known labels
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=21, stratify=y)
  • train_test_split()
    • 1st argument: feature data, 2nd argument: targets or labels
    • test_size: proportion of the original data to be used for the test set
    • random_state: sets a seed for the random number generator that splits the data into train & test
    • stratify: pass the label array (y) so the class proportions in the train & test sets match those of the original dataset
    • returns 4 arrays: training data, test data, training labels, test labels
    • by default, test data: 25%, training data: 75%
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)
knn.score(X_test, y_test)

Model Complexity

  • larger k = smoother decision boundary = less complex model
  • complex models run the risk of being sensitive to noise in the data, rather than reflecting the general trend → overfitting
  • too large a k → a boundary so smooth it misses real structure → underfitting (see the sketch below)
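A minimal sketch of this trade-off, reusing the X_train/X_test split from above; the k range and variable names are my own choices, not from the course:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

neighbors = np.arange(1, 26)              # candidate k values (arbitrary range)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)   # accuracy on seen data
    test_accuracy[i] = knn.score(X_test, y_test)      # accuracy on unseen data
plt.plot(neighbors, train_accuracy, label='Training accuracy')
plt.plot(neighbors, test_accuracy, label='Test accuracy')
plt.xlabel('n_neighbors (k)')
plt.ylabel('Accuracy')
plt.legend()
plt.show()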

Regression


Introduction to Regression

  • target value: continuous value
boston = pd.read_csv('boston.csv')
X = boston.drop('MEDV', axis=1).values #drop the target column
y = boston['MEDV'].values

#predicting house value from a single feature
X_rooms = X[:, 5]
y = y.reshape(-1, 1)
X_rooms = X_rooms.reshape(-1, 1)

Fitting a regression model

import numpy as np
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_rooms, y)
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1, 1)
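prediction_space is presumably meant for plotting the fitted line; a minimal sketch (the axis labels are my guesses for the Boston data, not from the original):

plt.scatter(X_rooms, y, color='blue')                      # raw data points
plt.plot(prediction_space, reg.predict(prediction_space),
         color='black', linewidth=3)                       # fitted regression line
plt.xlabel('Number of rooms')
plt.ylabel('Value of house / 1000 ($)')
plt.show()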

Basics of Linear Regression

  • define an error function for any given line & choose the line that minimizes the error function
    • aka loss function (cost function)
  • we minimize the sum of squares of the residuals
    • Ordinary least squares (OLS): minimize sum of squares of residuals
  • Linear regression higher dimensions
    • must specify a coefficient aᵢ for each feature xᵢ & an intercept b
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
reg_all.score(X_test, y_test)

Cross-validation

  • cross-validation motivation
    • model performance is dependent on way the data is split
    • not representative of model's ability to generalize
    • so, we use cross-validation to avoid the problem of the chosen metric being dependent on the train test split
  • cross-validation basics
    • split the dataset into 5 groups (folds)
    • hold out the first fold as a test set & fit the model on the remaining 4 folds
    • predict on the test
    • compute the metric of interest
    • repeat for the second fold, third fold.. etc.
  • interpreting cross-validation
    • as a result of 5-fold cross-validation, we get 5 values of R-squared, from which we compute statistics of interest (mean, median, 95% confidence intervals)
  • cross-validation & model performance
    • split into five folds - 5-fold cross validation
    • split into 10 folds - 10-fold cross validation
    • split into k folds - k-fold cross validation (CV)
    • trade-off of using more folds: more computationally expensive
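The fold mechanics described above can be written out explicitly with scikit-learn's KFold; a minimal sketch, assuming the Boston X and y arrays from the regression section:

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    reg = LinearRegression()
    reg.fit(X[train_idx], y[train_idx])                 # fit on the 4 remaining folds
    scores.append(reg.score(X[test_idx], y[test_idx]))  # R-squared on the held-out fold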

Cross-validation in scikit-learn

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv = 5)
  • cross_val_score(regressor, feature data, target data, cv = #)
    • cv: specifies number of folds
    • returns an array of cross-validation scores
      • length of array = # of folds utilized
      • score is R-squared (by default)
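From cv_results, the statistics of interest mentioned earlier can be computed directly (the "95% interval" here is just the percentile spread of the 5 fold scores):

import numpy as np
print(np.mean(cv_results))                     # mean R-squared across folds
print(np.median(cv_results))                   # median
print(np.percentile(cv_results, [2.5, 97.5]))  # 95% interval of the fold scores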

Regularized regression

  • why regularize?
    • fitting a linear regression in a high-dimensional space can produce large coefficients, which may lead to overfitting
    • we should penalize large coefficients —> Regularization

Ridge regression

  • loss function = OLS loss function + α × Σᵢ aᵢ²
    • models are penalized for coefficients w/ a large magnitude
    • alpha: parameter we need to choose
      • controls complexity
      • alpha = 0: OLS (possibly overfitting)
      • high alpha: large coefficients are significantly penalized (underfitting)
from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train, y_train)
ridge_pred=ridge.predict(X_test)
ridge.score(X_test, y_test)
  • Ridge(alpha = #, normalize = True/False)
    • alpha: alpha value
    • normalize = True: puts all variables on the same scale (removed in newer scikit-learn versions; scale the data separately, e.g. with StandardScaler)

Lasso regression

  • loss function = OLS loss function + α × Σᵢ |aᵢ|
  • can be used to select important features of a dataset
    • shrinks the coefficients of less important features to 0
    • features with non-zero coefficients are effectively selected by the algorithm
from sklearn.linear_model import Lasso
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
lasso = Lasso(alpha=0.1, normalize=True)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
lasso.score(X_test, y_test)

Feature selection of Lasso regression

from sklearn.linear_model import Lasso
names = boston.drop('MEDV', axis=1).columns
lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation=60)
_ = plt.ylabel('Coefficients')
plt.show()

(figure: lasso coefficient plotted for each Boston feature)

Fine-tuning your model


How good is your model?

  • accuracy may not be the best measure
  • ways to diagnose classification predictions
    • Confusion matrix: 2-by-2 matrix that summarizes predictive performance (given a binary classifier)
      • top left & bottom right are correctly labeled (True)
      • class of interest: positive class (i.e. spam)
      • accuracy = (sum of the diagonal) / (total sum of the matrix)
    • Metrics from the confusion matrix
      • precision = (number of true positives) / (total number of true positives and false positives)
        • aka positive predictive value (PPV)
        • high precision: low false positive rate
          • (ex) not many real emails were predicted as spam
      • recall = (number of true positives) / (total number of true positives and false negatives)
        • aka sensitivity, hit rate, true positive rate
        • high recall: classifier predicted most positives correctly
      • F1 score = 2 × (precision × recall) / (precision + recall)
        • aka harmonic mean of precision and recall
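A tiny worked example of these formulas with made-up confusion-matrix counts (tp/fp/fn/tn are hypothetical, not from the course):

tp, fp, fn, tn = 30, 10, 5, 55                      # hypothetical confusion-matrix cells
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.85
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # ~0.857
f1 = 2 * precision * recall / (precision + recall)  # ~0.800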

Confusion matrix in scikit-learn

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
knn = KNeighborsClassifier(n_neighbors=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state = 42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
  • classification_report(y_test, y_pred)
    • 1st argument: true label, 2nd argument: prediction

Logistic regression and the ROC curve

  • Logistic regression for binary classification
    • logistic regression outputs probabilities
      • given the features, log reg outputs a probability p with respect to the target variable (the probability that the label is 1)
      • p > 0.5 → data labeled as 1
      • p < 0.5 → data labeled as 0
    • log reg produces a linear decision boundary

(figure: linear decision boundary produced by logistic regression)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state = 42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
  • by default, logistic regression threshold = 0.5
  • thresholds affect true & false positive rates
    • threshold == 0: model predicts 1 for all data
      • true positive rate == false positive rate == 1
    • threshold == 1: model predicts 0 for all data
      • both true & false positive rates are 0
    • threshold in between 0 & 1: series of different false positive & true positive rates
  • Receiver Operating Characteristic (ROC) curve: the set of points we get when trying all possible thresholds
from sklearn.metrics import roc_curve
y_pred_prob = logreg.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()
  • roc_curve(y_test, y_pred_prob)
    • 1st argument: actual labels, 2nd argument: predicted probabilities
    • fpr: false positive rate, tpr: true positive rate
  • logreg.predict_proba(X_test)[:,1]
    • returns array with 2 columns
      • column 0 holds the probability of label 0; column 1 holds the probability of label 1
    • we choose the second column (index 1)
      • the predicted probability that each sample's label is 1

(figure: ROC curve for the logistic regression classifier)

Area under the ROC curve (AUC)

  • larger area under the ROC curve —> better model
from sklearn.metrics import roc_auc_score
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state = 42)
logreg.fit(X_train, y_train)
y_pred_prob = logreg.predict_proba(X_test)[:,1]
roc_auc_score(y_test, y_pred_prob)

AUC using cross-validation

from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')

Hyperparameter tuning

  • linear regression: choosing parameters
  • Ridge/lasso regression: choosing alpha
  • kNN: choosing n_neighbors

Hyperparameters: parameters that cannot be explicitly learned by fitting the model

  • (ex) alpha, n_neighbors, etc.
  • choosing the correct hyperparameter = hyperparameter tuning
    • essential to use cross-validation
  • Grid search cross-validation

(figure: grid of candidate hyperparameter values, one CV score per cell)

  1. try every combination of parameters in the grid
  2. fill up the grid
  3. choose the combination with best performance
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': np.arange(1, 50)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv = 5)
knn_cv.fit(X, y)
knn_cv.best_params_ #best parameter
knn_cv.best_score_ #best performance

Hold-out set for final evaluation

  • how well can the model perform on never-before-seen data?
    • using all data for cross-validation is not ideal
  1. split data into training & hold-out set at the beginning
  2. perform grid search cross-validation on training set
  3. choose best hyperparameters & evaluate on hold-out set
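A minimal sketch of these three steps, reusing the kNN grid from the previous section (variable names like X_hold are my own):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# 1. hold out data the model never sees during tuning
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=42)
# 2. grid search cross-validation on the training set only
knn_cv = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': np.arange(1, 50)}, cv=5)
knn_cv.fit(X_train, y_train)
# 3. evaluate the best hyperparameters on the hold-out set
print(knn_cv.best_params_, knn_cv.score(X_hold, y_hold))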

Preprocessing and pipelines


  • dealing with categorical features
    • scikit-learn does not accept categorical features by default
    • we need to encode categorical features numerically → dummy variables
      • 0: observation was not the category
      • 1: observation was the category
  • Dummy variables

(table: the 'origin' column encoded as one dummy column per category)

  • Dealing with categorical features in Python
    • scikit-learn: OneHotEncoder()
    • pandas: get_dummies()
import pandas as pd
df = pd.read_csv('auto.csv')
df_origin = pd.get_dummies(df)  # one dummy column per category of 'origin'

# the two lines below are alternatives: either let pandas drop the first dummy,
# or drop one redundant dummy column manually
df_origin = pd.get_dummies(df, drop_first=True)
df_origin = df_origin.drop('origin_Asia', axis=1)
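The notes mention scikit-learn's OneHotEncoder() but only show get_dummies(); a minimal sketch of the scikit-learn route (exact parameter and method names vary slightly across scikit-learn versions):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop='first')                             # drop='first' mirrors drop_first=True
origin_encoded = ohe.fit_transform(df[['origin']]).toarray()  # expects a 2-D input
print(ohe.get_feature_names_out())                            # names of the generated dummy columns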

Handling missing data

  • change all the missing data entries to 'NaN'
df.insulin.replace(0, np.nan, inplace=True)  # 0 encodes a missing insulin reading → convert to NaN

Drop missing data

  • drawback: we will have to drop a lot of data
df = df.dropna()

Imputing missing data: making an educated guess about the missing values

  • (ex) using the mean of the non-missing entries
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(X)
X = imp.transform(X)
  • Imputer(missing_values = 'NaN', strategy='mean', axis=0)
    • missing_values: missing values are represented by NaN
    • axis=0 : we will impute along columns
      • axis = 1: impute along row
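Note: Imputer has since been removed from scikit-learn in favor of SimpleImputer (from sklearn.impute), which always imputes column-wise, so there is no axis argument. The pipeline example below keeps the course's Imputer, but a modern equivalent would be:

from sklearn.impute import SimpleImputer
import numpy as np

imp = SimpleImputer(missing_values=np.nan, strategy='mean')  # fill NaNs with column means
X = imp.fit_transform(X)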

Imputing within a pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
logreg = LogisticRegression()
steps = [('imputation', imp), ('logistic_regression', logreg)]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, y_test)
  • steps: each step is a 2-tuple containing the name you wish to give the step & the estimator itself

Centering & Scaling

  • why scale your data?
    • many models use some form of distance to inform them
    • features on larger scales can unduly influence the model
    • (ex) KNN uses distance explicitly when making predictions
    • we want features to be on a similar scale → normalizing (centering & scaling)
  • ways to normalize data
    • standardization: subtract the mean and divide by the standard deviation
      • all features centered around 0 & have variance 1
    • subtract the minimum & divide by range
      • minimum 0 & maximum 1
    • can normalize so the data ranges [-1,+1]

Scaling in scikit-learn

from sklearn.preprocessing import scale
X_scaled = scale(X)
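For the subtract-the-minimum-and-divide-by-range normalization mentioned above, a minimal sketch with MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()             # default feature_range=(0, 1); pass (-1, 1) for the [-1, +1] case
X_minmax = scaler.fit_transform(X)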

Scaling in a pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=21)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy_score(y_test, y_pred)  # higher accuracy than the unscaled version (below)

knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
knn_unscaled.score(X_test, y_test)

CV & scaling in a pipeline

steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
parameters = {'knn__n_neighbors': np.arange(1, 50)}  # step name + '__' + hyperparameter name
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)