Supervised Learning with scikit-learn



Classification


Supervised Learning

  • Machine Learning: the art and science of giving computers the ability to learn to make decisions from data w/o being explicitly programmed
    • (ex) learning to predict whether an email is spam, clustering Wikipedia into different categories, etc.
  • Unsupervised learning: uses unlabeled data
    • uncovering hidden patterns from unlabeled data
    • (ex) clustering: grouping customers into distinct categories
  • Reinforcement learning
    • software agents interact with an environment
      • learn how to optimize their behavior
      • given a system of rewards & punishments
      • draws inspiration from behavioral psychology
  • Supervised learning: uses labeled data
    • predictor variables/features and a target variable
    • aim: predict the target variable, given the predictor variables
      • classification: target variable consists of categories
      • regression: target variable is continuous
    • applications
      • automate time-consuming or expensive manual tasks
      • make predictions about the future
      • need labeled data

Exploratory data analysis

The Iris dataset in scikit-learn

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
iris = datasets.load_iris()

type(iris) #sklearn.datasets.base.Bunch
print(iris.keys()) 
type(iris.data), type(iris.target)
iris.data.shape
iris.target_names
x = iris.data
y = iris.target
df = pd.DataFrame(x, columns=iris.feature_names)

Visual EDA

_ = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8],
                               s=150, marker='D')
  • c: color (data points in the figure will be colored by this value)
  • figsize: size of figure

Result

(figure: scatter matrix of the four Iris features, points colored by species)

  • diagonal: histograms of the features corresponding to row & column
  • off-diagonal: scatter plots of the column feature vs. row feature colored by target variable

The Classification Challenge

K-Nearest Neighbors (KNN)

  • predicts the label of a data point by looking at the 'k' closest labeled data points
    • the data points vote on what label the unlabeled point should have

Scikit-learn fit and predict

  • all ML models implemented as Python classes
    • they implement the algorithms for learning & predicting
    • store the information learned from the data
  • training a model on the data = 'fitting' a model to the data
    • .fit() method
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris['data'], iris['target'])
  • data should be a NumPy array or pandas DataFrame
  • features should be continuous values (not categories)
  • there should be no missing values in the data
  • each column is a feature & each row is a data point
  • in the .fit() method, the first argument should be the features & the second the target variable
x_new = np.array([[5.6, 2.5, 3.1, 2.6], [5.7, 2.6, 3.1, 2.6],
                  [1.5, 2.7, 4.1, 2.7]])
prediction = knn.predict(x_new)
  • prediction: array of 3 labels, one prediction for each observation in x_new

Measuring model performance

  • accuracy is a commonly used metric in classification
    • accuracy = fraction of correct predictions

Procedure

  1. split data into training & test set
  2. fit/train the classifier on the training set
  3. make predictions on test set
  4. compare predictions with the known labels
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=21, stratify=y)
  • train_test_split()
    • 1st argument: feature data, 2nd argument: targets or labels
    • test_size: proportion of the original data to be used for the test set
    • random_state: sets a seed for the random number generator that splits the data into train & test
    • stratify: pass the label array (y) so the class proportions in the train & test sets match those of the original dataset
    • returns 4 arrays: training data, test data, training labels, test labels
    • by default, test data: 25%, training data: 75%
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)
knn.score(X_test, y_test)

Model Complexity

  • larger k = smoother decision boundary = less complex model
  • complex models run the risk of being sensitive to noise in the data, rather than reflecting the general trend → overfitting
  • too large a k → a boundary so smooth it misses real structure → underfitting (see the sketch below)
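A minimal sketch of this trade-off, reusing the X_train/X_test split from above; the k range and variable names are my own choices, not from the course:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

neighbors = np.arange(1, 26)              # candidate k values (arbitrary range)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)   # accuracy on seen data
    test_accuracy[i] = knn.score(X_test, y_test)      # accuracy on unseen data
plt.plot(neighbors, train_accuracy, label='Training accuracy')
plt.plot(neighbors, test_accuracy, label='Test accuracy')
plt.xlabel('n_neighbors (k)')
plt.ylabel('Accuracy')
plt.legend()
plt.show()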

Regression


Introduction to Regression

  • target value: continuous value
boston = pd.read_csv('boston.csv')
X = boston.drop('MEDV', axis=1).values #drop the target column
y = boston['MEDV'].values

#predicting house value from a single feature
X_rooms = X[:, 5]
y = y.reshape(-1, 1)
X_rooms = X_rooms.reshape(-1, 1)

Fitting a regression model

import numpy as np
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_rooms, y)
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1, 1)
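prediction_space is presumably meant for plotting the fitted line; a minimal sketch (the axis labels are my guesses for the Boston data, not from the original):

plt.scatter(X_rooms, y, color='blue')                      # raw data points
plt.plot(prediction_space, reg.predict(prediction_space),
         color='black', linewidth=3)                       # fitted regression line
plt.xlabel('Number of rooms')
plt.ylabel('Value of house / 1000 ($)')
plt.show()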

Basics of Linear Regression

  • define an error function for any given line & choose the line that minimizes the error function
    • aka loss function (cost function)
  • we minimize the sum of squares of the residuals
    • Ordinary least squares (OLS): minimize sum of squares of residuals
  • Linear regression higher dimensions
    • must specify a coefficient aᵢ for each feature xᵢ & an intercept b
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
reg_all.score(X_test, y_test)

Cross-validation

  • cross-validation motivation
    • model performance is dependent on way the data is split
    • not representative of model's ability to generalize
    • so, we use cross-validation to avoid the problem of the chosen metric being dependent on the train test split
  • cross-validation basics
    • split the dataset into 5 groups (folds)
    • hold out the first fold as a test set & fit the model on the remaining 4 folds
    • predict on the test
    • compute the metric of interest
    • repeat for the second fold, third fold.. etc.
  • interpreting cross-validation
    • as a result of 5-fold cross-validation, we get 5 values of R-squared, from which we compute statistics of interest (mean, median, 95% confidence intervals)
  • cross-validation & model performance
    • split into five folds - 5-fold cross validation
    • split into 10 folds - 10-fold cross validation
    • split into k folds - k-fold cross validation (CV)
    • trade-off of using more folds: more computationally expensive
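The fold mechanics described above can be written out explicitly with scikit-learn's KFold; a minimal sketch, assuming the Boston X and y arrays from the regression section:

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    reg = LinearRegression()
    reg.fit(X[train_idx], y[train_idx])                 # fit on the 4 remaining folds
    scores.append(reg.score(X[test_idx], y[test_idx]))  # R-squared on the held-out fold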

Cross-validation in scikit-learn

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv = 5)
  • cross_val_score(regressor, feature data, target data, cv = #)
    • cv: specifies number of folds
    • returns an array of cross-validation scores
      • length of array = # of folds utilized
      • score is R-squared (by default)
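From cv_results, the statistics of interest mentioned earlier can be computed directly (the "95% interval" here is just the percentile spread of the 5 fold scores):

import numpy as np
print(np.mean(cv_results))                     # mean R-squared across folds
print(np.median(cv_results))                   # median
print(np.percentile(cv_results, [2.5, 97.5]))  # 95% interval of the fold scores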

Regularized regression

  • why regularize?
    • fitting a linear regression in a high-dimensional space can produce large coefficients, which may lead to overfitting
    • we should penalize large coefficients —> Regularization

Ridge regression

  • loss function = OLS loss function + α × Σᵢ aᵢ²
    • models are penalized for coefficients w/ a large magnitude
    • alpha: parameter we need to choose
      • controls complexity
      • alpha = 0: OLS (possibly overfitting)
      • high alpha: large coefficients are significantly penalized (underfitting)
from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train, y_train)
ridge_pred=ridge.predict(X_test)
ridge.score(X_test, y_test)
  • Ridge(alpha = #, normalize = True/False)
    • alpha: alpha value
    • normalize = True: puts all variables on the same scale (removed in newer scikit-learn versions; scale the data separately, e.g. with StandardScaler)

Lasso regression

  • loss function = OLS loss function + α × Σᵢ |aᵢ|
  • can be used to select important features of a dataset
    • shrinks the coefficients of less important features to 0
    • features with non-zero coefficients are effectively selected by the algorithm
from sklearn.linear_model import Lasso
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
lasso = Lasso(alpha=0.1, normalize=True)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
lasso.score(X_test, y_test)

Feature selection of Lasso regression

from sklearn.linear_model import Lasso
names = boston.drop('MEDV', axis=1).columns
lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation=60)
_ = plt.ylabel('Coefficients')
plt.show()

(figure: lasso coefficient plotted for each Boston feature)

Fine-tuning your model


How good is your model?

  • accuracy may not be the best measure
  • ways to diagnose classification predictions
    • Confusion matrix: 2-by-2 matrix that summarizes predictive performance (given a binary classifier)
      • top left & bottom right are correctly labeled (True)
      • class of interest: positive class (i.e. spam)
      • accuracy = (sum of the diagonal) / (total sum of the matrix)
    • Metrics from the confusion matrix
      • precision = (number of true positives) / (total number of true positives and false positives)
        • aka positive predictive value (PPV)
        • high precision: low false positive rate
          • (ex) not many real emails were predicted as spam
      • recall = (number of true positives) / (total number of true positives and false negatives)
        • aka sensitivity, hit rate, true positive rate
        • high recall: classifier predicted most positives correctly
      • F1 score = 2 × (precision × recall) / (precision + recall)
        • aka harmonic mean of precision and recall
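A tiny worked example of these formulas with made-up confusion-matrix counts (tp/fp/fn/tn are hypothetical, not from the course):

tp, fp, fn, tn = 30, 10, 5, 55                      # hypothetical confusion-matrix cells
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.85
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # ~0.857
f1 = 2 * precision * recall / (precision + recall)  # ~0.800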

Confusion matrix in scikit-learn

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
knn = KNeighborsClassifier(n_neighbors=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state = 42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
  • classification_report(y_test, y_pred)
    • 1st argument: true label, 2nd argument: prediction

Logistic regression and the ROC curve

  • Logistic regression for binary classification
    • logistic regression outputs probabilities
      • given the features, log reg outputs a probability p with respect to the target variable (the probability that the label is 1)
      • p > 0.5 → data labeled as 1
      • p < 0.5 → data labeled as 0
    • log reg produces a linear decision boundary

(figure: linear decision boundary produced by logistic regression)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state = 42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
  • by default, logistic regression threshold = 0.5
  • thresholds affect true & false positive rates
    • threshold == 0: model predicts 1 for all data
      • true positive rate == false positive rate == 1
    • threshold == 1: model predicts 0 for all data
      • both true & false positive rates are 0
    • threshold in between 0 & 1: series of different false positive & true positive rates
  • Receiver Operating Characteristic (ROC) curve: the set of points we get when trying all possible thresholds
from sklearn.metrics import roc_curve
y_pred_prob = logreg.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()
  • roc_curve(y_test, y_pred_prob)
    • 1st argument: actual labels, 2nd argument: predicted probabilities
    • fpr: false positive rate, tpr: true positive rate
  • logreg.predict_proba(X_test)[:,1]
    • returns array with 2 columns
      • column 0 holds the probability of label 0; column 1 holds the probability of label 1
    • we choose the second column (index 1)
      • the predicted probability that each sample's label is 1

(figure: ROC curve for the logistic regression classifier)

Area under the ROC curve (AUC)

  • larger area under the ROC curve —> better model
from sklearn.metrics import roc_auc_score
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state = 42)
logreg.fit(X_train, y_train)
y_pred_prob = logreg.predict_proba(X_test)[:,1]
roc_auc_score(y_test, y_pred_prob)

AUC using cross-validation

from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')

Hyperparameter tuning

  • linear regression: choosing parameters
  • Ridge/lasso regression: choosing alpha
  • kNN: choosing n_neighbors

Hyperparameters: parameters that cannot be explicitly learned by fitting the model

  • (ex) alpha, n_neighbors, etc.
  • choosing the correct hyperparameter = hyperparameter tuning
    • essential to use cross-validation
  • Grid search cross-validation

(figure: grid of candidate hyperparameter values, one CV score per cell)

  1. try every combination of parameters in the grid
  2. fill up the grid
  3. choose the combination with best performance
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': np.arange(1, 50)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv = 5)
knn_cv.fit(X, y)
knn_cv.best_params_ #best parameter
knn_cv.best_score_ #best performance

Hold-out set for final evaluation

  • how well can the model perform on never-before-seen data?
    • using all data for cross-validation is not ideal
  1. split data into training & hold-out set at the beginning
  2. perform grid search cross-validation on training set
  3. choose best hyperparameters & evaluate on hold-out set
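A minimal sketch of these three steps, reusing the kNN grid from the previous section (variable names like X_hold are my own):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# 1. hold out data the model never sees during tuning
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=42)
# 2. grid search cross-validation on the training set only
knn_cv = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': np.arange(1, 50)}, cv=5)
knn_cv.fit(X_train, y_train)
# 3. evaluate the best hyperparameters on the hold-out set
print(knn_cv.best_params_, knn_cv.score(X_hold, y_hold))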

Preprocessing and pipelines


  • dealing with categorical features
    • scikit-learn does not accept categorical features by default
    • we need to encode categorical features numerically → dummy variables
      • 0: observation was not the category
      • 1: observation was the category
  • Dummy variables

(table: the 'origin' column encoded as one dummy column per category)

  • Dealing with categorical features in Python
    • scikit-learn: OneHotEncoder()
    • pandas: get_dummies()
import pandas as pd
df = pd.read_csv('auto.csv')
df_origin = pd.get_dummies(df)  # one dummy column per category of 'origin'

# the two lines below are alternatives: either let pandas drop the first dummy,
# or drop one redundant dummy column manually
df_origin = pd.get_dummies(df, drop_first=True)
df_origin = df_origin.drop('origin_Asia', axis=1)
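The notes mention scikit-learn's OneHotEncoder() but only show get_dummies(); a minimal sketch of the scikit-learn route (exact parameter and method names vary slightly across scikit-learn versions):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop='first')                             # drop='first' mirrors drop_first=True
origin_encoded = ohe.fit_transform(df[['origin']]).toarray()  # expects a 2-D input
print(ohe.get_feature_names_out())                            # names of the generated dummy columns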

Handling missing data

  • change all the missing data entries to 'NaN'
df.insulin.replace(0, np.nan, inplace=True)  # 0 encodes a missing insulin reading → convert to NaN

Drop missing data

  • drawback: we will have to drop a lot of data
df = df.dropna()

Imputing missing data: making an educated guess about the missing values

  • (ex) using the mean of the non-missing entries
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(X)
X = imp.transform(X)
  • Imputer(missing_values = 'NaN', strategy='mean', axis=0)
    • missing_values: missing values are represented by NaN
    • axis=0 : we will impute along columns
      • axis = 1: impute along row
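Note: Imputer has since been removed from scikit-learn in favor of SimpleImputer (from sklearn.impute), which always imputes column-wise, so there is no axis argument. The pipeline example below keeps the course's Imputer, but a modern equivalent would be:

from sklearn.impute import SimpleImputer
import numpy as np

imp = SimpleImputer(missing_values=np.nan, strategy='mean')  # fill NaNs with column means
X = imp.fit_transform(X)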

Imputing within a pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
logreg = LogisticRegression()
steps = [('imputation', imp), ('logistic_regression', logreg)]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, y_test)
  • steps: each step is a 2-tuple containing the name you wish to give the step & the estimator itself

Centering & Scaling

  • why scale your data?
    • many models use some form of distance to inform them
    • features on larger scales can unduly influence the model
    • (ex) KNN uses distance explicitly when making predictions
    • we want features to be on a similar scale → normalizing (centering & scaling)
  • ways to normalize data
    • standardization: subtract the mean and divide by the standard deviation
      • all features centered around 0 & have variance 1
    • subtract the minimum & divide by range
      • minimum 0 & maximum 1
    • can normalize so the data ranges [-1,+1]

Scaling in scikit-learn

from sklearn.preprocessing import scale
X_scaled = scale(X)
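For the subtract-the-minimum-and-divide-by-range normalization mentioned above, a minimal sketch with MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()             # default feature_range=(0, 1); pass (-1, 1) for the [-1, +1] case
X_minmax = scaler.fit_transform(X)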

Scaling in a pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=21)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy_score(y_test, y_pred)  # higher accuracy than the unscaled version (below)

knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
knn_unscaled.score(X_test, y_test)

CV & scaling in a pipeline

steps = [('scaler', StandardScaler()), ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
parameters = {'knn__n_neighbors': np.arange(1, 50)}  # step name + '__' + hyperparameter name
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)