Linear Classifiers in Python


Applying logistic regression and SVM


Scikit-learn refresher

import sklearn.datasets
newsgroups = sklearn.datasets.fetch_20newsgroups_vectorized()
X, y = newsgroups.data, newsgroups.target

X.shape
y.shape
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(X, y)
y_pred = knn.predict(X)

Model evaluation

knn.score(X, y) #not meaningful, have to look at prediction on unseen data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

Applying logistic regression and SVM

Using LogisticRegression

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.predict(X_test)
lr.score(X_test, y_test)
lr.predict_proba(X_train[:1])

LinearSVC

import sklearn.datasets
wine = sklearn.datasets.load_wine()
from sklearn.svm import LinearSVC
svm = LinearSVC()
svm.fit(wine.data, wine.target)
svm.score(wine.data, wine.target)

SVC (fits non-linear decision boundaries by default, via the RBF kernel)

import sklearn.datasets
wine = sklearn.datasets.load_wine()
from sklearn.svm import SVC
svm = SVC() #default hyperparameters
svm.fit(wine.data, wine.target)
svm.score(wine.data, wine.target) #score: 1 (can be overfitting)
  • more complex models like nonlinear SVMs carry a higher risk of overfitting

Complexity review

  • underfitting: model is too simple, low training accuracy
  • overfitting: model is too complex, low test accuracy
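
A quick way to check which regime a model is in (my own sketch, not from the course) is to compare training and test accuracy on a held-out split: a large gap suggests overfitting, low accuracy on both suggests underfitting.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

wine = load_wine()
Xw_train, Xw_test, yw_train, yw_test = train_test_split(
    wine.data, wine.target, random_state=0)

svm = SVC()
svm.fit(Xw_train, yw_train)
print(svm.score(Xw_train, yw_train)) #training accuracy
print(svm.score(Xw_test, yw_test))   #test accuracy; a large gap points to overfitting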

Linear decision boundaries

Decision boundary: tells us what class our classifier will predict for any value of x

  • classifier predicts the blue class in the blue shaded area
    • blue shaded area: feature 2 is small
  • classifier predicts the red class in the red shaded area
    • red shaded area: feature 2 is large
  • decision boundary: dividing line between the two regions
    • line can be in any orientation
    • in this specific case the boundary happens to be horizontal; it is linear because it is a straight line
  • in basic forms, logistic regression & SVMs are linear classifiers
    • they learn linear decision boundaries
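
To make this concrete, here is a small sketch of my own (not the course's figure) that draws the linear decision boundary a logistic regression model learns on two wine features:

import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
from sklearn.linear_model import LogisticRegression

wine = sklearn.datasets.load_wine()
X2 = wine.data[:, :2]                 #two features so the boundary can be drawn
y2 = (wine.target == 0).astype(int)   #binary problem: class 0 vs. the rest

lr = LogisticRegression()
lr.fit(X2, y2)

#evaluate predictions on a grid and shade the two regions
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), 200),
                     np.linspace(X2[:, 1].min(), X2[:, 1].max(), 200))
zz = lr.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3)   #the straight border between regions is the decision boundary
plt.scatter(X2[:, 0], X2[:, 1], c=y2)
plt.show()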

Vocabulary

  • classification: supervised learning when the y-values are categories
    • in contrast w/ regression (predicting continuous values)
  • decision boundary: the surface separating different predicted classes
  • linear classifier: a classifier that learns linear decision boundaries
    • (ex) logistic regression, linear SVM
  • linearly separable: a data set that can be perfectly explained (separated) by a linear classifier

  • left figure: no single line that separates the red and blue examples
  • right figure: we could divide 2 classes w/ a straight line → linearly separable
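
As a toy illustration of my own (not from the course), a linear classifier reaches 100% training accuracy on a linearly separable data set:

import numpy as np
from sklearn.svm import LinearSVC

#two clusters that a straight line can separate perfectly
X_sep = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]])
y_sep = np.array([0, 0, 0, 1, 1, 1])

clf = LinearSVC()
clf.fit(X_sep, y_sep)
print(clf.score(X_sep, y_sep)) #1.0 -> the data are linearly separable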

Loss Functions


Linear classifiers: the coefficients

Dot product

x = np.arange(3) #array([0, 1, 2])
y = np.arange(3, 6) #array([3, 4, 5])
x*y #[(0*3), (1*4),(2*5)] -> array([0, 4, 10])
np.sum(x*y) #0+4+10 = 14
x@y #same as above = 14

Linear classifier prediction

  • raw model output = coefficients x features + intercept
  • linear classifier prediction: compute raw model output, check the sign
    • if positive, predict one class
    • if negative, predict the other class
  • this is the same for logistic regression & linear SVM
    • .fit() is different but .predict() is the same
    • differences in .fit() relate to loss functions
lr = LogisticRegression()
lr.fit(X, y)
lr.predict(X)[10] #0
lr.predict(X)[20] #1
lr.coef_ @ X[10] + lr.intercept_ #raw model output -> array([-33.78572166]) -> negative value -> 0
lr.coef_ @ X[20] + lr.intercept_ # -> array([0.08050621]) -> positive value -> 1

What is a loss function?

Least squares: the squared loss

  • scikit-learn’s LinearRegression minimizes a loss:

$$\sum_{i=1}^{n} \left(\textrm{true } i\textrm{th target value} - \textrm{predicted } i\textrm{th target value}\right)^2$$

  • minimizes sum of squares of errors made on training set
  • error is defined as the difference b/w the true target value & the predicted target value
  • jiggle the coefficients (parameters) around until the error term (loss function) is as small as possible
    • i.e., the loss is minimized with respect to the coefficients/parameters
  • the loss function is a penalty score that tells us how well or badly the model is doing on the training data
  • think of the “fit” function as running code that minimizes the loss
  • scikit-learn model.score() isn’t necessarily the loss function
    • could be, but not guaranteed
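
As a rough numeric illustration (my own, not from the course), the squared loss for a handful of predictions can be computed directly:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5,  0.0, 2.0])

squared_loss = np.sum((y_true - y_pred) ** 2) #0.25 + 0.25 + 0.0 = 0.5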

Classification errors: the 0-1 loss

  • Squared loss is not appropriate for classification problems
    • b/c y-values are categories, not numbers
  • a natural loss for classification problem: number of errors
  • 0-1 loss:
    • 0 for a correct prediction
    • 1 for incorrect prediction
  • by summing this function over all training examples, we get the number of mistakes we’ve made on the training set
    • since we add 1 to the total for each mistake
  • but the loss is hard to minimize!
    • thus LR & SVMs don’t use it
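
A small sketch of my own showing the 0-1 loss as a count of mistakes:

import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

zero_one_loss = np.sum(y_true != y_pred) #number of mistakes: 2
error_rate = np.mean(y_true != y_pred)   #fraction of mistakes: 0.4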

Minimizing a loss

from scipy.optimize import minimize
minimize(np.square, 0).x #result: 0
  • minimize(function, initial guess).x
    • 1st: function
    • 2nd: initial guess
    • .x : grab the input value that makes the function as small as possible
    • result is 0 for the above code b/c the function is minimized when x = 0
      • the square of a number can only be zero or more
        • smallest possible value is attained when x = 0
minimize(np.square, 2).x #array([-1.88846401e-08])
  • the very small number is normal for numerical optimization:
    • we don’t expect exactly the right answer, but something very close
  • inputs: model coefficients
  • to answer the question: “what values of the model coefficients make my squared error as small as possible?”
    • what linear regression is doing
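
Putting the pieces together, here is a sketch on toy data of my own (not the course's exercise): minimizing the squared loss over the coefficients recovers essentially the same coefficients that LinearRegression computes.

import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def squared_loss(w):
    #sum of squared errors for coefficient vector w (intercept omitted for brevity)
    return np.sum((y_toy - X_toy @ w) ** 2)

w_hat = minimize(squared_loss, np.zeros(3)).x
print(w_hat)                                                          #close to [1, -2, 0.5]
print(LinearRegression(fit_intercept=False).fit(X_toy, y_toy).coef_)  #nearly identical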

Loss Function Diagrams

The raw model output

  • Since we predict using the sign of the raw model output, the plot (drawn for a training example whose true label is +1) is divided into 2 halves
    • the left half: raw output is negative → predict the -1 class
      • incorrect predictions for this example
    • the right half: raw output is positive → predict the +1 class
      • correct predictions

0-1 loss diagram

  • By definition of 0-1 loss, incorrect predictions get a penalty of 1 & correct ones get no penalty
  • this picture is the loss for a particular training example
    • to get the whole loss, we need to sum up the contribution from all examples

Linear regression loss diagram

  • squared/quadratic function
  • the raw model output is the prediction
  • intuitively, the loss is higher as the prediction is further away from the true target value (1)
  • problem: the left side behaves sensibly (loss increases as the output moves further below 1), but the right side does not
    • on the right side we predict +1, which is correct, yet the loss still grows as the raw output moves past 1
    • perfectly good models are penalized by this loss
  • we need specialized loss functions for classification

Logistic loss diagram

  • used in logistic regression
  • a smoother version of the 0-1 loss
  • as you move to the right (towards the zone of correct predictions), loss goes down

Hinge loss

  • used in SVMs
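
The shapes of these diagrams can be reproduced with a few lines (my own sketch, not the course's figures): plot each loss against the raw model output for a training example whose true label is +1.

import numpy as np
import matplotlib.pyplot as plt

raw = np.linspace(-3, 3, 300)        #raw model output for a true +1 example
zero_one = (raw <= 0).astype(float)  #1 if the sign is wrong, else 0
logistic = np.log(1 + np.exp(-raw))  #logistic loss (smooth)
hinge = np.maximum(0, 1 - raw)       #hinge loss (used by SVMs)

plt.plot(raw, zero_one, label="0-1 loss")
plt.plot(raw, logistic, label="logistic loss")
plt.plot(raw, hinge, label="hinge loss")
plt.xlabel("raw model output")
plt.ylabel("loss")
plt.legend()
plt.show()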

Logistic Regression


Logistic regression and regularization

  • regularization combats overfitting by making the model coefficients smaller

  • The figure shows the learned coefficients of a logistic regression model w/ default regularization
  • In scikit-learn, the hyperparameter “C” is the inverse of the regularization strength
    • larger C → less regularization
    • smaller C → more regularization

  • orange curve: with smaller value of C
    • more regularization for our logistic regression model
  • regularization makes the coefficients smaller
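
A quick sketch of that comparison (assuming X_train and y_train from the earlier split are available; the variable names are my own):

import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

lr_default = LogisticRegression(C=1).fit(X_train, y_train)      #default regularization
lr_more_reg = LogisticRegression(C=0.01).fit(X_train, y_train)  #smaller C -> more regularization

plt.plot(lr_default.coef_.flatten(), label='C=1')
plt.plot(lr_more_reg.coef_.flatten(), label='C=0.01')  #coefficients shrink toward zero
plt.legend()
plt.show()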

How does regularization affect training accuracy?

lr_weak_reg = LogisticRegression(C=100) #weak regularization
lr_strong_reg = LogisticRegression(C=0.01) #strong regularization

#fit both models
lr_weak_reg.fit(X_train, y_train)
lr_strong_reg.fit(X_train, y_train)

#compute training accuracy
lr_weak_reg.score(X_train, y_train) #weak regularization -> higher training accuracy
lr_strong_reg.score(X_train, y_train)
  • model w/ weak regularization gets a higher training accuracy
  • regularization: an extra term added to the original loss function, which penalizes large values of the coefficients

$$\textrm{regularized loss} = \textrm{original loss} + \textrm{large coefficient penalty}$$

  • more regularization → lower training accuracy
  • w/o regularization, we maximize the training accuracy
    • do better on the metric
  • when we add regularization, we modify the loss function to penalize large coefficients, which distracts from the goal of optimizing accuracy
  • more regularization (smaller C)
    → more deviation from goal of maximizing training accuracy
    → lower training accuracy

How does regularization affect test accuracy?

lr_weak_reg.score(X_test, y_test) #0.86
lr_strong_reg.score(X_test, y_test) #0.88
  • more regularization reduces training accuracy but IMPROVES test accuracy
  • not having access to a particular feature is equivalent to setting the corresponding coefficient to zero
  • regularizing (making your coefficient smaller) is like a compromise b/w not using the feature at all (setting the coefficient to zero) & fully using it (the un-regularized coefficient value)
  • using a feature too heavily → overfitting
    • regularization lessens overfitting

L1 vs. L2 regularization

  • Lasso: linear regression w/ L1 regularization
  • Ridge: linear regression w/ L2 regularization
  • for other models like logistic regression we just say L1, L2, etc.
  • both help reduce overfitting
  • L1 performs feature selection
lr_L1 = LogisticRegression(penalty='l1', solver='liblinear') #L1 needs a solver that supports it (e.g. liblinear or saga)
lr_L2 = LogisticRegression() #penalty='l2' by default

lr_L1.fit(X_train, y_train)
lr_L2.fit(X_train, y_train)

plt.plot(lr_L1.coef_.flatten())
plt.plot(lr_L2.coef_.flatten())

  • L1 regularization: sets many of the coefficients to exactly zero
    • i.e., it ignores those features entirely
    • it performed feature selection for us
  • L2 regularization: shrinks the coefficients to be smaller
    • analogous to what happens w/ Lasso & Ridge regression
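
To see the feature-selection effect numerically (assuming lr_L1 and lr_L2 from above have been fit):

import numpy as np
print(np.sum(lr_L1.coef_ == 0)) #many coefficients are exactly zero
print(np.sum(lr_L2.coef_ == 0)) #typically few or none are exactly zero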