[ML&DL] 8. Support Vector Machines

KBC · December 14, 2024

Support Vector Machines

  • Here we approach the two-class classification problem in a direct way:

    We try to find a plane that separates the classes in feature space

  • If we cannot, we get creative in two ways:
    • We soften what we mean by separates, and
    • We enrich and enlarge the feature space so that separation is possible

What is a Hyperplane?

  • A hyperplane in $p$ dimensions is a flat affine subspace of dimension $p-1$
  • In general the equation for a hyperplane has the form
    $$\beta_0+\beta_1X_1+\cdots+\beta_pX_p=0$$
  • In $p=2$ dimensions a hyperplane is a line
  • If $\beta_0=0$, the hyperplane goes through the origin, otherwise not
  • The vector $\beta=(\beta_1,\beta_2,\dots,\beta_p)$ is called the normal vector - it points in a direction orthogonal to the surface of the hyperplane

Separating Hyperplanes

  • If $f(X)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$, then $f(X)>0$ for points on one side of the hyperplane, and $f(X)<0$ for points on the other
  • If we code the colored points as $Y_i=+1$ for blue, say, and $Y_i=-1$ for mauve, then if $Y_i\cdot f(X_i)>0$ for all $i$, $f(X)=0$ defines a separating hyperplane
  • The distance from a point $X$ to the hyperplane is
    $$d=\frac{|\beta_0+\beta_1X_1+\cdots+\beta_pX_p|}{\sqrt{\beta_1^2+\cdots+\beta_p^2}}$$
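
As a quick numerical check (not part of the lecture slides), the sketch below evaluates $f(X)$ with NumPy, reads off which side of the hyperplane each point falls on from the sign of $f(X)$, and computes the distance $d$ above; the coefficients and points are made up for illustration.

```python
# A minimal sketch with made-up coefficients and points (not from the lecture).
import numpy as np

beta0 = -1.0                              # intercept beta_0
beta = np.array([2.0, 3.0])               # normal vector (beta_1, beta_2)

X = np.array([[1.0, 1.0],                 # three arbitrary points in p = 2 dimensions
              [0.0, 0.0],
              [-1.0, 0.5]])

f = beta0 + X @ beta                      # f(X) = beta_0 + beta_1 X_1 + beta_2 X_2
side = np.sign(f)                         # +1 on one side of the hyperplane, -1 on the other
dist = np.abs(f) / np.linalg.norm(beta)   # distance d from each point to f(X) = 0

print(side)                               # -> [ 1. -1. -1.]
print(dist)
```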

Maximal Margin Classifier

  • Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes
  • Constrained optimization problem
    $$\text{maximize }M\\ \text{subject to }\sum^p_{j=1}\beta^2_j=1,\\ y_i(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip})\geq M\quad\text{for all }i=1,\dots,N$$
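
In practice the maximal margin classifier can be approximated with a linear SVM whose cost parameter is very large, so essentially no slack is tolerated. A hedged sketch using scikit-learn on synthetic, well-separated data (note that scikit-learn's `C` is a penalty weight, not the budget $C$ used later in these notes):

```python
# A sketch, assuming scikit-learn; synthetic well-separated data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, size=(20, 2)),    # class +1
               rng.normal(loc=-2.0, size=(20, 2))])   # class -1
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6)   # very large C: (near) hard-margin fit
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # estimated (beta_1, ..., beta_p) and beta_0
print(clf.support_vectors_)         # the few points that determine the margin
```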

Non-separable Data

  • The data are not separable by a linear boundary
  • This is often the case, unless $N<p$

Noisy Data

  • Sometimes the data are separable, but noisy
  • This can lead to a poor solution for the maximal-margin classifier
  • The support vector classifier maximizes a soft margin

Support Vector Classifier

$$\text{maximize }M\ \text{ subject to }\sum^p_{j=1}\beta^2_j=1,\\ y_i(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip})\geq M(1-\epsilon_i),\\ \epsilon_i\geq0,\ \sum^n_{i=1}\epsilon_i\leq C$$
  • $\epsilon_i$ : slack variable
  • $C$ : budget (how much total slack is allowed)
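
A hedged scikit-learn sketch of the soft-margin classifier on noisy synthetic data. Note that scikit-learn parameterizes the problem with a penalty `C` that works in the opposite direction to the budget $C$ above: a small `C` in scikit-learn corresponds to a large budget (more slack allowed).

```python
# A sketch, assuming scikit-learn; labels are made noisy on purpose.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=60) > 0, 1, -1)   # noisy linear rule

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # smaller penalty C (larger budget) -> wider margin and typically more support vectors
    print(C, "support vectors:", len(clf.support_))
```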

Linear boundary can fail

  • Sometimes a linear boundary simply won't work, no matter what value of $C$
  • The example shown in the lecture slides is such a case. What to do?

Feature Expansion

  • Enlarge the space of features by including transformations;
    e.g. $X^2_1,X^3_1,X_1X_2,X_1X^2_2,\dots$ Hence go from a $p$-dimensional space to an $M>p$ dimensional space
  • Fit a support-vector classifier in the enlarged space
  • This results in non-linear decision boundaries in the original space
  • Example: Suppose we use $(X_1,X_2,X^2_1,X_2^2,X_1X_2)$ instead of just $(X_1,X_2)$. Then the decision boundary would be of the form
    $$\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X^2_1+\beta_4X^2_2+\beta_5X_1X_2=0$$
  • This leads to nonlinear decision boundaries in the original space (quadratic conic sections)

Cubic Polynomials

  • Here we use a basis expansion of cubic polynomials
  • From 2 variables to 9
  • The support-vector classifier in the enlarged space solves the problem in the lower-dimensional space
    $$\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X^2_1+\beta_4X^2_2+\beta_5X_1X_2+\beta_6X^3_1+\beta_7X^3_2+\beta_8X_1X^2_2+\beta_9X^2_1X_2=0$$
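
A sketch of explicit feature expansion, covering both the quadratic and the cubic cases above, assuming scikit-learn: `PolynomialFeatures` generates the enlarged feature set (5 features for degree 2, 9 for degree 3) and a *linear* support vector classifier is then fit in that enlarged space. The data are synthetic with a circular class boundary, so a linear fit in the original two variables must fail.

```python
# A sketch, assuming scikit-learn; synthetic data with a circular boundary.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # not linearly separable in (X1, X2)

for degree in [2, 3]:                                     # 5 and 9 expanded features
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),   # enlarge the feature space
        StandardScaler(),
        LinearSVC(C=1.0, max_iter=10_000),
    )
    model.fit(X, y)
    print(degree, model.score(X, y))   # the linear classifier in the enlarged space does well
```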

Nonlinearities and Kernels

  • Polynomials (especially high-dimensional ones) get wild rather fast
  • There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers - through the use of kernels
  • Before we discuss these, we must understand the role of inner products in support-vector classifiers

Inner products and Support Vectors

$$\langle x_i,x_{i'}\rangle=\sum^p_{j=1}x_{ij}x_{i'j}\ :\ \text{inner product between vectors}$$
  • The linear support vector classifier can be represented as
    $$f(x)=\beta_0+\sum^n_{i=1}\alpha_i\langle x,x_i\rangle\ :\ n\text{ parameters}$$
  • To estimate the parameters $\alpha_1,\dots,\alpha_n$ and $\beta_0$, all we need are the $\binom{n}{2}$ inner products $\langle x_i,x_{i'}\rangle$ between all pairs of training observations
  • It turns out that most of the $\hat\alpha_i$ can be zero:
    $$f(x)=\beta_0+\sum_{i\in S}\hat\alpha_i\langle x,x_i\rangle$$
  • $S$ is the support set of indices $i$ such that $\hat\alpha_i>0$
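
The inner-product representation can be checked directly in scikit-learn (a sketch, not from the lecture): after fitting, `support_vectors_` holds the $x_i$ with nonzero $\hat\alpha_i$, `dual_coef_` holds $y_i\hat\alpha_i$ for those points, and `intercept_` holds $\beta_0$, so $f(x)$ can be rebuilt from inner products with the support set alone.

```python
# A sketch, assuming scikit-learn: rebuild f(x) = beta_0 + sum_{i in S} alpha_i <x, x_i>
# from the fitted dual coefficients and support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([0.3, -0.2])
f_manual = clf.intercept_[0] + np.sum(
    clf.dual_coef_[0] * (clf.support_vectors_ @ x_new)   # inner products with the support set only
)
print(np.isclose(f_manual, clf.decision_function([x_new])[0]))   # True
```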

Kernels and Support Vector Machines

  • If we can compute inner-products between observations, we can fit a SV classifier. Can be quite abstract!
  • Some special kernel functions can do this for us
    $$K(x_i,x_{i'})=\left(1+\sum^p_{j=1}x_{ij}x_{i'j}\right)^d$$
    computes the inner products needed for $d$-dimensional polynomials - $\binom{p+d}{d}$ basis functions

    Try it for $p=2$ and $d=2$
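
    Working this out for $p=2$ and $d=2$ (a quick check, not spelled out in the slides):
    $$K(x,x')=\left(1+x_1x'_1+x_2x'_2\right)^2=1+2x_1x'_1+2x_2x'_2+x_1^2x'^2_1+x_2^2x'^2_2+2x_1x_2\,x'_1x'_2$$
    which is exactly $\langle\phi(x),\phi(x')\rangle$ for the feature map $\phi(x)=(1,\sqrt2\,x_1,\sqrt2\,x_2,x_1^2,x_2^2,\sqrt2\,x_1x_2)$, i.e. $\binom{p+d}{d}=\binom{4}{2}=6$ basis functions.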

  • The solution has the form
    $$f(x)=\beta_0+\sum_{i\in S}\hat\alpha_i K(x,x_i)$$

Radial Kernel

$$K(x_i,x_{i'})=\exp\left(-\gamma\sum^p_{j=1}\left(x_{ij}-x_{i'j}\right)^2\right)$$

$$f(x)=\beta_0+\sum_{i\in S}\hat\alpha_i K(x,x_i)$$
  • Implicit feature space; very high dimensional
  • Controls variance by squashing down most dimensions severely
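
A hedged scikit-learn sketch of the radial-kernel SVM; `gamma` is the $\gamma$ in the kernel above, and larger values give a more local, higher-variance fit.

```python
# A sketch, assuming scikit-learn; synthetic data with a nonlinear boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

for gamma in [0.1, 1.0, 10.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    # larger gamma -> kernel decays faster -> wigglier, more local decision boundary
    print(gamma, "training accuracy:", clf.score(X, y))
```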

Example : Heart Data

  • The ROC curve is obtained by changing the threshold $0$ to threshold $t$ in $\hat f(X)>t$, and recording false positive and true positive rates as $t$ varies
  • Here we see ROC curves on the training data
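
A sketch of the ROC construction with scikit-learn; the Heart data themselves are not reproduced here, so the example uses synthetic stand-in data, and `roc_curve` performs the threshold sweep over the SVM scores $\hat f(X)$.

```python
# A sketch, assuming scikit-learn; synthetic stand-in for the Heart data.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300) > 0, 1, -1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5).fit(X_tr, y_tr)

scores = clf.decision_function(X_te)            # f_hat(X): real-valued scores, not probabilities
fpr, tpr, thresholds = roc_curve(y_te, scores)  # false/true positive rates as the threshold t varies
print("AUC:", roc_auc_score(y_te, scores))
```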

SVMs : more than 2 classes?

  • The SVM as defined works for $K=2$ classes
  • What do we do if we have $K>2$ classes?
    • OVA : One versus All.
      Fit $K$ different 2-class SVM classifiers $\hat f_k(x),\ k=1,\dots,K$; each class versus the rest
      Classify $x^*$ to the class for which $\hat f_k(x^*)$ is largest
    • OVO : One versus One.
      Fit all $\binom{K}{2}$ pairwise classifiers $\hat f_{kl}(x)$
      Classify $x^*$ to the class that wins the most pairwise competitions
  • Which to choose? If $K$ is not too large, use OVO
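
A sketch of the two strategies in scikit-learn (not from the lecture): `SVC` handles $K>2$ classes with OVO internally, while `OneVsRestClassifier` wraps a binary SVM into an OVA scheme.

```python
# A sketch, assuming scikit-learn; iris has K = 3 classes.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)   # all K(K-1)/2 pairwise SVMs
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)             # K one-vs-rest SVMs

print(ovo.decision_function(X[:1]).shape)   # (1, 3): one score per class pair (3 pairs when K = 3)
print(ova.decision_function(X[:1]).shape)   # (1, 3): one score per class
print(ovo.predict(X[:1]), ova.predict(X[:1]))
```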

Support Vector vs Logistic Regression?

  • With $f(X)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$ we can rephrase the support-vector classifier optimization as

    $$\min_{\beta_0,\dots,\beta_p}\left\{\sum^n_{i=1}\max\left[0,1-y_if(x_i)\right]+\lambda\sum^p_{j=1}\beta^2_j\right\}$$

  • This has the form loss plus penalty

  • The loss is known as the hinge loss; it is very similar to the loss in logistic regression (negative log-likelihood)
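
To make the comparison concrete (a sketch, not from the slides), the hinge loss and the logistic deviance can be evaluated side by side as functions of the margin $y_if(x_i)$:

```python
# A sketch: hinge loss vs. logistic loss as a function of the margin y * f(x).
import numpy as np

margin = np.linspace(-2, 2, 9)                 # y_i * f(x_i)

hinge = np.maximum(0.0, 1.0 - margin)          # SVM hinge loss: max[0, 1 - y f(x)]
logistic = np.log(1.0 + np.exp(-margin))       # logistic regression negative log-likelihood

for m, h, l in zip(margin, hinge, logistic):
    print(f"y*f(x) = {m:+.1f}   hinge = {h:.3f}   logistic = {l:.3f}")
```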

Which to use : SVM or Logistic Regression

  • When classes are (nearly) separable, SVM does better than LR. So does LDA
  • When not, LR (with ridge penalty) and SVM very similar
  • If you wish to estimate probabilities, LR is the choice
  • For nonlinear boundaries, kernel SVMs are popular
  • Can use kernels with LR and LDA as well, but computations are more expensive

All contents are written based on the GIST Machine Learning & Deep Learning lectures (Instructor: Prof. Sun-dong Kim)
