[ML&DL] 8. Support Vector Machines

KBC · December 14, 2024

Support Vector Machines

  • Here we approach the two-class classification problem in a direct way:

    We try to find a plane that separates the classes in feature space

  • If we cannot, we get creative in two ways:
    • We soften what we mean by separates, and
    • We enrich and enlarge the feature space so that separation is possible

What is a Hyperplane?

  • A hyperplane in $p$ dimensions is a flat affine subspace of dimension $p-1$
  • In general the equation for a hyperplane has the form
    $$\beta_0+\beta_1X_1+\cdots+\beta_pX_p=0$$
  • In $p=2$ dimensions a hyperplane is a line
  • If $\beta_0=0$, the hyperplane goes through the origin, otherwise not
  • The vector $\beta=(\beta_1,\beta_2,\dots,\beta_p)$ is called the normal vector - it points in a direction orthogonal to the surface of the hyperplane

Separating Hyperplanes

  • If $f(X)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$, then $f(X)>0$ for points on one side of the hyperplane, and $f(X)<0$ for points on the other
  • If we code the colored points as $Y_i=+1$ for blue, say, and $Y_i=-1$ for mauve, then if $Y_i\cdot f(X_i)>0$ for all $i$, $f(X)=0$ defines a separating hyperplane
  • The distance from a point $X$ to the hyperplane is
    $$d=\frac{|\beta_0+\beta_1X_1+\cdots+\beta_pX_p|}{\sqrt{\beta_1^2+\cdots+\beta_p^2}}$$
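
As a quick numerical check (not part of the lecture slides), the sketch below evaluates $f(X)$ with NumPy, reads off which side of the hyperplane each point falls on from the sign of $f(X)$, and computes the distance $d$ above; the coefficients and points are made up for illustration.

```python
# A minimal sketch with made-up coefficients and points (not from the lecture).
import numpy as np

beta0 = -1.0                              # intercept beta_0
beta = np.array([2.0, 3.0])               # normal vector (beta_1, beta_2)

X = np.array([[1.0, 1.0],                 # three arbitrary points in p = 2 dimensions
              [0.0, 0.0],
              [-1.0, 0.5]])

f = beta0 + X @ beta                      # f(X) = beta_0 + beta_1 X_1 + beta_2 X_2
side = np.sign(f)                         # +1 on one side of the hyperplane, -1 on the other
dist = np.abs(f) / np.linalg.norm(beta)   # distance d from each point to f(X) = 0

print(side)                               # -> [ 1. -1. -1.]
print(dist)
```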

Maximal Margin Classifier

  • Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes
  • Constrained optimization problem
    $$\text{maximize }M\\ \text{subject to }\sum^p_{j=1}\beta^2_j=1,\\ y_i(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip})\geq M\quad\text{for all }i=1,\dots,N$$
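
In practice the maximal margin classifier can be approximated with a linear SVM whose cost parameter is very large, so essentially no slack is tolerated. A hedged sketch using scikit-learn on synthetic, well-separated data (note that scikit-learn's `C` is a penalty weight, not the budget $C$ used later in these notes):

```python
# A sketch, assuming scikit-learn; synthetic well-separated data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, size=(20, 2)),    # class +1
               rng.normal(loc=-2.0, size=(20, 2))])   # class -1
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6)   # very large C: (near) hard-margin fit
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # estimated (beta_1, ..., beta_p) and beta_0
print(clf.support_vectors_)         # the few points that determine the margin
```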

Non-separable Data

  • The data are not separable by a linear boundary
  • This is often the case, unless $N<p$

Noisy Data

  • Sometimes the data are separable, but noisy
  • This can lead to a poor solution for the maximal-margin classifier
  • The support vector classifier maximizes a soft margin

Support Vector Classifier

$$\text{maximize }M\ \text{ subject to }\sum^p_{j=1}\beta^2_j=1,\\ y_i(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip})\geq M(1-\epsilon_i),\\ \epsilon_i\geq0,\ \sum^n_{i=1}\epsilon_i\leq C$$
  • $\epsilon_i$ : slack variable
  • $C$ : budget (how much total slack is allowed)
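
A hedged scikit-learn sketch of the soft-margin classifier on noisy synthetic data. Note that scikit-learn parameterizes the problem with a penalty `C` that works in the opposite direction to the budget $C$ above: a small `C` in scikit-learn corresponds to a large budget (more slack allowed).

```python
# A sketch, assuming scikit-learn; labels are made noisy on purpose.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=60) > 0, 1, -1)   # noisy linear rule

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # smaller penalty C (larger budget) -> wider margin and typically more support vectors
    print(C, "support vectors:", len(clf.support_))
```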

Linear boundary can fail

  • Sometimes a linear boundary simply won't work, no matter what value of $C$
  • The example shown in the lecture slides is such a case. What to do?

Feature Expansion

  • Enlarge the space of features by including transformations;
    e.g. $X^2_1,X^3_1,X_1X_2,X_1X^2_2,\dots$ Hence go from a $p$-dimensional space to an $M>p$ dimensional space
  • Fit a support-vector classifier in the enlarged space
  • This results in non-linear decision boundaries in the original space
  • Example: Suppose we use $(X_1,X_2,X^2_1,X_2^2,X_1X_2)$ instead of just $(X_1,X_2)$. Then the decision boundary would be of the form
    $$\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X^2_1+\beta_4X^2_2+\beta_5X_1X_2=0$$
  • This leads to nonlinear decision boundaries in the original space (quadratic conic sections)

Cubic Polynomials

  • Here we use a basis expansion of cubic polynomials
  • From 2 variables to 9
  • The support-vector classifier in the enlarged space solves the problem in the lower-dimensional space
    $$\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X^2_1+\beta_4X^2_2+\beta_5X_1X_2+\beta_6X^3_1+\beta_7X^3_2+\beta_8X_1X^2_2+\beta_9X^2_1X_2=0$$
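
A sketch of explicit feature expansion, covering both the quadratic and the cubic cases above, assuming scikit-learn: `PolynomialFeatures` generates the enlarged feature set (5 features for degree 2, 9 for degree 3) and a *linear* support vector classifier is then fit in that enlarged space. The data are synthetic with a circular class boundary, so a linear fit in the original two variables must fail.

```python
# A sketch, assuming scikit-learn; synthetic data with a circular boundary.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # not linearly separable in (X1, X2)

for degree in [2, 3]:                                     # 5 and 9 expanded features
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),   # enlarge the feature space
        StandardScaler(),
        LinearSVC(C=1.0, max_iter=10_000),
    )
    model.fit(X, y)
    print(degree, model.score(X, y))   # the linear classifier in the enlarged space does well
```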

Nonlinearities and Kernels

  • Polynomials (especially high-dimensional ones) get wild rather fast
  • There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers - through the use of kernels
  • Before we discuss these, we must understand the role of inner products in support-vector classifiers

Inner products and Support Vectors

$$\langle x_i,x_{i'}\rangle=\sum^p_{j=1}x_{ij}x_{i'j}\ :\ \text{inner product between vectors}$$
  • The linear support vector classifier can be represented as
    $$f(x)=\beta_0+\sum^n_{i=1}\alpha_i\langle x,x_i\rangle\ :\ n\text{ parameters}$$
  • To estimate the parameters $\alpha_1,\dots,\alpha_n$ and $\beta_0$, all we need are the $\binom{n}{2}$ inner products $\langle x_i,x_{i'}\rangle$ between all pairs of training observations
  • It turns out that most of the $\hat\alpha_i$ can be zero:
    $$f(x)=\beta_0+\sum_{i\in S}\hat\alpha_i\langle x,x_i\rangle$$
  • $S$ is the support set of indices $i$ such that $\hat\alpha_i>0$
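
The inner-product representation can be checked directly in scikit-learn (a sketch, not from the lecture): after fitting, `support_vectors_` holds the $x_i$ with nonzero $\hat\alpha_i$, `dual_coef_` holds $y_i\hat\alpha_i$ for those points, and `intercept_` holds $\beta_0$, so $f(x)$ can be rebuilt from inner products with the support set alone.

```python
# A sketch, assuming scikit-learn: rebuild f(x) = beta_0 + sum_{i in S} alpha_i <x, x_i>
# from the fitted dual coefficients and support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([0.3, -0.2])
f_manual = clf.intercept_[0] + np.sum(
    clf.dual_coef_[0] * (clf.support_vectors_ @ x_new)   # inner products with the support set only
)
print(np.isclose(f_manual, clf.decision_function([x_new])[0]))   # True
```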

Kernels and Support Vector Machines

  • If we can compute inner-products between observations, we can fit a SV classifier. Can be quite abstract!
  • Some special kernel functions can do this for us
    $$K(x_i,x_{i'})=\left(1+\sum^p_{j=1}x_{ij}x_{i'j}\right)^d$$
    computes the inner products needed for $d$-dimensional polynomials - $\binom{p+d}{d}$ basis functions

    Try it for $p=2$ and $d=2$
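
    Working this out for $p=2$ and $d=2$ (a quick check, not spelled out in the slides):
    $$K(x,x')=\left(1+x_1x'_1+x_2x'_2\right)^2=1+2x_1x'_1+2x_2x'_2+x_1^2x'^2_1+x_2^2x'^2_2+2x_1x_2\,x'_1x'_2$$
    which is exactly $\langle\phi(x),\phi(x')\rangle$ for the feature map $\phi(x)=(1,\sqrt2\,x_1,\sqrt2\,x_2,x_1^2,x_2^2,\sqrt2\,x_1x_2)$, i.e. $\binom{p+d}{d}=\binom{4}{2}=6$ basis functions.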

  • The solution has the form
    $$f(x)=\beta_0+\sum_{i\in S}\hat\alpha_i K(x,x_i)$$

Radial Kernel

$$K(x_i,x_{i'})=\exp\left(-\gamma\sum^p_{j=1}\left(x_{ij}-x_{i'j}\right)^2\right)$$

$$f(x)=\beta_0+\sum_{i\in S}\hat\alpha_i K(x,x_i)$$
  • Implicit feature space; very high dimensional
  • Controls variance by squashing down most dimensions severely
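
A hedged scikit-learn sketch of the radial-kernel SVM; `gamma` is the $\gamma$ in the kernel above, and larger values give a more local, higher-variance fit.

```python
# A sketch, assuming scikit-learn; synthetic data with a nonlinear boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

for gamma in [0.1, 1.0, 10.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    # larger gamma -> kernel decays faster -> wigglier, more local decision boundary
    print(gamma, "training accuracy:", clf.score(X, y))
```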

Example : Heart Data

  • The ROC curve is obtained by changing the threshold $0$ to threshold $t$ in $\hat f(X)>t$, and recording false positive and true positive rates as $t$ varies
  • Here we see ROC curves on the training data
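
A sketch of the ROC construction with scikit-learn; the Heart data themselves are not reproduced here, so the example uses synthetic stand-in data, and `roc_curve` performs the threshold sweep over the SVM scores $\hat f(X)$.

```python
# A sketch, assuming scikit-learn; synthetic stand-in for the Heart data.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300) > 0, 1, -1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5).fit(X_tr, y_tr)

scores = clf.decision_function(X_te)            # f_hat(X): real-valued scores, not probabilities
fpr, tpr, thresholds = roc_curve(y_te, scores)  # false/true positive rates as the threshold t varies
print("AUC:", roc_auc_score(y_te, scores))
```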

SVMs : more than 2 classes?

  • The SVM as defined works for $K=2$ classes
  • What do we do if we have $K>2$ classes?
    • OVA : One versus All.
      Fit $K$ different 2-class SVM classifiers $\hat f_k(x),\ k=1,\dots,K$; each class versus the rest
      Classify $x^*$ to the class for which $\hat f_k(x^*)$ is largest
    • OVO : One versus One.
      Fit all $\binom{K}{2}$ pairwise classifiers $\hat f_{kl}(x)$
      Classify $x^*$ to the class that wins the most pairwise competitions
  • Which to choose? If $K$ is not too large, use OVO
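
A sketch of the two strategies in scikit-learn (not from the lecture): `SVC` handles $K>2$ classes with OVO internally, while `OneVsRestClassifier` wraps a binary SVM into an OVA scheme.

```python
# A sketch, assuming scikit-learn; iris has K = 3 classes.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)   # all K(K-1)/2 pairwise SVMs
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)             # K one-vs-rest SVMs

print(ovo.decision_function(X[:1]).shape)   # (1, 3): one score per class pair (3 pairs when K = 3)
print(ova.decision_function(X[:1]).shape)   # (1, 3): one score per class
print(ovo.predict(X[:1]), ova.predict(X[:1]))
```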

Support Vector vs Logistic Regression?

  • With $f(X)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$ we can rephrase the support-vector classifier optimization as

    $$\min_{\beta_0,\dots,\beta_p}\left\{\sum^n_{i=1}\max\left[0,1-y_if(x_i)\right]+\lambda\sum^p_{j=1}\beta^2_j\right\}$$

  • This has the form loss plus penalty

  • The loss is known as the hinge loss; it is very similar to the loss in logistic regression (negative log-likelihood)
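
To make the comparison concrete (a sketch, not from the slides), the hinge loss and the logistic deviance can be evaluated side by side as functions of the margin $y_if(x_i)$:

```python
# A sketch: hinge loss vs. logistic loss as a function of the margin y * f(x).
import numpy as np

margin = np.linspace(-2, 2, 9)                 # y_i * f(x_i)

hinge = np.maximum(0.0, 1.0 - margin)          # SVM hinge loss: max[0, 1 - y f(x)]
logistic = np.log(1.0 + np.exp(-margin))       # logistic regression negative log-likelihood

for m, h, l in zip(margin, hinge, logistic):
    print(f"y*f(x) = {m:+.1f}   hinge = {h:.3f}   logistic = {l:.3f}")
```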

Which to use : SVM or Logistic Regression

  • When classes are (nearly) separable, SVM does better than LR. So does LDA
  • When not, LR (with ridge penalty) and SVM very similar
  • If you wish to estimate probabilities, LR is the choice
  • For nonlinear boundaries, kernel SVMs are popular
  • Can use kernels with LR and LDA as well, but computations are more expensive

All contents are written based on the GIST Machine Learning & Deep Learning lectures (Instructor: Prof. Sun-dong Kim)
