Lecture 07. Kernels

cryptnomy · November 23, 2022

CS229: Machine Learning


Lecture video link: https://youtu.be/8NYoQiRANpg

SVM

  • Optimization problem
  • Representer theorem
  • Kernels
  • Example of kernels

Recap

Optimal margin classifier:

$$\begin{aligned}\min_{w,b}\;\;&\frac{1}{2}\|w\|^2\\\text{s.t.}\;\;&y^{(i)}(w^Tx^{(i)}+b)\ge1.\end{aligned}$$

$$\gamma^{(i)}=\frac{y^{(i)}(w^Tx^{(i)}+b)}{\|w\|}\;\;(\text{geometric margin}).$$

Assume $w=\sum\limits_{i=1}^m \alpha_ix^{(i)}$.

Why?

Intuition #1. Logistic regression

Initialize $\theta = 0$.

Gradient descent:

$$\theta\leftarrow\theta-\alpha\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}\qquad\text{(stochastic)}$$

$$\theta\leftarrow\theta-\alpha\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}\qquad\text{(batch)}$$

Since $\theta$ starts at $0$ and every update adds a multiple of some $x^{(i)}$, $\theta$ always remains a linear combination of the training examples.
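As a quick numerical check of this intuition, here is a minimal sketch (NumPy, made-up data) that runs batch gradient descent for logistic regression starting from $\theta=0$ and verifies that $\theta$ stays in the span of the training examples:

```python
import numpy as np

np.random.seed(0)
m, n = 3, 5                               # 3 examples in R^5, so span{x^(i)} is a proper subspace
X = np.random.randn(m, n)                 # rows are the x^(i) (made-up data)
y = np.array([0.0, 1.0, 1.0])             # labels in {0, 1}

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta = np.zeros(n)                       # initialize theta = 0
lr = 0.1
for _ in range(100):                      # batch gradient descent updates
    theta -= lr * X.T @ (sigmoid(X @ theta) - y)

# theta should lie (numerically) in span{x^(1), ..., x^(m)}:
coeffs, *_ = np.linalg.lstsq(X.T, theta, rcond=None)
print(np.allclose(X.T @ coeffs, theta))   # True
```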

Intuition #2. $w\perp\text{decision boundary}$

For the proof of the representer theorem, refer to the lecture notes.

$$w=\sum_{i=1}^m\alpha_iy^{(i)}x^{(i)}$$

$$\begin{aligned}\min_{w,b}\;\;&\frac{1}{2}\|w\|^2\\\text{s.t.}\;\;&y^{(i)}(w^Tx^{(i)}+b)\ge1.\end{aligned}$$

$$\begin{aligned}&\min\;\frac{1}{2}\|w\|^2\\=&\min\;\frac{1}{2}\left(\sum_{i=1}^m\alpha_iy^{(i)}x^{(i)}\right)^T\left(\sum_{j=1}^m\alpha_jy^{(j)}x^{(j)}\right)\\\Rightarrow&\min\;\sum_i\sum_j\alpha_i\alpha_jy^{(i)}y^{(j)}{x^{(i)}}^Tx^{(j)}\\=&\min\;\sum_i\sum_j\alpha_i\alpha_jy^{(i)}y^{(j)}\left\langle x^{(i)},x^{(j)}\right\rangle\\\text{s.t.}\;\;&y^{(i)}\left(\sum_j\alpha_jy^{(j)}\left\langle x^{(j)},x^{(i)}\right\rangle+b\right)\ge1.\end{aligned}$$

Dual optimization problem

$$\begin{aligned}\max\;&\sum_i\alpha_i-\frac{1}{2}\sum_i\sum_jy^{(i)}y^{(j)}\alpha_i\alpha_j\left\langle x^{(i)},x^{(j)}\right\rangle\\\text{s.t.}\;\;&\alpha_i\ge0,\\&\sum_iy^{(i)}\alpha_i=0.\end{aligned}$$

$$\begin{aligned}h_{w,b}(x)&=g(w^Tx+b)\\&=g\left(\sum_i\alpha_iy^{(i)}\left\langle x^{(i)},x\right\rangle+b\right)\end{aligned}$$

Kernel trick:

1) Write the algorithm in terms of $\left\langle x^{(i)},x^{(j)}\right\rangle$.

2) Let there be a mapping $x\longmapsto\phi(x)$,

ex. $\begin{bmatrix}x_1\\x_2\end{bmatrix}\longmapsto\phi(x)=\begin{bmatrix}x_1\\x_2\\x_1x_2\\x_1^2x_2\\\vdots\end{bmatrix}$.

3) Find a way to compute $K(x,z)=\phi(x)^T\phi(z)$ efficiently.

4) Replace $\left\langle x,z\right\rangle$ in the algorithm with $K(x,z)$ (see the sketch below).
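Putting steps 1)–4) together: a minimal sketch (hypothetical helper names, NumPy) of the kernelized hypothesis from the recap, $h(x)=g\left(\sum_i\alpha_iy^{(i)}K(x^{(i)},x)+b\right)$, where the inner product has simply been replaced by a kernel $K$:

```python
import numpy as np

def predict(x, X_train, y_train, alphas, b, K):
    """Kernelized hypothesis: sign( sum_i alpha_i y^(i) K(x^(i), x) + b )."""
    s = b
    for x_i, y_i, a_i in zip(X_train, y_train, alphas):
        s += a_i * y_i * K(x_i, x)
    return np.sign(s)

# Step 4 in action: any valid kernel can stand in for <x^(i), x>.
linear_K = lambda x, z: x @ z                # K(x, z) = x^T z
poly_K   = lambda x, z: (x @ z + 1.0) ** 2   # K(x, z) = (x^T z + c)^2 with c = 1
```

Here `alphas` and `b` are assumed to come from solving the dual problem above; at prediction time only kernel evaluations are needed, never $\phi(x)$ itself.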

Andrew Ng thinks the “no free lunch theorem” is a fascinating theoretical concept, but he finds it less useful in practice because we have inductive biases that turn out to be useful.

cf. The no free lunch theorem (famous in machine learning and optimization) basically says that, in the worst case, learning algorithms do not work. (ref. https://en.wikipedia.org/wiki/No_free_lunch_theorem)

… but the universe is not that hostile toward us 😁 learning algorithms turn out to work okay in practice.

ex 1.

$$x=\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix}\in\mathbb{R}^n,\quad\phi(x)=\begin{bmatrix}x_1x_1\\x_1x_2\\x_1x_3\\x_2x_1\\x_2x_2\\x_2x_3\\x_3x_1\\x_3x_2\\x_3x_3\end{bmatrix}\in\mathbb{R}^{n^2},\quad\phi(z)=\begin{bmatrix}z_1z_1\\z_1z_2\\z_1z_3\\z_2z_1\\z_2z_2\\z_2z_3\\z_3z_1\\z_3z_2\\z_3z_3\end{bmatrix}\in\mathbb{R}^{n^2}\quad(\text{illustrated for }n=3)$$

Since $\phi(x)$ has $n^2$ elements, computing $\phi(x)$ or $\phi(x)^T\phi(z)$ explicitly takes $O(n^2)$ time.

$K(x,z)=\phi(x)^T\phi(z)=(x^Tz)^2$ $\rightarrow$ $O(n)$ time computation.

$$\begin{aligned}(x^Tz)^2&=\left(\sum_{i=1}^nx_iz_i\right)\left(\sum_{j=1}^nx_jz_j\right)\\&=\sum_{i=1}^n\sum_{j=1}^nx_iz_ix_jz_j\\&=\sum_{i=1}^n\sum_{j=1}^n(x_ix_j)(z_iz_j)\end{aligned}$$

where $x_ix_j$ and $z_iz_j$ are exactly the entries of $\phi(x)$ and $\phi(z)$.
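A quick numerical check of this identity (NumPy, random vectors): build $\phi$ explicitly as the $n^2$ products $x_ix_j$ and compare $\phi(x)^T\phi(z)$ with $(x^Tz)^2$.

```python
import numpy as np

np.random.seed(1)
n = 3
x, z = np.random.randn(n), np.random.randn(n)

phi = lambda v: np.outer(v, v).ravel()   # all n^2 products v_i v_j  (O(n^2))
lhs = phi(x) @ phi(z)                    # phi(x)^T phi(z)
rhs = (x @ z) ** 2                       # (x^T z)^2                 (O(n))
print(np.isclose(lhs, rhs))              # True
```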

ex 2.

$$K(x,z)=(x^Tz+c)^2,\;\;c\in\mathbb{R}$$

where

$$x=\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix}\in\mathbb{R}^n,\quad\phi(x)=\begin{bmatrix}x_1x_1\\x_1x_2\\x_1x_3\\x_2x_1\\x_2x_2\\x_2x_3\\x_3x_1\\x_3x_2\\x_3x_3\\\sqrt{2c}\,x_1\\\sqrt{2c}\,x_2\\\sqrt{2c}\,x_3\\c\end{bmatrix}\in\mathbb{R}^{n^2+n+1},\quad\phi(z)=\begin{bmatrix}z_1z_1\\z_1z_2\\z_1z_3\\z_2z_1\\z_2z_2\\z_2z_3\\z_3z_1\\z_3z_2\\z_3z_3\\\sqrt{2c}\,z_1\\\sqrt{2c}\,z_2\\\sqrt{2c}\,z_3\\c\end{bmatrix}\in\mathbb{R}^{n^2+n+1}.$$
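The same check for $(x^Tz+c)^2$, confirming that the $\sqrt{2c}\,x_i$ entries and the constant $c$ are exactly what make the explicit feature map match the kernel (a NumPy sketch with an arbitrary choice of $c$):

```python
import numpy as np

np.random.seed(2)
n, c = 3, 2.0
x, z = np.random.randn(n), np.random.randn(n)

def phi_c(v):
    # [all v_i v_j] + [sqrt(2c) v_i] + [c]  ->  n^2 + n + 1 entries
    return np.concatenate([np.outer(v, v).ravel(), np.sqrt(2 * c) * v, [c]])

print(np.isclose(phi_c(x) @ phi_c(z), (x @ z + c) ** 2))   # True
```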

ex 3.

$$K(x,z)=(x^Tz+c)^d,\;\;c\in\mathbb{R}$$

$O(n)$ time.

$\phi(x)$ contains all $\binom{n+d}{d}$ monomial features of degree up to $d$.
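To see why this matters, a tiny sketch comparing the $\binom{n+d}{d}$ features that $\phi(x)$ would contain against the $O(n)$ cost of evaluating the kernel directly:

```python
from math import comb

n, d = 100, 5
print(comb(n + d, d))   # 96560646 monomial features of degree <= d in phi(x)
# ...yet K(x, z) = (x^T z + c)^d costs only O(n): one dot product and one power.
```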

Optimal margin classifier + Kernel tricks = SVM

Nice visualization of how SVM works:

SVM with a polynomial Kernel visualization created by Udi Aharoni

Video link: https://youtu.be/3liCbRZPrZA

SVM in short:

Map the data to a much higher-dimensional feature space and find a linear decision boundary, i.e. a hyperplane, in that space. Projected back to the original feature space, this gives a non-linear (or, if you are lucky, linear) decision boundary.

Q. Are the features in the higher-dimensional feature space always linearly separable?

A. So far we are pretending they are, but we will come back and relax that assumption later in this lecture.

How to make kernels

If $x, z$ are “similar,” $K(x,z)=\phi(x)^T\phi(z)$ is “large.”

If $x, z$ are “dissimilar,” $K(x,z)=\phi(x)^T\phi(z)$ is “small.”

(Think about the properties of the inner product.)

$$K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)$$

is called the Gaussian kernel (the most widely used kernel).
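A minimal sketch of the Gaussian kernel that makes the “similar → large, dissimilar → small” behavior concrete (NumPy, made-up points):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, x + 0.1))   # ~0.99  : nearby points  -> K close to 1
print(gaussian_kernel(x, x + 5.0))   # ~1e-11 : distant points -> K close to 0
```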

Let $\{x^{(1)},\cdots,x^{(d)}\}$ be $d$ points and $K\in\mathbb{R}^{d\times d}$ be the kernel matrix where $K_{ij}=K(x^{(i)},x^{(j)})$.

Given any vector $z$,

$$\begin{aligned}z^TKz&=\sum_i\sum_jz_iK_{ij}z_j\\&=\sum_i\sum_jz_i\phi(x^{(i)})^T\phi(x^{(j)})z_j\\&=\sum_i\sum_jz_i\sum_k\left(\phi(x^{(i)})\right)_k\left(\phi(x^{(j)})\right)_kz_j\\&=\sum_k\sum_i\sum_jz_i\left(\phi(x^{(i)})\right)_k\left(\phi(x^{(j)})\right)_kz_j\\&=\sum_k\left(\sum_iz_i\left(\phi(x^{(i)})\right)_k\right)^2\\&\ge0.\end{aligned}$$

$\Rightarrow K\ge0$ (positive semi-definite).
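This is easy to check numerically: build the kernel matrix on a handful of points and confirm that its eigenvalues are non-negative (up to round-off). A small NumPy sketch using the Gaussian kernel above:

```python
import numpy as np

np.random.seed(3)
points = np.random.randn(6, 2)                     # d = 6 arbitrary points in R^2

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

K = np.array([[gaussian_kernel(a, b) for b in points] for a in points])
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))     # True: K is positive semi-definite
```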

Theorem (Mercer)

$K$ is a valid kernel function (i.e., $\exists\,\phi$ s.t. $K(x,z)=\phi(x)^T\phi(z)$) if and only if for any $d$ points $\{x^{(1)},\cdots,x^{(d)}\}$, the corresponding kernel matrix satisfies $K\ge0$.

Linear kernel: $K(x,z)=x^Tz,\;\phi(x)=x.$

Gaussian kernel: $K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right),\;\phi(x)\in\mathbb{R}^\infty.$

Note:

All of the learning algorithms we have seen so far can be written in terms of inner products, so the kernel trick can be applied to them.

e.g. linear regression, logistic regression, everything in the generalized linear model family, the perceptron algorithm, etc.

**$L_1$ norm soft margin SVM:**

$$\begin{aligned}\min_{w,b,\xi}\;\;&\frac{1}{2}\|w\|^2+c\sum_{i=1}^m\xi_i\\\text{s.t.}\;\;&y^{(i)}(w^Tx^{(i)}+b)\ge1-\xi_i,\;\;i=1,\cdots,m,\\&\xi_i\ge0.\end{aligned}$$

Why use the $L_1$ norm soft margin SVM?

Imagine adding one training example labeled “x” that lies close to the training examples labeled “o.” Because the basic optimal margin classifier optimizes the worst-case margin, this single example can cause the decision boundary to swing dramatically → one point can have a huge impact on the decision boundary.

The $L_1$ norm soft margin SVM, however, keeps the decision boundary close to the original one even when there is such an outlier.
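As an illustration (not from the lecture), scikit-learn's `SVC` implements this soft-margin formulation; its parameter `C` plays the role of $c$ above, controlling how heavily each slack $\xi_i$ is penalized. A minimal sketch, assuming scikit-learn is available, that adds one mislabeled outlier and compares a near-hard-margin fit (huge `C`) with a soft-margin fit (moderate `C`):

```python
import numpy as np
from sklearn.svm import SVC

np.random.seed(4)
X = np.vstack([np.random.randn(20, 2) - 2, np.random.randn(20, 2) + 2])
y = np.array([-1] * 20 + [1] * 20)
X = np.vstack([X, [[2.0, 2.0]]])       # one "x"-labeled point deep inside "o" territory
y = np.append(y, -1)

for C in (1e6, 1.0):                   # huge C ~ hard margin, moderate C ~ soft margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.coef_, clf.intercept_)
# With moderate C the boundary stays close to where it was without the outlier;
# with huge C the optimizer works much harder to shrink the outlier's slack,
# pulling the boundary toward it.
```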

$$\begin{aligned}\max\;&\sum_i\alpha_i-\frac{1}{2}\sum_i\sum_jy^{(i)}y^{(j)}\alpha_i\alpha_j\left\langle x^{(i)},x^{(j)}\right\rangle\\\text{s.t.}\;\;&0\le\alpha_i\le c,\\&\sum_iy^{(i)}\alpha_i=0,\;\;i=1,\cdots,m.\end{aligned}$$

As before, any valid kernel can be substituted for the inner product, e.g. $K(x,z)=(x^Tz)^d$ or $K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)$.

Protein sequence classifier

Suppose there are 26 amino acids, labeled A through Z (even though we all know there are actually only 20).

seq ex: BAJTSTAIBAJTAU…

$\phi(x)=?$

Let $\phi(x)$ count the number of occurrences of every possible length-4 block of amino acids in the sequence, so $\phi(x)\in\mathbb{R}^{26^4}$; the kernel $K(x,z)=\phi(x)^T\phi(z)$ can still be computed efficiently without ever building this vector explicitly.
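A sketch of one way such a kernel could be computed (my own illustration, not from the lecture): count the length-4 blocks that actually occur in each sequence and take a sparse inner product of the two count dictionaries, never materializing the $26^4$-dimensional $\phi$.

```python
from collections import Counter

def kmer_counts(seq, k=4):
    """Sparse phi(x): counts of every length-k block that occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_kernel(s1, s2, k=4):
    """K(x, z) = phi(x)^T phi(z), summed over shared k-mers only."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(cnt * c2[kmer] for kmer, cnt in c1.items() if kmer in c2)

print(kmer_kernel("BAJTSTAIBAJTAU", "TAIBAJTSTA"))   # 8 (inner product over the shared 4-mers)
```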
