Lecture 07. Kernels

cryptnomy · November 23, 2022

CS229: Machine Learning


Lecture video link: https://youtu.be/8NYoQiRANpg

SVM

  • Optimization problem
  • Representer theorem
  • Kernels
  • Example of kernels

Recap

Optimal margin classifier:

$$\begin{aligned}\min_{w,b}\;\;&\frac{1}{2}\|w\|^2\\\text{s.t.}\;\;&y^{(i)}(w^Tx^{(i)}+b)\ge1.\end{aligned}$$

$$\gamma^{(i)}=\frac{y^{(i)}(w^Tx^{(i)}+b)}{\|w\|}\;\;(\text{geometric margin}).$$

Assume $w=\sum\limits_{i=1}^m \alpha_ix^{(i)}$.

Why?

Intuition #1. Logistic regression

Initialize $\theta = 0$.

Gradient descent:

$$\theta\leftarrow\theta-\alpha\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}\qquad\text{(stochastic)}$$

$$\theta\leftarrow\theta-\alpha\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}\qquad\text{(batch)}$$

Since $\theta$ starts at $0$ and every update adds a multiple of some $x^{(i)}$, $\theta$ always remains a linear combination of the training examples.
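As a quick numerical check of this intuition, here is a minimal sketch (NumPy, made-up data) that runs batch gradient descent for logistic regression starting from $\theta=0$ and verifies that $\theta$ stays in the span of the training examples:

```python
import numpy as np

np.random.seed(0)
m, n = 3, 5                               # 3 examples in R^5, so span{x^(i)} is a proper subspace
X = np.random.randn(m, n)                 # rows are the x^(i) (made-up data)
y = np.array([0.0, 1.0, 1.0])             # labels in {0, 1}

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta = np.zeros(n)                       # initialize theta = 0
lr = 0.1
for _ in range(100):                      # batch gradient descent updates
    theta -= lr * X.T @ (sigmoid(X @ theta) - y)

# theta should lie (numerically) in span{x^(1), ..., x^(m)}:
coeffs, *_ = np.linalg.lstsq(X.T, theta, rcond=None)
print(np.allclose(X.T @ coeffs, theta))   # True
```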

Intuition #2. $w\perp\text{decision boundary}$

For the proof of the representer theorem, refer to the lecture notes.

$$w=\sum_{i=1}^m\alpha_iy^{(i)}x^{(i)}$$

$$\begin{aligned}\min_{w,b}\;\;&\frac{1}{2}\|w\|^2\\\text{s.t.}\;\;&y^{(i)}(w^Tx^{(i)}+b)\ge1.\end{aligned}$$

$$\begin{aligned}&\min\;\frac{1}{2}\|w\|^2\\=&\min\;\frac{1}{2}\left(\sum_{i=1}^m\alpha_iy^{(i)}x^{(i)}\right)^T\left(\sum_{j=1}^m\alpha_jy^{(j)}x^{(j)}\right)\\\Rightarrow&\min\;\sum_i\sum_j\alpha_i\alpha_jy^{(i)}y^{(j)}{x^{(i)}}^Tx^{(j)}\\=&\min\;\sum_i\sum_j\alpha_i\alpha_jy^{(i)}y^{(j)}\left\langle x^{(i)},x^{(j)}\right\rangle\\\text{s.t.}\;\;&y^{(i)}\left(\sum_j\alpha_jy^{(j)}\left\langle x^{(j)},x^{(i)}\right\rangle+b\right)\ge1.\end{aligned}$$

Dual optimization problem

$$\begin{aligned}\max\;&\sum_i\alpha_i-\frac{1}{2}\sum_i\sum_jy^{(i)}y^{(j)}\alpha_i\alpha_j\left\langle x^{(i)},x^{(j)}\right\rangle\\\text{s.t.}\;\;&\alpha_i\ge0,\\&\sum_iy^{(i)}\alpha_i=0.\end{aligned}$$

$$\begin{aligned}h_{w,b}(x)&=g(w^Tx+b)\\&=g\left(\sum_i\alpha_iy^{(i)}\left\langle x^{(i)},x\right\rangle+b\right)\end{aligned}$$

Kernel trick:

1) Write the algorithm in terms of $\left\langle x^{(i)},x^{(j)}\right\rangle$.

2) Let there be a mapping $x\longmapsto\phi(x)$,

ex. $\begin{bmatrix}x_1\\x_2\end{bmatrix}\longmapsto\phi(x)=\begin{bmatrix}x_1\\x_2\\x_1x_2\\x_1^2x_2\\\vdots\end{bmatrix}$.

3) Find a way to compute $K(x,z)=\phi(x)^T\phi(z)$ efficiently.

4) Replace $\left\langle x,z\right\rangle$ in the algorithm with $K(x,z)$ (see the sketch below).
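Putting steps 1)–4) together: a minimal sketch (hypothetical helper names, NumPy) of the kernelized hypothesis from the recap, $h(x)=g\left(\sum_i\alpha_iy^{(i)}K(x^{(i)},x)+b\right)$, where the inner product has simply been replaced by a kernel $K$:

```python
import numpy as np

def predict(x, X_train, y_train, alphas, b, K):
    """Kernelized hypothesis: sign( sum_i alpha_i y^(i) K(x^(i), x) + b )."""
    s = b
    for x_i, y_i, a_i in zip(X_train, y_train, alphas):
        s += a_i * y_i * K(x_i, x)
    return np.sign(s)

# Step 4 in action: any valid kernel can stand in for <x^(i), x>.
linear_K = lambda x, z: x @ z                # K(x, z) = x^T z
poly_K   = lambda x, z: (x @ z + 1.0) ** 2   # K(x, z) = (x^T z + c)^2 with c = 1
```

Here `alphas` and `b` are assumed to come from solving the dual problem above; at prediction time only kernel evaluations are needed, never $\phi(x)$ itself.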

Andrew Ng thinks the “no free lunch theorem” is a fascinating theoretical concept, but he finds it less useful in practice because we have inductive biases that turn out to be useful.

cf. The no free lunch theorem (famous in machine learning and optimization) basically says that, in the worst case, learning algorithms do not work. (ref. https://en.wikipedia.org/wiki/No_free_lunch_theorem)

… but the universe is not that hostile toward us 😁 learning algorithms turn out to work okay in practice.

ex 1.

$$x=\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix}\in\mathbb{R}^n,\quad\phi(x)=\begin{bmatrix}x_1x_1\\x_1x_2\\x_1x_3\\x_2x_1\\x_2x_2\\x_2x_3\\x_3x_1\\x_3x_2\\x_3x_3\end{bmatrix}\in\mathbb{R}^{n^2},\quad\phi(z)=\begin{bmatrix}z_1z_1\\z_1z_2\\z_1z_3\\z_2z_1\\z_2z_2\\z_2z_3\\z_3z_1\\z_3z_2\\z_3z_3\end{bmatrix}\in\mathbb{R}^{n^2}\quad(\text{illustrated for }n=3)$$

Since $\phi(x)$ has $n^2$ elements, computing $\phi(x)$ or $\phi(x)^T\phi(z)$ explicitly takes $O(n^2)$ time.

$K(x,z)=\phi(x)^T\phi(z)=(x^Tz)^2$ $\rightarrow$ $O(n)$ time computation.

$$\begin{aligned}(x^Tz)^2&=\left(\sum_{i=1}^nx_iz_i\right)\left(\sum_{j=1}^nx_jz_j\right)\\&=\sum_{i=1}^n\sum_{j=1}^nx_iz_ix_jz_j\\&=\sum_{i=1}^n\sum_{j=1}^n(x_ix_j)(z_iz_j)\end{aligned}$$

where $x_ix_j$ and $z_iz_j$ are exactly the entries of $\phi(x)$ and $\phi(z)$.
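A quick numerical check of this identity (NumPy, random vectors): build $\phi$ explicitly as the $n^2$ products $x_ix_j$ and compare $\phi(x)^T\phi(z)$ with $(x^Tz)^2$.

```python
import numpy as np

np.random.seed(1)
n = 3
x, z = np.random.randn(n), np.random.randn(n)

phi = lambda v: np.outer(v, v).ravel()   # all n^2 products v_i v_j  (O(n^2))
lhs = phi(x) @ phi(z)                    # phi(x)^T phi(z)
rhs = (x @ z) ** 2                       # (x^T z)^2                 (O(n))
print(np.isclose(lhs, rhs))              # True
```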

ex 2.

$$K(x,z)=(x^Tz+c)^2,\;\;c\in\mathbb{R}$$

where

$$x=\begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix}\in\mathbb{R}^n,\quad\phi(x)=\begin{bmatrix}x_1x_1\\x_1x_2\\x_1x_3\\x_2x_1\\x_2x_2\\x_2x_3\\x_3x_1\\x_3x_2\\x_3x_3\\\sqrt{2c}\,x_1\\\sqrt{2c}\,x_2\\\sqrt{2c}\,x_3\\c\end{bmatrix}\in\mathbb{R}^{n^2+n+1},\quad\phi(z)=\begin{bmatrix}z_1z_1\\z_1z_2\\z_1z_3\\z_2z_1\\z_2z_2\\z_2z_3\\z_3z_1\\z_3z_2\\z_3z_3\\\sqrt{2c}\,z_1\\\sqrt{2c}\,z_2\\\sqrt{2c}\,z_3\\c\end{bmatrix}\in\mathbb{R}^{n^2+n+1}.$$
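The same check for $(x^Tz+c)^2$, confirming that the $\sqrt{2c}\,x_i$ entries and the constant $c$ are exactly what make the explicit feature map match the kernel (a NumPy sketch with an arbitrary choice of $c$):

```python
import numpy as np

np.random.seed(2)
n, c = 3, 2.0
x, z = np.random.randn(n), np.random.randn(n)

def phi_c(v):
    # [all v_i v_j] + [sqrt(2c) v_i] + [c]  ->  n^2 + n + 1 entries
    return np.concatenate([np.outer(v, v).ravel(), np.sqrt(2 * c) * v, [c]])

print(np.isclose(phi_c(x) @ phi_c(z), (x @ z + c) ** 2))   # True
```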

ex 3.

$$K(x,z)=(x^Tz+c)^d,\;\;c\in\mathbb{R}$$

$O(n)$ time.

$\phi(x)$ contains all $\binom{n+d}{d}$ monomial features of degree up to $d$.
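To see why this matters, a tiny sketch comparing the $\binom{n+d}{d}$ features that $\phi(x)$ would contain against the $O(n)$ cost of evaluating the kernel directly:

```python
from math import comb

n, d = 100, 5
print(comb(n + d, d))   # 96560646 monomial features of degree <= d in phi(x)
# ...yet K(x, z) = (x^T z + c)^d costs only O(n): one dot product and one power.
```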

Optimal margin classifier + Kernel tricks = SVM

Nice visualization of how SVM works:

SVM with a polynomial Kernel visualization created by Udi Aharoni

Video link: https://youtu.be/3liCbRZPrZA

SVM in short:

Map the data to a much higher-dimensional feature space and find a linear decision boundary, i.e. a hyperplane, in that space. Projected back to the original feature space, this gives a non-linear (or, if you are lucky, linear) decision boundary.

Q. Are the features in the higher-dimensional feature space always linearly separable?

A. So far we are pretending they are, but we will come back and relax that assumption later in this lecture.

How to make kernels

If $x, z$ are “similar,” $K(x,z)=\phi(x)^T\phi(z)$ is “large.”

If $x, z$ are “dissimilar,” $K(x,z)=\phi(x)^T\phi(z)$ is “small.”

(Think about the properties of the inner product.)

$$K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)$$

is called the Gaussian kernel (the most widely used kernel).
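A minimal sketch of the Gaussian kernel that makes the “similar → large, dissimilar → small” behavior concrete (NumPy, made-up points):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, x + 0.1))   # ~0.99  : nearby points  -> K close to 1
print(gaussian_kernel(x, x + 5.0))   # ~1e-11 : distant points -> K close to 0
```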

Let $\{x^{(1)},\cdots,x^{(d)}\}$ be $d$ points and $K\in\mathbb{R}^{d\times d}$ be the kernel matrix where $K_{ij}=K(x^{(i)},x^{(j)})$.

Given any vector $z$,

$$\begin{aligned}z^TKz&=\sum_i\sum_jz_iK_{ij}z_j\\&=\sum_i\sum_jz_i\phi(x^{(i)})^T\phi(x^{(j)})z_j\\&=\sum_i\sum_jz_i\sum_k\left(\phi(x^{(i)})\right)_k\left(\phi(x^{(j)})\right)_kz_j\\&=\sum_k\sum_i\sum_jz_i\left(\phi(x^{(i)})\right)_k\left(\phi(x^{(j)})\right)_kz_j\\&=\sum_k\left(\sum_iz_i\left(\phi(x^{(i)})\right)_k\right)^2\\&\ge0.\end{aligned}$$

$\Rightarrow K\ge0$ (positive semi-definite).
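This is easy to check numerically: build the kernel matrix on a handful of points and confirm that its eigenvalues are non-negative (up to round-off). A small NumPy sketch using the Gaussian kernel above:

```python
import numpy as np

np.random.seed(3)
points = np.random.randn(6, 2)                     # d = 6 arbitrary points in R^2

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

K = np.array([[gaussian_kernel(a, b) for b in points] for a in points])
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))     # True: K is positive semi-definite
```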

Theorem (Mercer)

$K$ is a valid kernel function (i.e., $\exists\,\phi$ s.t. $K(x,z)=\phi(x)^T\phi(z)$) if and only if for any $d$ points $\{x^{(1)},\cdots,x^{(d)}\}$, the corresponding kernel matrix satisfies $K\ge0$.

Linear kernel: $K(x,z)=x^Tz,\;\phi(x)=x.$

Gaussian kernel: $K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right),\;\phi(x)\in\mathbb{R}^\infty.$

Note:

All of the learning algorithms we have seen so far can be written in terms of inner products, so the kernel trick can be applied to them.

e.g. linear regression, logistic regression, everything in the generalized linear model family, the perceptron algorithm, etc.

**$L_1$ norm soft margin SVM:**

$$\begin{aligned}\min_{w,b,\xi}\;\;&\frac{1}{2}\|w\|^2+c\sum_{i=1}^m\xi_i\\\text{s.t.}\;\;&y^{(i)}(w^Tx^{(i)}+b)\ge1-\xi_i,\;\;i=1,\cdots,m,\\&\xi_i\ge0.\end{aligned}$$

Why use the $L_1$ norm soft margin SVM?

Imagine adding one training example labeled “x” that lies close to the training examples labeled “o.” Because the basic optimal margin classifier optimizes the worst-case margin, this single example can cause the decision boundary to swing dramatically → one point can have a huge impact on the decision boundary.

The $L_1$ norm soft margin SVM, however, keeps the decision boundary close to the original one even when there is such an outlier.
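As an illustration (not from the lecture), scikit-learn's `SVC` implements this soft-margin formulation; its parameter `C` plays the role of $c$ above, controlling how heavily each slack $\xi_i$ is penalized. A minimal sketch, assuming scikit-learn is available, that adds one mislabeled outlier and compares a near-hard-margin fit (huge `C`) with a soft-margin fit (moderate `C`):

```python
import numpy as np
from sklearn.svm import SVC

np.random.seed(4)
X = np.vstack([np.random.randn(20, 2) - 2, np.random.randn(20, 2) + 2])
y = np.array([-1] * 20 + [1] * 20)
X = np.vstack([X, [[2.0, 2.0]]])       # one "x"-labeled point deep inside "o" territory
y = np.append(y, -1)

for C in (1e6, 1.0):                   # huge C ~ hard margin, moderate C ~ soft margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.coef_, clf.intercept_)
# With moderate C the boundary stays close to where it was without the outlier;
# with huge C the optimizer works much harder to shrink the outlier's slack,
# pulling the boundary toward it.
```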

$$\begin{aligned}\max\;&\sum_i\alpha_i-\frac{1}{2}\sum_i\sum_jy^{(i)}y^{(j)}\alpha_i\alpha_j\left\langle x^{(i)},x^{(j)}\right\rangle\\\text{s.t.}\;\;&0\le\alpha_i\le c,\\&\sum_iy^{(i)}\alpha_i=0,\;\;i=1,\cdots,m.\end{aligned}$$

As before, any valid kernel can be substituted for the inner product, e.g. $K(x,z)=(x^Tz)^d$ or $K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)$.

Protein sequence classifier

Suppose there are 26 amino acids, labeled A through Z (even though we all know there are actually only 20).

seq ex: BAJTSTAIBAJTAU…

$\phi(x)=?$

Let $\phi(x)$ count the number of occurrences of every possible length-4 block of amino acids in the sequence, so $\phi(x)\in\mathbb{R}^{26^4}$; the kernel $K(x,z)=\phi(x)^T\phi(z)$ can still be computed efficiently without ever building this vector explicitly.
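A sketch of one way such a kernel could be computed (my own illustration, not from the lecture): count the length-4 blocks that actually occur in each sequence and take a sparse inner product of the two count dictionaries, never materializing the $26^4$-dimensional $\phi$.

```python
from collections import Counter

def kmer_counts(seq, k=4):
    """Sparse phi(x): counts of every length-k block that occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_kernel(s1, s2, k=4):
    """K(x, z) = phi(x)^T phi(z), summed over shared k-mers only."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(cnt * c2[kmer] for kmer, cnt in c1.items() if kmer in c2)

print(kmer_kernel("BAJTSTAIBAJTAU", "TAIBAJTSTA"))   # 8 (inner product over the shared 4-mers)
```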
