Support Vector Machines

‍이세현·2024년 4월 3일

Optimization Objective

Logistic regression 대안

Logitcis regression cost: $-(y\text{log}h_\theta(x) + (1-y)\text{log}(1-h_\theta(x)))$
- $y=1$ 일 때 $\theta^Tx\gg0$ 이면 loss가 0으로 수렴한다.
- $y=0$ 일 때 $\theta^Tx\ll0$ 이면 loss가 0으로 수렴한다.
Support vector machine cost: $y^{(i)} cost_1(\theta^T x^{(i)}) + (1 - y^{(i)}) cost_0(\theta^T x^{(i)})$
- Hinge loss $L(w) = \sum_{m=1}^{M}\text{max}(0, 1-y_m\hat{y}(x_m, w))$
- if $y_m=1, cost_1$
  - 1 이상의 값으로 예측한 경우 loss는 0이어야 한다.
    $\text{max}(0, 1-1\times1.\text{xx}) = 0$
  - 1 이하의 값으로 예측한 경우 loss는 0보다 커야한다.
    $\text{max}(0, 1-1\times0.\text{xx}) > 0$
- if $y_m=-1, cost_0$
  - -1 이하의 값으로 예측한 경우 loss는 0이어야 한다.
    $\text{max}(0, 1+1\times(-1.\text{xx})) = 0$
  - -1 이상의 값으로 예측한 경우 loss는 0보다 커야한다.
    $\text{max}(0, 1+1\times(-0.\text{xx})) > 0$

\text{min}_\theta C\sum_{i=1}^{m}y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)}) + \frac{1}{2}\sum_{i=1}^{n}{\theta_j}^2

Large Margin Intuition

$y = 1$ 일 때 $\theta^Tx \geq 1$
- $cost_1 = \text{max}(0, 1-\hat{y})$
- 예측값이 1보다 작으면 loss가 커진다.
- 정답에 가까워지기 위해서 예측값은 1보다 커야 한다.
$y = 0$ 일 때 $\theta^Tx \leq -1$
- $cost_0 = \text{max}(0, 1+\hat{y})$
- 예측값이 -1보다 크면 loss가 커진다.
- 정답에 가까워지기 위해서 예측값은 -1보다 작아야 한다.
logistic regression과 달리 decision boundary는 다양할 수 있다.
- 분류만 잘 한다고 해서 좋은 boundary인지는 알 수 없다.
- data는 잘 표현하지만 다른 input이 들어왔을 때 적절하지 않을 수 있다.
- SVM은 margin을 최대화하는 decision boundary를 찾는다.

SVM Decision Boundary

$y = 1$ 일 때 $\theta^Tx \geq 1$ 이고 $y = 0$ 일 때 $\theta^Tx \leq -1$ 을 만족하면 margin을 고려하여 정규화를 해야한다.
- $\text{min}_\theta \sum_{j=1}^{n}{\theta_j}^2$

$\theta$ 와 decision boundary는 항상 수직이다.

Large margin classifier in presence of outliers

C=\frac{1}{\lambda}

$C$ 가 커지면 규제가 약해지므로 overfitting이 발생한다.
$C$ 에 따라 boundary fitting 정도가 달라진다.

The mathematics behind large margin classification

Vector Inner Product

$\theta$ : model
$x$ : input data
$\theta^Tx$ : $x$ 를 model $\theta$ 에 투영
- $\theta^Tx = \|\theta\|\|x\|\cos\alpha$
- $p = \|x\|\cos\alpha$
- $\theta^Tx$
  - 투영한 $p$ 와 $\theta$ 를 곱하는 것
  - $\theta$ 에 정사영했을 때 방향과 길이 값을 통해 학습을 진행한다.

SVM Decision Boundary

$y^{(i)}=1$ 일 때
- $\theta^T x^{(i)} = \|\theta\|\|x\|\cos\alpha \geq 1$
- data를 모델에 정사영했을 때 model( $\theta$ )과 같은 방향, 길이는 적당히 길어야 한다.
$y^{(i)}=0$ 일 때
- $\theta^T x^{(i)} = \|\theta\|\|x\|\cos\alpha \leq -1$
- data를 모델에 정사영했을 때 model( $\theta$ )과 반대 방향, 길이는 적당히 길어야 한다.

Kernels

Non-linear Decision Boundary

Decision boundary는 항상 linear 형태일 수 없다.
- 특정 그래프의 내부, 외부로 boundary가 결정되는 경우도 있다.
- feature의 조합으로 decision boundary가 계산된다.
- $y=1 \text{ if } \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1x_2 + \theta_4{x_1}^2 + ... \geq 0$
hypothesis의 변수가 입력의 조합으로 이루어진 경우 feature로 대체한다.
- $\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1x_2 + \theta_4{x_1}^2 + ...$
  $=\theta_0+\theta_1f_1 + \theta_2f_2 + \theta_3f_3 + \theta_4f_4 + ...$

Kernel

선형적으로 데이터를 분류할 수 없을 때 데이터를 높은 차원으로 이동시켜 고차원 공간에서 데이터를 분류할 수 있다.
- Kernel: similarity를 구하는 함수
$f$ : 기존의 feature $x$ 를 변형시켜 만든 새로운 feature
- $f$ 는 landmark $l$ 과 $x$ 의 유사도로 정의한다.
- landmark $l$ 은 임의로 정한다.
- $f = \text{similarity}(x, l^{(i)})$ : data와 landmark의 유사도

Kernels and Similarity

f_1=\text{similarity}(x,l^{(i)})=\exp(-\frac{{\|x-l^{(1)}\|}^2}{2\sigma^2})

Gaussian kernel 사용

$x$ 와 $l^{(1)}$ 이 거의 같을 때 $f_1$ 은 거의 1로 수렴한다.
$x$ 와 $l^{(1)}$ 이 큰 차이가 날 때 $f_1$ 은 거의 0으로 수렴한다.

Example

\text{Predict 1 when } \theta_0 + \theta_1f_1 + \theta_2f_2 + \theta_3f_3 \geq 0

$\theta_0=0.5, \theta_1=1, \theta_2=1, \theta_3=0$
- $l^{(1)}$ 과 $l^{(2)}$ 에 가까우면 1
$\theta_2=-1$
- $l^{(2)}$ 에 가까우면 0

Choosing the landmarks

임의로 점 $l$ 을 찍고 $\theta$ 를 학습한다.

SVM with Kernels

모든 data를 사용하여 landmark를 결정한다.

Given $(x^{(1)}, y^{(1)}),(x^{(2)}, y^{(2)}),...,(x^{(m)}, y^{(m)})$
choose $l^{(1)}=x^{(1)},l^{(2)}=x^{(2)},...,l^{(m)}=x^{(m)}$
$f_1=\text{similarity}(x,l^{(1)})$
$f^{(i)}=\begin{bmatrix} f_0^{(i)} \\ f_1^{(i)} \\ \vdots \\ f_m^{(i)} \end{bmatrix}$

SVM parameters C

C = \frac{1}{\lambda}

Large C: $\lambda$ 가 작으므로 정규화가 덜 되어 overffiting
- variance가 크기 때문에 model이 data의 영향을 많이 받는다.
Small C: $\lambda$ 가 크므로 정규화가 심하게 되어 underfitting
- variance가 작다.

Gaussian kernel의 변수

\sigma^2

Large $\sigma^2$ : $f$ 가 넓게 분포되어 있다.
- 거리 값이 모두 비슷하여 underfitting
- 모델의 분산은 작아진다.
- Higher bias
Small $\sigma^2$ : $f$ 가 평균 가까이 분포되어 있다.
- landmark에 매우 민감한 overfitting
- 모델의 분산은 커진다.
- Lower bias

Using an SVM

Other choices of Kernel

모두 거리 기반 kernel인 것은 아니다.
String kernel, chi-square kernel, histogram intersection kenler, ...

Multi-class classification

$(\theta^{(i)})^Tx$ 가 가장 큰 $i$ 를 선택한다.
- one vs all method
- 투영하였을 때 그 값이 가장 크다는 것은 해당 class에서 margin이 가장 크다는 뜻이므로 그 class에 속할 확률이 가장 높다.

Logistic regression vs SVMS

$n$ 은 feature의 개수, $m$ 은 훈련 dataset의 개수일 때

$n$ 이 $m$ 보다 크다면
- logistic regression 또는 linear kernel을 사용하는 것이 좋다.
- SVM을 사용할 경우 overfitting이 발생할 수 있다.
$n$ 이 작고 $m$ 이 적당히 있다면
- Gaussian kernel을 이용하여 SVM을 사용한다.
- feature가 적으므로 feature를 조합하여 고차원 공간에서 데이터를 분류한다.
$n$ 이 작고 $m$ 이 크다면
- feature를 늘린 후에 logistic regression이나 linear kernel를 사용하는 것이 좋다.
Neural network는 대부분의 설정에서 SVM을 대체하여 잘 작동하지만 훈련 속도가 느릴 수 있다.

‍이세현

Hi, there 👋

이전 포스트

Clustering

다음 포스트

Support Vector Machines

Optimization Objective

Logistic regression 대안

Large Margin Intuition

SVM Decision Boundary

Large margin classifier in presence of outliers

The mathematics behind large margin classification

Vector Inner Product

SVM Decision Boundary

Kernels

Non-linear Decision Boundary

Kernel

Kernels and Similarity

Example

Choosing the landmarks

SVM with Kernels

SVM parameters C

Gaussian kernel의 변수

Using an SVM

Other choices of Kernel

Multi-class classification

Logistic regression vs SVMS

Clustering

Deep Learning Basics 2

0개의 댓글

관련 채용 정보