[Stanford CS229] Lecture 7 - Kernels

jhbale11 · July 1, 2021

1. Kernels

In linear regression, we had a problem in which the input x was the living area of a house, and we considered performing regression using the features $x, x^2, x^3$ to obtain a cubic function. To distinguish between these two sets of variables, we'll call the original input value the input attributes of the problem.

When that is mapped to some new set of quantities that are then passed to the learning algorithm, we'll call those new quantities the input features. We will also let $\phi$ denote the feature mapping, which maps from the attributes to the features.

In our example we had

$$\phi(x) = [\,x,\; x^2,\; x^3\,]$$

Rather than applying SVMs using the original input attributes $x$, we may instead want to learn using some features $\phi(x)$. To do so, we simply need to go over our previous algorithm and replace $x$ everywhere in it with $\phi(x)$.
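As a minimal sketch (my own, not from the lecture notes), the small helper below builds the cubic feature vector $\phi(x)$ for a scalar living-area input; "learning with $\phi(x)$ instead of $x$" simply means feeding this vector to the algorithm wherever the raw attribute was used before. The numeric living-area value is made up for illustration.

```python
import numpy as np

def phi(x: float) -> np.ndarray:
    """Map a scalar attribute x to the cubic feature vector [x, x^2, x^3]."""
    return np.array([x, x**2, x**3])

# Hypothetical example: living area in units of 1000 sq. ft.
living_area = 2.5
print(phi(living_area))  # [ 2.5  6.25  15.625]
```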

Since the algorithm can be written entirely in terms of the inner products $\langle x, z \rangle$, this means that we would replace all those inner products with $\langle \phi(x), \phi(z) \rangle$. Specifically, given a feature mapping $\phi$, we define the corresponding Kernel to be

$$K(x,z) = \phi(x)^T \phi(z)$$

Then everywhere we previously had $\langle x, z \rangle$ in our algorithm, we could simply replace it with $K(x,z)$, and our algorithm would now be learning using the features $\phi$.

For example, suppose $K(x,z) = (x^T z)^2$. We can also write it in this form:

$$K(x,z) = \left(\sum_{i=1}^{n} x_i z_i\right)\left(\sum_{j=1}^{n} x_j z_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} (x_i x_j)(z_i z_j)$$

Thus, we see that $K(x,z) = \phi(x)^T \phi(z)$, where the feature mapping $\phi$ is given by

$$\phi(x) = [\,x_1x_1,\; x_1x_2,\; x_1x_3,\; x_2x_1,\; x_2x_2,\; x_2x_3,\; x_3x_1,\; x_3x_2,\; x_3x_3\,]^T$$

The example above shows the case where $x$ has $n = 3$. If the kernel instead has the form $K(x,z) = (x^T z + c)^2$, the feature mapping becomes

$$\phi(x) = [\,x_1x_1,\; x_1x_2,\; x_1x_3,\; x_2x_1,\; x_2x_2,\; x_2x_3,\; x_3x_1,\; x_3x_2,\; x_3x_3,\; \sqrt{2c}\,x_1,\; \sqrt{2c}\,x_2,\; \sqrt{2c}\,x_3,\; c\,]^T$$

Expressed more generally, for $K(x,z) = (x^T z + c)^d$ the feature mapping corresponds to a $\binom{n+d}{d}$-dimensional feature space. Even though this space has dimension $O(n^d)$, computing $K(x,z)$ still takes only $O(n)$ time, because the kernel is evaluated directly from $x^T z$ without ever constructing $\phi(x)$.
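A quick numerical check (my own sketch, with made-up input values): for $n = 3$ and $d = 2$, the explicit listing of $\phi(x)$ above has 13 (redundant) coordinates, and the space of distinct monomials of degree up to 2 has $\binom{3+2}{2} = 10$ dimensions, yet the kernel value falls out of a single 3-dimensional dot product.

```python
import numpy as np
from itertools import product
from math import comb, sqrt

def phi(x: np.ndarray, c: float) -> np.ndarray:
    """Explicit (redundant) feature map for K(x, z) = (x^T z + c)^2, as listed above."""
    quad = [x[i] * x[j] for i, j in product(range(len(x)), repeat=2)]  # x_i x_j terms
    lin = [sqrt(2 * c) * xi for xi in x]                               # sqrt(2c) x_i terms
    return np.array(quad + lin + [c])

def K(x: np.ndarray, z: np.ndarray, c: float) -> float:
    """Kernel computed directly in O(n) time, never forming phi."""
    return (x @ z + c) ** 2

x = np.array([1.0, 2.0, 3.0])   # toy values
z = np.array([0.5, -1.0, 2.0])
c = 1.0

print(phi(x, c) @ phi(z, c))  # 30.25: inner product in the high-dimensional feature space
print(K(x, z, c))             # 30.25: same value from one 3-dimensional dot product
print(comb(3 + 2, 2))         # 10 distinct monomials of degree <= 2 in 3 variables
```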

If $x$ and $z$ are similar, $K(x,z) = \phi(x)^T \phi(z)$ will be large,
and if $x$ and $z$ are dissimilar, $K(x,z) = \phi(x)^T \phi(z)$ will be small.
This is because the kernel is an inner product between the feature vectors $\phi(x)$ and $\phi(z)$, and an inner product measures how well two vectors align.
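As a rough illustration with toy numbers of my own choosing, the quadratic kernel from above returns a much larger value for a pair of nearly parallel vectors than for a nearly orthogonal pair:

```python
import numpy as np

def K(x: np.ndarray, z: np.ndarray, c: float = 1.0) -> float:
    """Quadratic kernel K(x, z) = (x^T z + c)^2."""
    return (x @ z + c) ** 2

x = np.array([1.0, 1.0, 0.0])
z_similar = np.array([0.9, 1.1, 0.1])      # points in nearly the same direction as x
z_dissimilar = np.array([1.0, -1.0, 0.0])  # orthogonal to x

print(K(x, z_similar))     # large: (0.9 + 1.1 + 1)^2 = 9.0
print(K(x, z_dissimilar))  # small: (1 - 1 + 1)^2 = 1.0
```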

If you have any learning algorithm that you can write in terms of only inner products $\langle x, z \rangle$ between input attribute vectors, then by replacing them with $K(x,z)$, where $K$ is a kernel, you can "magically" allow your algorithm to work efficiently in the high-dimensional feature space corresponding to $K$.
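To make this concrete, here is a minimal sketch of a kernelized perceptron, used as a simpler stand-in for the SVM discussed in lecture (this is my own illustration, not code from the course). The weight vector is kept implicitly as a sum of training points, so training and prediction only ever need kernel evaluations $K(x^{(j)}, x)$ and never touch $\phi(x)$ explicitly.

```python
import numpy as np

def quadratic_kernel(x: np.ndarray, z: np.ndarray, c: float = 1.0) -> float:
    """Any valid kernel works here; this is the (x^T z + c)^2 example from above."""
    return (x @ z + c) ** 2

def kernel_perceptron(X: np.ndarray, y: np.ndarray, kernel, epochs: int = 10) -> np.ndarray:
    """Train a perceptron written purely in terms of kernel evaluations.

    The implicit weight vector is w = sum_j alpha_j * y_j * phi(x_j),
    so the decision value needs only K(x_j, x) = <phi(x_j), phi(x)>.
    """
    m = X.shape[0]
    alpha = np.zeros(m)  # per-example mistake counts
    for _ in range(epochs):
        for i in range(m):
            # decision value f(x_i) = sum_j alpha_j * y_j * K(x_j, x_i)
            f = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(m))
            if y[i] * f <= 0:   # mistake: add this example to the implicit weight vector
                alpha[i] += 1.0
    return alpha

def predict(alpha, X, y, kernel, x_new):
    return np.sign(sum(alpha[j] * y[j] * kernel(X[j], x_new) for j in range(len(alpha))))

# Toy XOR-like data: not linearly separable in the original attributes,
# but separable in the quadratic feature space induced by the kernel.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

alpha = kernel_perceptron(X, y, quadratic_kernel)
print([predict(alpha, X, y, quadratic_kernel, x) for x in X])  # recovers [1, 1, -1, -1]
```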
