[CS229] Supervised Learning, Part I

Sung Jae Hyuk·2023년 7월 1일

CS229

목록 보기

1/3

Notation

$\mathcal{X}$ , $\mathcal{Y}$ : 각각 Input value와 Output value의 Space
$(x^{(i)},\:y^{(i)})$ : Training example로, $x^{(i)}$ 은 Feature, $y^{(i)}$ 은 target(or output)을 의미.
$\{(x^{(i)},\:y^{(i)}\}_{i=1}^n$ : Sample이 총 $n$ 개인 Training dataset
$h\::\:\mathcal{X}\rightarrow\mathcal{Y}$ : Feature $x\in\mathcal{X}$ 에 대한 output 값을 도출해내는 가설(hypothesis)
$\mathcal{H}$ : Hypothesis $h$ 가 가질 수 있는 space (추후 PAC Learning때 소개)

들어가기 앞서

아래의 내용들은 최대한 CS229 Lecture Note, Spring '23의 표기 및 내용을 따라갑니다.
간혹 수학적인 내용들이 나오게 되는데, 그런 경우엔 최대한 보충 설명을 넣거나 Lecture Note보다는 자세하게 쓸 예정입니다.
기초적인 확률적 지식이나 선형 대수적 지식은 가지고 있으셔야 보기 편합니다.

Linear Regression

Motivation

집 가격 예측 문제를 생각하자. 다음은 면적과 방의 수, 그리고 그 집의 가격을 표로 나타낸 것이다.

\begin{array}{c|c|c} \text{Living area (feet}^2) & \text{\#Bedrooms} & \text{Price (1000\$s)} \\ \hline 2104 & 3 & 400 \\ \hline 1600 & 3 & 330 \\ \hline 2400 & 3 & 369 \\ \hline \vdots & \vdots & \vdots \\ \end{array}

즉, 우리가 하고 싶은 일은 집의 면적과 방의 수가 들어왔을 때, 그 집의 가격이 대략 얼마인지를 유추하고 싶은 것이다.
위의 내용을 일반적으로 표현하면, input vector $\mathbf{x}=\begin{bmatrix}x_1&x_2\end{bmatrix}^\top\in\R^2$ 이 들어왔을 때 올바른 output $y$ 를 찾는 함수 $f$ (혹은 hypothesis $h$ )를 찾는 문제이다.
그러한 $h$ 중 복잡한 형태도 많을 것이고 단순한 형태도 많을 것인데 그 중에서 가장 단순한 형태라고 할 수 있는 Linear 모델을 생각하자.
즉, $h(\mathbf{x})=\theta_0+\theta_1x_1+\theta_2x_2$ 라 표현하자. 이때, $\theta_i$ 들은 parameter라 하고 이들은 Input $\mathbf{x}\in\mathcal{X}$ 를 output space $\mathcal{Y}$ 로 mapping하는 역할을 한다. 위의 사항을 일반화하자.
$\mathcal{X}=\R^n$ 에 대해 $x\in\mathcal{X}$ 라 하고 parameter 벡터 $\theta$ 에 대해 선형 모델 $h\::\:\mathcal{X}\rightarrow\mathcal{Y}$ 는 다음과 같이 정의된다.

h(\mathbf{x})=\sum_{i=0}^n \theta_i x_i=\theta^\top \mathbf{x}^+

이때 $\mathbf{x}^+=\begin{bmatrix}x_0(=1)&x_1&\cdots&x_n\end{bmatrix}^\top$ 을 의미한다.
그러면, 이제 우리가 주목해야할 부분은 "저 parameter $\theta$ 를 어떻게 얻어낼 수 있을까?"가 된다. 뭐 방법은 모르겠지만, 우리가 예측한 값과 output 간의 정답 간의 차이가 작아야 한다는 것은 확실히 알 수 있다. 즉, 우리는 $|h(\mathbf{x})-y|$ 를 최소화하는 것을 목표로 잡을 수 있다. (흔히들 $L_1/\text{MAE}$ 라 함)
허나, 이렇게 잡아버리면 중심에서 미분이 불가능하고, 추후 사용할 Gradient Descent에서 Gradient가 고정이 되기 때문에 변화량이 값의 차이를 반영하지 못한다는 단점이 있다.
이 단점을 해결하기 위하여, 에러의 값을 제곱하여 합하는 것이 있고, 이를 MSE라고 한다.
미분가능성과 대칭성, Gradient의 값이 현재의 값을 반영한다는 점에서 자주 쓰이며, 추후 CS229에서도 딱히 말이 없으면 해당 Loss를 사용하는 것으로 간주한다. 표기의 편함을 위해, cost function $J(\theta)$ 를 다음과 같이 정의하자.

J(\theta)=\dfrac{1}{2}\sum_{i=1}^n (h_{\theta}(x^{(i)})-y^{(i)})^2

Gradient Descent

최종 목표는 $J(\theta)$ 를 최소화하는 $\theta$ 를 고르는 것이 목표이고, 가장 쉽고 잘 알려져 있는 방법은 Gradient Descent이다.
Gradient Descent는 현재 $\theta$ 의 값에서 최솟값까지 가는 가장 빠른 방향이(값 X) 현재 $\theta$ 에서의 Gradient임을 이용하여 $\theta$ 를 변화시키는 방법이다.
Update Formula로는 다음과 같다.

\theta_{t+1}\leftarrow \theta_t-\alpha\nabla_{\theta_t} J(\theta_t)

Gradient를 구하기 위해 $J(\theta)$ 를 $\theta$ 에 대한 미분값이 필요하고, 합은 각각 분리할 수 있으므로 $\theta$ 에 대한 $\dfrac{1}{2}(h_\theta (x^{i})-y^{(i)})^2$ 의 Gradient를 알아야한다. 일반성을 잃지않고, $J(\theta)=\dfrac{1}{2}(h_\theta(\mathbf{x})-y)^2$ 라 하자.

\begin{aligned} dJ(\theta)&=\langle 1,\:d\left(\dfrac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2\right)\rangle\\ &=\langle 1,\:h_\theta(x^{(i)}-y^{(i)})\times d(h_\theta(x^{(i)})-y^{(i)})\rangle\\ &=\langle h_\theta(x^{(i)})-y^{(i)},\:d(\theta^\top\mathbf{x}^{(i)}-y^{(i)})\rangle\\ &=\langle h_\theta(x^{(i)})-y^{(i)},\:(d\theta^\top)\mathbf{x}^{(i)}\rangle\\ &=\langle h_\theta(x^{(i)})-y^{(i)},\:(d\theta)^\top\mathbf{x}^{(i)}\rangle\\ &=\langle (h_\theta(x^{(i)})-y^{(i)})(\mathbf{x}^{(i)})^{\top},\:(d\theta)^\top\rangle\\ &=\langle \mathbf{x}^{(i)}(h_\theta(x^{(i)})-y^{(i)}),\:d\theta\rangle=\langle\nabla_{\theta} J(\theta),\:d\theta\rangle\\ \therefore\:\nabla_{\theta} J(\theta)&=\mathbf{x}^{(i)}(h_\theta(x^{(i)})-y^{(i)})\in\R^n \end{aligned}

이 Gradient를 Learning rate( $=\alpha$ )와 계속 곱해나가면서 최적점을 찾아나가는 것이 Gradient Descent이다.

위 사진이 Gradient Descent의 실제 진행 모습. Initial state에서 점점 최적으로 다가감을 알 수 있다.
이 때 전체에 대해 저 Loss를 다 모은 다음 Gradient Descent를 진행하지 않고, 하나의 데이터에 대해 Gradient Descent를 수행하는 것을 Stochastic Gradient Descent, $n$ 개의 mini-batch를 만든 후 그 Batch마다 수행하는 것을 Mini-batch Gradient Descent라 한다. 아~주 뒤에 수학적으로 다룰 예정

Probabilistic interpretation

위의 Linear Model을 생각하자. 위에서 표현하는 오차까지 하나의 식에 넣어버리면, $y=\theta^\top x+\varepsilon$ 으로 적을 수 있을 것이다. 이때, $\varepsilon$ 은 Feature $x$ 에 해당되지 않아서 모델이 고려하지 못하는 요소들을 말하며, 오차로도 볼 수 있다.
이때, 한 가지 강력한 가정을 들고오자. $\varepsilon\sim\mathcal{N}(0,\:\sigma^2)$ , 즉 오차가 평균이 $0$ 이고 분산이 $\sigma^2$ 인 i.i.d 정규분포를 따른다고 가정하자.
정규분포 $\mathcal{N}(0,\sigma^2)$ 의 pdf $p(x)=\dfrac{1}{\sqrt{2\pi}\sigma}\exp\left(-\dfrac{x^2}{2\sigma^2}\right)$ 이므로

p(\varepsilon)=\dfrac{1}{\sqrt{2\pi}\sigma}\exp\left(-\dfrac{\varepsilon^2}{2\sigma^2}\right)

이때 $\varepsilon=y-\theta^\top x$ 이므로

p(y|x;\theta)=\dfrac{1}{\sqrt{2\pi}\sigma}\exp\left(-\dfrac{(y-\theta^\top x)^2}{2\sigma^2}\right)

우리가 구하고 싶은 것은 데이터셋 $\{(x^{(i)},\:y^{(i)}\}_{i=1}^n$ 에 대해서 $p(y|x;\theta)$ 를 최대화시키는 $\theta$ 가 필요이다.
즉, $x$ 가 주어져있다는 상황에서 $\theta$ 를 정했을 때의 확률 $y$ 를 구하고, 이를 최대화하는 $\theta$ 를 미분으로 구하면 된다.
이를 Maximum Lieklihood Estimation (최대우도법)이라고 한다.
이를 기반으로 $\theta$ 에 대한 Likelihood function을 구하면

L(\theta)=\prod_{i=1}^n p(y^{(i)}|x^{(i)};\theta)

이를 미분하는 것은 굉장히 쉽지 않으므로, 양변에 $\log$ 를 씌운 후 고려하자. 즉,

\begin{aligned}l(\theta)&=\sum_{i=1}^n \log p(y^{(i)}|x^{(i)};\theta)\\ &=\log \dfrac{1}{\sqrt{2\pi}\sigma}\exp\left(-\dfrac{(y^{(i)}-\theta^\top x^{(i)})^2}{2\sigma^2}\right)\\ &=-n\log\sqrt{2\pi}\sigma-\dfrac{1}{2\sigma^2}\sum_{i=1}^n (y^{(i)}-\theta^\top x^{(i)})^2 \end{aligned}

여기서 $-n\log\sqrt{2\pi}\sigma$ 는 상수이므로 $l(\theta)$ 를 극대화하는 문제는 곧 $\displaystyle\sum_{i=1}^n (y^{(i)}-\theta^\top x^{(i)})^2$ 을 최소화하는 문제와 같고, 이는 앞에서 소개했던 Least square problem이다.

다음 글에서는 아래의 주제를 다룰 것이다.

Logistic regression
Multi-label classification
Expoential family
GLM

Sung Jae Hyuk

Hello World!

다음 포스트

[CS229] Supervised Learning, Part I

CS229

Notation

들어가기 앞서

Linear Regression

Motivation

Gradient Descent

Probabilistic interpretation

Next Post

[CS229M] Concentration Inequalities - Part I

0개의 댓글

관련 채용 정보