Lecture 3

리치 · March 24, 2023

These notes come from a study session run by the deep.daiv club and were put together with my teammates.

Probabilistic interpretation

Why bother with a probabilistic interpretation?

“To derive why the cost function J takes the form of a sum of squares (least squares).”

What is Probability Distribution?

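A probability distribution is a distribution that closely approximates how likely the collected, observed data are to occur.

It is usually written as $p(x|\theta)$.

**Goal of estimating $\theta$**: find a mathematical model that approximates the true probability distribution of the data as closely as possible.

In other words, find the optimal parameter values that bring the distribution expressed by the model close to the true distribution of the data.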

Assume $y^i= \theta^Tx^i+\epsilon^i$, where $\epsilon^i$ is an error term that captures unmodeled effects and random noise.

$\epsilon^i \sim N(0,\sigma^2)$ (normal, i.e. Gaussian, distribution)

$p(\epsilon^i)={1 \over \sqrt{2\pi}\sigma}\exp\left({-(\epsilon^i)^2\over2\sigma^2}\right)$ (probability density of $\epsilon^i$)

$p(y^i|x^i;\theta)={1 \over \sqrt{2\pi}\sigma}\exp\left({-(y^i-\theta^Tx^i)^2\over2\sigma^2}\right)$ (since $y^i- \theta^Tx^i=\epsilon^i$)

(e.g., the probability density of a particular house price given x and $\theta$)

(the probability density of y given x, parameterized by $\theta$)
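To make the density concrete, here is a minimal sketch that evaluates $p(y|x;\theta)$ for one example. The house-price numbers, `theta`, and `sigma` are made up for illustration and are not from the lecture.

```python
import math

def gaussian_pdf(z, sigma):
    """Density of N(0, sigma^2) evaluated at z."""
    return math.exp(-z * z / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def conditional_density(y, x, theta, sigma):
    """p(y | x; theta): Gaussian density of the residual y - theta^T x."""
    prediction = sum(t * xj for t, xj in zip(theta, x))
    return gaussian_pdf(y - prediction, sigma)

# Hypothetical values: x = [1 (intercept), size], theta picked by hand.
theta = [50.0, 0.1]
x = [1.0, 2000.0]
sigma = 20.0

# theta^T x = 250, so y = 250 sits at the peak of the density.
density_at_mean = conditional_density(250.0, x, theta, sigma)
# y = 290 is two standard deviations away, so its density is lower.
density_off_mean = conditional_density(290.0, x, theta, sigma)
print(density_at_mean, density_off_mean)
```

The density is highest when y equals the prediction $\theta^Tx$ and falls off as the residual grows, which is what the assumption on $\epsilon^i$ encodes.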

Maximum Likelihood Estimation (MLE)

📢 **Likelihood VS Probability:**

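Likelihood: the data are fixed, and the value changes as the parameter varies.

Probability: the parameters are fixed, and it is a probability over the value of y.

MLE is a method for finding the $\theta$ under which the assumed probability model best explains the observed data $X=(x^1,x^2,x^3,...,x^m)$.
That is, it finds the parameter that maximizes the overall probability of the m observed data points occurring.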

$L(\theta)=L(\theta;X,\vec y)=p(\vec y|X;\theta)$

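$L(\theta)=\prod_{i=1}^{m} p(y^i|x^i;\theta)=\prod_{i=1}^{m} {1 \over \sqrt{2\pi}\sigma}\exp\left({-(y^i-\theta^Tx^i)^2\over2\sigma^2}\right)$ (the goal is to maximize $L(\theta)$!)

(The expression above assumes that the data points are generated independently of one another.)

→ The likelihood function is a product, so we take the log to turn it into a sum. (A product of many probabilities also shrinks exponentially fast, which is another reason to work with a sum.)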
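The numerical side of this point can be seen in a small sketch. The per-example densities below are hypothetical (2000 examples, each with density 0.01); the only purpose is to show the product underflowing in double precision while the sum of logs stays finite.

```python
import math

# Hypothetical per-example densities: 2000 examples, each with density 0.01.
densities = [0.01] * 2000

# Multiplying them directly underflows to 0.0 in double precision.
product = 1.0
for p in densities:
    product *= p

# Summing the logs keeps the same information in a representable range.
log_likelihood = sum(math.log(p) for p in densities)

print(product)
print(log_likelihood)
```

Because the log is monotonically increasing, the $\theta$ that maximizes the sum of logs is the same $\theta$ that maximizes the product.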

$l(\theta)=\log L(\theta)=\log\prod_{i=1}^{m} {1 \over \sqrt{2\pi}\sigma}\exp\left({-(y^i-\theta^Tx^i)^2\over2\sigma^2}\right)=\sum_{i=1}^{m}\log\left[{1 \over \sqrt{2\pi}\sigma}\exp\left({-(y^i-\theta^Tx^i)^2\over2\sigma^2}\right)\right]$

$=m\log{1 \over \sqrt{2\pi}\sigma}-{1\over \sigma^2}\times{1\over2}\sum_{i=1}^m(y^i-\theta^Tx^i)^2$

The first term does not depend on $\theta$, so maximizing this function means minimizing ${1\over2}\sum_{i=1}^m(y^i-\theta^Tx^i)^2$.

(Figure: the probabilistic interpretation of machine learning)

**${1\over2}\sum_{i=1}^m(y^i-\theta^Tx^i)^2$ is exactly $J(\theta)$, which shows why the cost function takes the form of a sum of squares.**
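As a quick sanity check of this equivalence, the sketch below generates synthetic data from a one-parameter model $y = \theta x + \epsilon$ (no intercept, to keep it short) and confirms that the $\theta$ with the highest log-likelihood on a grid is the same $\theta$ with the lowest squared-error cost. The data, `sigma`, and the grid are assumptions made for illustration.

```python
import math
import random

random.seed(0)
sigma = 0.5
true_theta = 2.0
xs = [i / 10.0 for i in range(1, 51)]
ys = [true_theta * x + random.gauss(0.0, sigma) for x in xs]

def cost(theta):
    """J(theta) = 1/2 * sum of squared residuals."""
    return 0.5 * sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

def log_likelihood(theta):
    """l(theta) = m * log(1 / (sqrt(2 pi) sigma)) - J(theta) / sigma^2."""
    m = len(xs)
    return m * math.log(1.0 / (math.sqrt(2.0 * math.pi) * sigma)) - cost(theta) / sigma ** 2

# Closed-form least-squares solution for the one-parameter model.
theta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Candidate values around theta_ls, spaced 0.01 apart.
grid = [theta_ls + d / 100.0 for d in range(-100, 101)]
best_by_likelihood = max(grid, key=log_likelihood)
best_by_cost = min(grid, key=cost)

print(theta_ls, best_by_likelihood, best_by_cost)
```

Both searches pick the same grid point, because $l(\theta)$ is a constant minus a positive multiple of $J(\theta)$.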

Locally weighted linear regression

📢 **Parametric Model VS Non-Parametric Model**

Parametric Model: learns a fixed number of parameters.

Non-Parametric Model: the number of parameters grows as the training data grows.

Locally weighted linear regression is a non-parametric method.

Why do we need locally weighted regression?
It treats nonlinear data as linear within a narrow range, fitting with only the values near the target point.

Linear Regression: Fit $\theta$ to minimize the cost function $J(\theta) = \frac{1}{2}\sum_{i=1}^{ m } {(y^i-\theta^Tx^i)^2}$

Locally weighted regression: Fit $\theta$ to minimize $\sum_{i=1}^{ m } w^i{(y^i-\theta^Tx^i)^2}$ (w=weight function)

$w^i=\exp\left({-(x^i-x)^2 \over 2}\right)$

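if $|x^i-x|$ is small, $w^i \approx 1$

if $|x^i-x|$ is large, $w^i \approx 0$

For example, if $|x^1-x|$ is small, its weight w is close to 1,

while if $|x^2-x|$ is large, its weight w is close to 0,

so that point is effectively excluded from the data used for the prediction.

Only the points located close to x are used in the fit.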

Bandwidth Parameter tau( $\tau$ )

A parameter that determines how close to x a data point must be in order to be used.

$w^i=\exp\left({-(x^i-x)^2 \over 2\tau^2}\right)$
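A short sketch of how $\tau$ changes the weights. The query point and the sample points are arbitrary values chosen for illustration.

```python
import math

def weight(x_i, x_query, tau):
    """Gaussian-shaped weight of training point x_i for a query at x_query."""
    return math.exp(-((x_i - x_query) ** 2) / (2.0 * tau ** 2))

x_query = 5.0
near, far = 5.5, 8.0

# Narrow bandwidth: only the nearby point gets a substantial weight.
w_near_narrow = weight(near, x_query, tau=1.0)
w_far_narrow = weight(far, x_query, tau=1.0)

# Wide bandwidth: the distant point also contributes noticeably.
w_far_wide = weight(far, x_query, tau=3.0)

print(w_near_narrow, w_far_narrow, w_far_wide)
```

A small $\tau$ makes the fit very local, while a large $\tau$ moves it back toward ordinary (unweighted) linear regression.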

Locally weighted linear regression is mainly used when there are only a few features (2 or 3) but a lot of data.
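Putting the pieces together, here is a minimal sketch of locally weighted linear regression for a single feature plus an intercept. It solves the weighted least-squares problem in closed form for each query point. The function name `lwr_predict`, the sine-curve data, and the bandwidth values are assumptions for illustration, not the lecture's code.

```python
import math

def lwr_predict(x_query, xs, ys, tau):
    """Fit y ~ theta0 + theta1 * x with weights centered at x_query, then predict."""
    ws = [math.exp(-((x - x_query) ** 2) / (2.0 * tau ** 2)) for x in xs]
    s_w = sum(ws)
    s_x = sum(w * x for w, x in zip(ws, xs))
    s_y = sum(w * y for w, y in zip(ws, ys))
    s_xx = sum(w * x * x for w, x in zip(ws, xs))
    s_xy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    # Closed-form solution of the 2x2 weighted normal equations.
    denom = s_w * s_xx - s_x ** 2
    theta1 = (s_w * s_xy - s_x * s_y) / denom
    theta0 = (s_y - theta1 * s_x) / s_w
    return theta0 + theta1 * x_query

# Nonlinear toy data: a sine curve sampled on [0, 6.2].
xs = [i / 10.0 for i in range(0, 63)]
ys = [math.sin(x) for x in xs]

local_pred = lwr_predict(1.5, xs, ys, tau=0.2)
# A very large tau gives nearly equal weights, i.e. an ordinary global line.
global_pred = lwr_predict(1.5, xs, ys, tau=1000.0)

print(local_pred, global_pred, math.sin(1.5))
```

Note that $\theta$ is refit for every query point, so the whole training set has to be kept around at prediction time. That is the non-parametric aspect described above.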

Logistic regression

Logistic regression is a regression that is used as a classification model.
In binary classification, y takes only the values 0 or 1.

$h_\theta(x) \in[0,1]$ , $h_\theta(x)=g(\theta^Tx)={1 \over 1+e^{-\theta^Tx}}$ , $g(z)= {1\over{1+e^{-z}}}$

$g(z)$ is called the logistic function or sigmoid function.

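$p(y=1|x;\theta)=h_\theta(x)$ (e.g., the probability that y=1 given the tumor size)

$p(y=0|x;\theta)=1-h_\theta(x)$ (e.g., the probability that y=0 given the tumor size)

Compressed into a single expression: $p(y|x;\theta)=h_\theta(x)^y(1-h_\theta(x))^{1-y}$

Just as we used maximum likelihood estimation for linear regression, we use MLE for logistic regression as well (to find the parameter that makes the observed data most probable).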
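The compact form can be checked directly: plugging in y=1 leaves only $h_\theta(x)$, and y=0 leaves only $1-h_\theta(x)$. A small sketch follows. The two-branch `sigmoid` is one common way to avoid overflow in `math.exp` for large negative inputs, and the input 0.8 is arbitrary.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bernoulli_prob(y, h):
    """p(y | x; theta) = h^y * (1 - h)^(1 - y) for y in {0, 1}."""
    return (h ** y) * ((1.0 - h) ** (1 - y))

h = sigmoid(0.8)               # hypothetical value of theta^T x
p_y1 = bernoulli_prob(1, h)    # reduces to h
p_y0 = bernoulli_prob(0, h)    # reduces to 1 - h
print(h, p_y1, p_y0)
```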

$L(\theta)=p(\vec y|X;\theta)=\prod_{i=1}^{m} p(y^i|x^i;\theta)=\prod_{i=1}^{m}h_\theta(x^i)^{y^i}(1-h_\theta(x^i))^{1-y^i}$

$l(\theta)=\log L(\theta)=\sum_{i=1}^{m}\left[y^i\log h_\theta(x^i) + (1-y^i)\log(1-h_\theta(x^i))\right]$

Goal: choose $\theta$ to maximize $l(\theta)$, so we differentiate $l(\theta)$.

For a single training example $(x, y)$:

$\frac{\partial}{\partial \theta_j} l(\theta)=\left(y {1 \over g(\theta^Tx)}-(1-y) {1\over1-g(\theta^Tx)}\right) \frac{\partial}{\partial \theta_j} g(\theta^Tx)$

$=\left(y {1 \over g(\theta^Tx)}-(1-y) {1\over1-g(\theta^Tx)}\right)g(\theta^Tx)(1-g(\theta^Tx))\frac{\partial}{\partial \theta_j}\theta^Tx$

using $g'(z)={d\over dz}{1 \over 1+e^{-z}} ={1 \over (1+e^{-z})^2}\times e^{-z} = {1 \over 1+e^{-z}}\left(1-{1 \over 1+e^{-z}}\right)=g(z)(1-g(z))$

$=\left(y(1-g(\theta^Tx))-(1-y)g(\theta^Tx)\right)x_j=(y-h_\theta(x))x_j$
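One way to double-check the result $(y-h_\theta(x))x_j$ is to compare it against a finite-difference approximation of the single-example log-likelihood. This is only a numerical sanity check under made-up values of `theta`, `x`, and `y`.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}); fine for the moderate z used here."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta, x, y):
    """Single-example l(theta) = y log h + (1 - y) log(1 - h)."""
    h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
    return y * math.log(h) + (1 - y) * math.log(1.0 - h)

def analytic_gradient(theta, x, y):
    """The derived formula: d l / d theta_j = (y - h) * x_j."""
    h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
    return [(y - h) * xj for xj in x]

def numeric_gradient(theta, x, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grads = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += eps
        down = list(theta)
        down[j] -= eps
        grads.append((log_likelihood(up, x, y) - log_likelihood(down, x, y)) / (2.0 * eps))
    return grads

# Hypothetical values for the check.
theta = [0.3, -0.2, 0.5]
x = [1.0, 2.0, -1.5]
y = 1

analytic = analytic_gradient(theta, x, y)
numeric = numeric_gradient(theta, x, y)
print(analytic)
print(numeric)
```

The two lists should agree to several decimal places, which supports the algebra above.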

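This has the same form as the stochastic gradient descent update for linear regression, except that $h_\theta(x)$ is now the sigmoid of $\theta^Tx$.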

$\theta_j := \theta_j +\alpha (y^i-h_\theta(x^i))x_j^i$

logistic regression

$\theta_j := \theta_j + \alpha\frac{\partial}{\partial\theta_j}l(\theta)$ (the sign is positive because $l(\theta)$ has to be maximized)

gradient descent

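$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$ (the sign is negative because $J(\theta)$ has to be minimized: if the slope is positive we move in the negative direction, and if the slope is negative we move in the positive direction, to reach the minimum)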
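The gradient ascent update for logistic regression can be sketched as a short training loop that applies $\theta_j := \theta_j + \alpha (y^i-h_\theta(x^i))x_j^i$ one example at a time. The toy tumor-size data, the learning rate, and the number of epochs are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(xs, ys, alpha=0.01, epochs=5000):
    """Stochastic gradient ascent on the log-likelihood l(theta)."""
    theta = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
            # Plus sign: we climb l(theta) rather than descend a cost.
            theta = [t + alpha * (y - h) * xj for t, xj in zip(theta, x)]
    return theta

# Hypothetical toy data: x = [1 (intercept), tumor size], y = 1 means malignant.
xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0], [1.0, 6.0]]
ys = [0, 0, 1, 0, 1, 1]

theta = train_logistic(xs, ys)
p_small = sigmoid(theta[0] + theta[1] * 1.0)   # estimated p(y=1) for size 1
p_large = sigmoid(theta[0] + theta[1] * 6.0)   # estimated p(y=1) for size 6
print(theta, p_small, p_large)
```

Minimizing $-l(\theta)$ with gradient descent would produce exactly the same updates, so the two sign conventions describe the same procedure.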