1.3 MAP

Eony_Jahng · December 22, 2021

Artificial Intelligence


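These are notes on the course "Introduction to Artificial Intelligence and Machine Learning 1" by Professor Il-Chul Moon of KAIST, offered through KOOC.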

Picking up the thumbtack story from the previous lesson, 1.2 MLE...

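In the previous lesson we tossed a thumbtack 5 times, got heads 3 times and tails 2 times, and arrived at $\theta=0.6$. But isn't it common sense that a tossed thumbtack lands heads or tails with 50:50 odds? Couldn't we compute a probability that takes the prior information we already have into account? The person who thought along these lines was Bayes.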

Bayes Theorem

$$P(\theta|D)=\frac{P(D|\theta)P(\theta)}{P(D)}$$

$$\text{Posterior} = \frac{\text{Likelihood}\times\text{Prior Knowledge}}{\text{Normalizing Constant}}$$

That is, the posterior, the probability of $\theta$ given the data, can be defined as above. The piece we still have to specify is the prior $P(\theta)$.

Incorporating Prior Knowledge

We have already defined the likelihood as $P(D|\theta)=\theta^{a_H}(1-\theta)^{a_T}$. Our hunch that the odds are probably 50:50 can then be brought in as the prior knowledge $P(\theta)$.

More Formula from Bayes Viewpoint

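In Bayes' theorem, $P(D)$ concerns an event that has already happened and does not depend on $\theta$, so we treat it as a normalizing constant and obtain

$$P(\theta|D) \propto P(D|\theta)P(\theta)$$

We already know $P(D|\theta)$, so how should we express $P(\theta)$? Since the odds are 50:50, can we just write 0.5? No. Just as we expressed $P(D|\theta)$ with a binomial distribution, we will express $P(\theta)$ through some distribution.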

For this, the Bayesian approach in the lecture uses the Beta distribution. The Beta distribution is confined to the range 0 to 1, so it is well suited to describing a quantity such as $\theta$ that is itself a probability.

The Beta distribution can be written with the following PDF (Probability Density Function).

$$P(\theta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)},\quad B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)},\quad \Gamma(\alpha)=(\alpha-1)!\ \text{for positive integer } \alpha$$

That is, the parameters that define the Beta distribution are $\alpha$ and $\beta$.
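To get a feel for how $\alpha$ and $\beta$ shape the prior, here is a minimal Python sketch of the Beta PDF. The function names `beta_pdf` and `integrate` and the parameter values are my own illustrative choices, not something from the lecture.

```python
import math

def beta_pdf(theta, alpha, beta):
    """Density of Beta(alpha, beta) at theta, for 0 < theta < 1."""
    # B(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)
    norm = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1) / norm

def integrate(f, n=10_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) for i in range(n)) * h

# Beta(1, 1) is flat: every theta is equally plausible.
flat = beta_pdf(0.3, 1, 1)

# Beta(5, 5) is symmetric and peaked at 0.5: a "probably 50:50" belief.
peak = beta_pdf(0.5, 5, 5)

# A PDF should integrate to (approximately) 1 over [0, 1].
area = integrate(lambda t: beta_pdf(t, 5, 5))

print(flat, peak, area)
```

Larger equal values of $\alpha$ and $\beta$ concentrate the density more tightly around 0.5, which is one way to encode a stronger 50:50 belief.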

Using this, Bayes' theorem can be rearranged as

$$P(\theta|D) \propto P(D|\theta)P(\theta) \propto \theta^{a_H}(1-\theta)^{a_T}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{a_H+\alpha-1}(1-\theta)^{a_T+\beta-1}$$

Here $B(\alpha,\beta)$ is a constant that does not depend on $\theta$, so it can be absorbed into $\propto$.
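As a side note that the lecture notes do not spell out: the expression $\theta^{a_H+\alpha-1}(1-\theta)^{a_T+\beta-1}$ has the same form as a Beta PDF, so restoring the normalizing constant gives

$$P(\theta|D)=\frac{\theta^{a_H+\alpha-1}(1-\theta)^{a_T+\beta-1}}{B(a_H+\alpha,\ a_T+\beta)}$$

In other words, with a Beta prior and this likelihood the posterior is again a Beta distribution, with parameters $a_H+\alpha$ and $a_T+\beta$. The observed counts simply get added to the prior parameters.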

Now let's use this to find the most probable $\theta$.

Maximum a Posteriori Estimation

In MLE, we found $\theta$ from $\hat{\theta}=\arg\max_\theta P(D|\theta)$.

$P(D|\theta) = \theta^{a_H}(1-\theta)^{a_T}$
$\hat{\theta}=\frac{a_H}{a_T+a_H}$

This time, in MAP, we use $\hat{\theta}=\arg\max_\theta P(\theta|D)$; that is, we maximize the posterior rather than the likelihood.

$P(\theta|D) \propto \theta^{a_H+\alpha-1}(1-\theta)^{a_T+\beta-1}$
$\hat{\theta} =\frac{a_H+\alpha-1}{a_H+\alpha+a_T+\beta-2}$

The procedure for obtaining $\hat\theta$, differentiating and finding the extremum, is the same as in MLE, but the viewpoint is different: the result maximizes the posterior, not the likelihood.
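The differentiation step can be sketched as follows. This is my own write-up of the standard argument, assuming $a_H+\alpha>1$ and $a_T+\beta>1$ so that the maximum lies strictly inside $(0,1)$. Taking the log of the unnormalized posterior (the log does not move the maximizer):

$$\ln P(\theta|D) = (a_H+\alpha-1)\ln\theta + (a_T+\beta-1)\ln(1-\theta) + \text{const}$$

Setting the derivative with respect to $\theta$ to zero:

$$\frac{a_H+\alpha-1}{\theta} - \frac{a_T+\beta-1}{1-\theta} = 0$$

$$(a_H+\alpha-1)(1-\theta) = (a_T+\beta-1)\,\theta \quad\Rightarrow\quad \hat{\theta} =\frac{a_H+\alpha-1}{a_H+\alpha+a_T+\beta-2}$$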

In other words, by adjusting $\alpha$ and $\beta$ according to our prior knowledge, we can obtain a $\hat{\theta}$ that takes prior information into account.
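As a quick illustration, the sketch below plugs the thumbtack data (3 heads, 2 tails) into the two closed-form estimates. The function names and the specific prior values (1, 1), (2, 2), and (50, 50) are hypothetical choices of mine to show a flat, a weak, and a strong 50:50 prior.

```python
def mle(a_H, a_T):
    """Maximum likelihood estimate: a_H / (a_H + a_T)."""
    return a_H / (a_H + a_T)

def map_estimate(a_H, a_T, alpha, beta):
    """MAP estimate under a Beta(alpha, beta) prior."""
    return (a_H + alpha - 1) / (a_H + alpha + a_T + beta - 2)

a_H, a_T = 3, 2  # thumbtack data from the previous lesson

theta_mle = mle(a_H, a_T)                        # 3/5
theta_flat = map_estimate(a_H, a_T, 1, 1)        # flat prior: same as MLE
theta_weak = map_estimate(a_H, a_T, 2, 2)        # 4/7, pulled slightly toward 0.5
theta_strong = map_estimate(a_H, a_T, 50, 50)    # 52/103, pulled strongly toward 0.5

print(theta_mle, theta_flat, theta_weak, theta_strong)
```

With $\alpha=\beta=1$ the prior is flat and MAP reduces to MLE; the stronger the 50:50 prior, the closer $\hat{\theta}$ moves to 0.5.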

MLE and MAP

Conclusion from Anecdote

$$\text{MLE: } \hat{\theta}=\frac{a_H}{a_T+a_H}$$

$$\text{MAP: } \hat{\theta} =\frac{a_H+\alpha-1}{a_H+\alpha+a_T+\beta-2}$$

So are MLE and MAP ultimately different? Don't both aim to find the most probable $\theta$?

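In fact, the two agree once there is enough data: as $a_H$ and $a_T$ grow, the influence of $\alpha$ and $\beta$ shrinks. But when $a_H$ and $a_T$ are small, the prior information plays an important role. So how should the prior parameters $\alpha$ and $\beta$ be chosen?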

When there are few observations, MLE and MAP can give different values, and with MAP a poorly chosen prior can lead to a poor result.
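Both points can be checked numerically. The sketch below keeps the observed ratio at 3:2 while scaling up the number of tosses, under a hypothetical strong Beta(50, 50) prior, and then tries a deliberately skewed Beta(2, 20) prior on the original 5 tosses. All numbers here are illustrative assumptions of mine, not values from the lecture.

```python
def mle(a_H, a_T):
    """Maximum likelihood estimate: a_H / (a_H + a_T)."""
    return a_H / (a_H + a_T)

def map_estimate(a_H, a_T, alpha, beta):
    """MAP estimate under a Beta(alpha, beta) prior."""
    return (a_H + alpha - 1) / (a_H + alpha + a_T + beta - 2)

alpha, beta = 50, 50  # hypothetical strong 50:50 prior

# Same 3:2 ratio of heads to tails, with more and more tosses.
gaps = []
for scale in (1, 10, 100, 1000):
    a_H, a_T = 3 * scale, 2 * scale
    gap = abs(mle(a_H, a_T) - map_estimate(a_H, a_T, alpha, beta))
    gaps.append(gap)
    print(f"tosses={a_H + a_T:5d}  |MLE - MAP| = {gap:.4f}")

# A badly chosen prior on only 5 tosses drags the estimate far from 0.6.
skewed = map_estimate(3, 2, 2, 20)
print(f"MAP with Beta(2, 20) prior on 5 tosses: {skewed:.2f}")
```

The gap between the two estimates shrinks as the counts grow, while the skewed prior dominates the small sample.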
