CS236 Lecture 6

JInwoo · January 8, 2025

Intractable Posteriors

As we saw in the previous lecture, the ELBO equals the log-likelihood $\log p_{\theta}(\mathbf{x})$ only when $q(\mathbf{z})=p(\mathbf{z|x};\theta)$. So if we could compute the posterior, we could compute the likelihood as well. However, just like the likelihood, the posterior is intractable in most cases. The alternative is to approximate the posterior: if we choose $\phi$ so that $q(\mathbf{z};\phi)$ is as close as possible to $p(\mathbf{z|x};\theta)$, we get a good approximation of the posterior. This is called variational inference.
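To make the tightness condition concrete, here is a minimal sketch on a made-up discrete model. The joint probabilities in `joint` are hypothetical numbers chosen only for illustration, and the ELBO is written as $\sum_{\mathbf{z}} q(\mathbf{z})\log p(\mathbf{z,x}) + H(q)$. The gap between $\log p(\mathbf{x})$ and the ELBO should be positive for an arbitrary $q$ and vanish when $q$ is the true posterior.

```python
import math

# Hypothetical toy model: binary latent z and one fixed observation x.
# joint[z] = p(z, x; theta) for that x; the numbers are made up for illustration.
joint = {0: 0.10, 1: 0.30}

p_x = sum(joint.values())                          # p(x) = sum_z p(z, x)
log_px = math.log(p_x)
posterior = {z: p / p_x for z, p in joint.items()}  # p(z | x)

def elbo(q):
    """sum_z q(z) log p(z, x) + H(q), skipping terms with q(z) = 0."""
    return sum(q[z] * (math.log(joint[z]) - math.log(q[z])) for z in q if q[z] > 0)

q_uniform = {0: 0.5, 1: 0.5}
gap_uniform = log_px - elbo(q_uniform)    # equals KL(q || posterior), so it is positive
gap_posterior = log_px - elbo(posterior)  # zero up to floating-point rounding

print(f"log p(x)             = {log_px:.4f}")
print(f"gap with uniform q   = {gap_uniform:.4f}")
print(f"gap with q=posterior = {gap_posterior:.2e}")
```

In a model this small the posterior can be enumerated directly. The point of variational inference is the case where that enumeration is not feasible.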

The Evidence Lower bound applied to the entire dataset

The ELBO written with $q(\mathbf{z};\phi)$ is:

$\underset{\mathbf{z}}{\sum}q(\mathbf{z};\phi)\log p(\mathbf{z},\mathbf{x};\theta) + H(q(\mathbf{z};\phi))=\mathcal{L}(\mathbf{x};\theta,\phi)$

Applying this bound to every data point in maximum likelihood estimation (MLE) gives:

$\underset{\theta}{\max}\underset{\mathbf{x}^i\in\mathcal{D}}{\sum}\log p(\mathbf{x}^i;\theta)\ge\underset{\theta,\phi^1,\cdots,\phi^M}{\max}\underset{\mathbf{x}^i\in\mathcal{D}}{\sum}\mathcal{L}(\mathbf{x}^i;\theta,\phi^i)$

The thing to notice in this expression is $\phi^i$. The variational parameters $\phi$ are different for every data point $\mathbf{x}^i$, because the true posterior $p(\mathbf{z|x}^i;\theta)$ is different for every data point.
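A small sketch of why each data point needs its own $\phi^i$. It assumes a toy model that is not from the lecture: $p(z)=\mathcal{N}(0,1)$, $p(x|z)=\mathcal{N}(z,1)$ and $q(z;\phi)=\mathcal{N}(\mu,\sigma^2)$, for which the ELBO has a closed form. A grid search over $\mu$ then finds the best variational mean separately for each $x^i$.

```python
import math

def elbo(x, mu, sigma):
    """Exact ELBO for the assumed model p(z)=N(0,1), p(x|z)=N(z,1), q=N(mu, sigma^2)."""
    e_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (mu**2 + sigma**2)
    e_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - mu) ** 2 + sigma**2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * sigma**2)
    return e_log_prior + e_log_lik + entropy

mu_grid = [i / 100 for i in range(-300, 301)]  # candidate means in [-3, 3]
sigma = math.sqrt(0.5)                         # posterior std in this toy model

dataset = [-2.0, 0.0, 3.0]
best_mu = [max(mu_grid, key=lambda m: elbo(x, m, sigma)) for x in dataset]
print(best_mu)
```

In this toy model the true posterior is $\mathcal{N}(x/2, 1/2)$, so the best mean lands at $x^i/2$ and is different for each data point.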

Learning Deep Generative Models

Consider a training scenario based on gradient descent. First, the ELBO can be rewritten as follows.

$\mathcal{L}(\mathbf{x}^i;\theta,\phi^i)=\underset{\mathbf{z}}{\sum}q(\mathbf{z};\phi^i)\log p(\mathbf{z},\mathbf{x}^i;\theta) + H(q(\mathbf{z};\phi^i))$
$\qquad\qquad\quad=E_{q(\mathbf{z};\phi^i)}[\log p(\mathbf{z,x}^i;\theta)-\log q(\mathbf{z};\phi^i)]$

To update $\theta$ and $\phi^i$ we need the gradient with respect to each of them. These gradients are generally hard to obtain in closed form, so we use Monte Carlo sampling.

$E_{q(\mathbf{z};\phi)}[\log p(\mathbf{z,x};\theta)-\log q(\mathbf{z};\phi)]\approx\frac{1}{K}\underset{k}{\sum}\left[\log p(\mathbf{z}^k, \mathbf{x};\theta) -\log q(\mathbf{z}^{k};\phi)\right]$, with $\mathbf{z}^k\sim q(\mathbf{z};\phi)$ (the index $i$ is dropped here to keep the expression compact)
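The Monte Carlo estimate above can be sketched in a few lines. This assumes a toy model chosen for illustration, $p(z)=\mathcal{N}(0,1)$ and $p(x|z)=\mathcal{N}(z,1)$, so that the exact marginal $p(x)=\mathcal{N}(0,2)$ is available for comparison. The helper names `log_normal` and `elbo_mc` are hypothetical.

```python
import math
import random

def log_normal(v, mean, std):
    """Log-density of N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2

def elbo_mc(x, mu, sigma, K, rng):
    """(1/K) sum_k [log p(z^k, x) - log q(z^k; phi)] with z^k ~ q = N(mu, sigma^2)."""
    total = 0.0
    for _ in range(K):
        z = rng.gauss(mu, sigma)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z) + log p(x|z)
        total += log_joint - log_normal(z, mu, sigma)
    return total / K

rng = random.Random(0)
x = 1.0
log_px = log_normal(x, 0.0, math.sqrt(2.0))  # exact marginal: x ~ N(0, 2) in this toy model

loose = elbo_mc(x, mu=0.0, sigma=1.0, K=20000, rng=rng)            # q far from the posterior
tight = elbo_mc(x, mu=x / 2, sigma=math.sqrt(0.5), K=10, rng=rng)  # q equal to the posterior

print(f"log p(x) = {log_px:.4f}, loose ELBO = {loose:.4f}, tight ELBO = {tight:.4f}")
```

When $q$ equals the true posterior, every sample contributes exactly $\log p(x)$, so even a handful of samples gives the exact value. With a poorer $q$ the estimate sits below $\log p(x)$ by roughly the KL divergence between $q$ and the posterior.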

The gradients we want from this expression are $\nabla_{\theta}\mathcal{L}(\mathbf{x};\theta,\phi)$ and $\nabla_{\phi}\mathcal{L}(\mathbf{x};\theta,\phi)$. The first, $\nabla_{\theta}\mathcal{L}(\mathbf{x};\theta,\phi)$, is easy to obtain:

$\nabla_{\theta}E_{q(\mathbf{z};\phi)}[\log p(\mathbf{z ,x};\theta) -\log q(\mathbf{z};\phi)]=E_{q(\mathbf{z};\phi)}[\nabla_{\theta}\log p(\mathbf{z ,x};\theta)]\approx\frac{1}{K}\underset{k}{\sum}\nabla_{\theta}\log p(\mathbf{z}^k,\mathbf{x};\theta)$ (with $K$ samples of $\mathbf{z}$ drawn from $q$)
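As a rough illustration of the $\theta$ gradient, the sketch below assumes a hypothetical one-parameter decoder $p(x|z;\theta)=\mathcal{N}(\theta z, 1)$ with prior $p(z)=\mathcal{N}(0,1)$. For that model $\nabla_\theta\log p(z,x;\theta)=(x-\theta z)z$, and the expectation under $q=\mathcal{N}(\mu,\sigma^2)$ has a closed form to compare against.

```python
import random

def grad_theta_mc(x, theta, mu, sigma, K, rng):
    """(1/K) sum_k d/dtheta log p(z^k, x; theta), with z^k ~ q = N(mu, sigma^2)."""
    total = 0.0
    for _ in range(K):
        z = rng.gauss(mu, sigma)      # sampling from q does not involve theta
        total += (x - theta * z) * z  # d/dtheta log N(x; theta * z, 1)
    return total / K

rng = random.Random(0)
x, theta, mu, sigma = 1.5, 0.8, 0.5, 0.7
estimate = grad_theta_mc(x, theta, mu, sigma, K=50000, rng=rng)
exact = x * mu - theta * (mu**2 + sigma**2)  # E_q[(x - theta z) z] in closed form

print(f"Monte Carlo: {estimate:.4f}, closed form: {exact:.4f}")
```

The gradient moves inside the expectation here only because the sampling distribution $q(\mathbf{z};\phi)$ does not depend on $\theta$.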

However, $\nabla_{\phi}\mathcal{L}(\mathbf{x};\theta,\phi)$ is harder to obtain, because the expectation is taken with respect to $q(\mathbf{z};\phi)$, which itself depends on $\phi$.

$\nabla_{\phi}E_{q(\mathbf{z};\phi)}[\log p(\mathbf{z, x};\theta) - \log q(\mathbf{z};\phi)] \ne E_{q(\mathbf{z};\phi)}[\nabla_{\phi}(\log p(\mathbf{z, x};\theta) - \log q(\mathbf{z};\phi))]$

So we need a different way to approximate this gradient.
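A one-dimensional worked example (my own illustration, not from the lecture) shows how badly the naive swap can fail. Take $q(z;\phi)=\mathcal{N}(\phi,1)$ and the integrand $r(z)=z^2$:

$E_{q(z;\phi)}[z^2]=\phi^2+1\ \Rightarrow\ \nabla_{\phi}E_{q(z;\phi)}[z^2]=2\phi$
$E_{q(z;\phi)}[\nabla_{\phi}z^2]=0$

The integrand has no explicit dependence on $\phi$, so differentiating inside the expectation gives zero, while the true gradient is $2\phi$. All of the dependence on $\phi$ enters through the sampling distribution, which is exactly the part the naive swap ignores.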

Reparameterization

By transforming $\mathbf{z}$ appropriately, we can find a way to approximate the gradient with respect to $\phi$. First assume $q(\mathbf{z};\phi)=\mathcal{N}(\mu,\sigma^2I)$. Then there are two equivalent ways to sample:

Sample $\mathbf{z}\sim q(\mathbf{z};\phi)$ , $\phi=(\mu,\sigma)$
Sample $\epsilon\sim\mathcal{N}(0,I),\mathbf{z}=\mu+\sigma\epsilon=g(\epsilon;\phi)$ (shift and rescale $\epsilon$ to obtain $\mathbf{z}$)

Here $g$ is a deterministic function. The expectation we saw earlier can therefore be written as:

$E_{\mathbf{z}\sim q(\mathbf{z};\phi)}[r(\mathbf{z})]=\int q(\mathbf{z};\phi)r(\mathbf{z})d\mathbf{z}=E_{\epsilon\sim\mathcal{N}(0,I)}[r(g(\epsilon;\phi))]$
$\nabla_{\phi}E_{\mathbf{z}\sim q(\mathbf{z};\phi)}[r(\mathbf{z})]=\nabla_{\phi}E_{\epsilon}[r(g(\epsilon;\phi))]=E_{\epsilon}[\nabla_{\phi}r(g(\epsilon;\phi))]$

So, unlike before, $\nabla_{\phi}$ can now be approximated easily with Monte Carlo:

$E_{\epsilon}[\nabla_{\phi}r(g(\epsilon;\phi))]\approx\frac{1}{K}\underset{k}{\sum}\nabla_{\phi}r(g(\epsilon^k;\phi))$ (with $K$ samples of $\epsilon$ drawn from $\mathcal{N}(0, I)$)
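Here is a minimal sketch of this estimator for a scalar $z$ and the illustrative choice $r(z)=z^2$ (an assumption picked because $E[z^2]=\mu^2+\sigma^2$, so the exact gradient $(2\mu, 2\sigma)$ is known). The chain rule is applied by hand through $z=\mu+\sigma\epsilon$.

```python
import random

def reparam_grad(mu, sigma, K, rng):
    """Estimate the gradient of E_{z~N(mu, sigma^2)}[z^2] w.r.t. (mu, sigma)."""
    g_mu = g_sigma = 0.0
    for _ in range(K):
        eps = rng.gauss(0.0, 1.0)  # eps^k ~ N(0, 1), independent of phi
        z = mu + sigma * eps       # z = g(eps; phi)
        dr_dz = 2.0 * z            # derivative of r(z) = z^2
        g_mu += dr_dz * 1.0        # dz/dmu = 1
        g_sigma += dr_dz * eps     # dz/dsigma = eps
    return g_mu / K, g_sigma / K

rng = random.Random(0)
mu, sigma = 1.0, 0.5
g_mu, g_sigma = reparam_grad(mu, sigma, K=50000, rng=rng)
exact = (2 * mu, 2 * sigma)

print(f"estimate: ({g_mu:.3f}, {g_sigma:.3f}), exact: {exact}")
```

In practice an automatic differentiation library would carry out the chain rule through $g$; the manual derivatives here are only meant to show what it computes.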

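Returning to the original objective $\mathcal{L}(\mathbf{x};\theta,\phi)$, note that the function inside the expectation is not $r(\mathbf{z})$ but $r(\mathbf{z},\phi)$, because the $\log q(\mathbf{z};\phi)$ term depends on $\phi$ directly.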

$E_{q(\mathbf{z};\phi)}[\log p(\mathbf{z, x};\theta) - \log q(\mathbf{z};\phi)]=E_{q(\mathbf{z};\phi)}[r(\mathbf{z},\phi)]$

This is slightly more complicated, but using the chain rule we can still approximate the gradient as easily as before.

$E_{q(\mathbf{z};\phi)}[r(\mathbf{z},\phi)]=E_{\epsilon}[r(g(\epsilon;\phi),\phi)]\approx\frac{1}{K}\underset{k}{\sum}r(g(\epsilon^k;\phi),\phi),\ \mathbf{z}=\mu+\sigma\epsilon=g(\epsilon;\phi)$
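The sketch below applies this to the full ELBO for a single data point, under an assumed toy model $p(z)=\mathcal{N}(0,1)$, $p(x|z)=\mathcal{N}(z,1)$, $q(z;\phi)=\mathcal{N}(\mu,\sigma^2)$. Parameterizing with $\log\sigma$ instead of $\sigma$ is my own choice to keep $\sigma$ positive during the updates. In this model the true posterior is $\mathcal{N}(x/2, 1/2)$, so gradient ascent on $\phi$ should end up near it.

```python
import math
import random

def elbo_grad_phi(x, mu, log_sigma, K, rng):
    """Reparameterized gradient of the ELBO w.r.t. (mu, log_sigma).

    With z = mu + sigma * eps, the integrand becomes
      r(g(eps; phi), phi) = log p(z, x) - log q(z; phi)
                          = log p(z, x) + log(sigma) + eps^2 / 2 + const.
    """
    sigma = math.exp(log_sigma)
    g_mu = g_ls = 0.0
    for _ in range(K):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        dlogp_dz = -z + (x - z)               # d/dz [log p(z) + log p(x|z)]
        g_mu += dlogp_dz                      # dz/dmu = 1
        g_ls += dlogp_dz * sigma * eps + 1.0  # dz/dlog_sigma = sigma * eps, plus d(log sigma)/dlog_sigma
    return g_mu / K, g_ls / K

rng = random.Random(0)
x, mu, log_sigma, lr = 2.0, 0.0, 0.0, 0.02
for _ in range(2000):
    g_mu, g_ls = elbo_grad_phi(x, mu, log_sigma, K=32, rng=rng)
    mu += lr * g_mu          # ascent, since the ELBO is maximized
    log_sigma += lr * g_ls
sigma = math.exp(log_sigma)

print(f"learned q: mean {mu:.3f}, std {sigma:.3f}")
print(f"posterior: mean {x / 2:.3f}, std {math.sqrt(0.5):.3f}")
```

The `+ 1.0` term is the direct dependence of $r(\mathbf{z},\phi)$ on $\phi$ through $-\log q$. Dropping it would give the gradient of $E_q[\log p(\mathbf{z,x})]$ alone and $\sigma$ would collapse toward zero.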

Amortized Inference

As explained above, the variational parameters $\phi$ are different for each data point $\mathbf{x}^i$. As the dataset grows, learning a separate set of variational parameters for every data point becomes impractical. Amortization addresses this problem.

Instead of learning all the variational parameters during training, we learn a single parametric function $f_\lambda$ that maps $\mathbf{x}^i$ to $\phi^i$.

$f_\lambda:\mathbf{x}^i\mapsto\phi^i$

The approximate posterior can then be written as $q(\mathbf{z};f_\lambda(\mathbf{x}^i))$, which is usually denoted $q_\phi(\mathbf{z|x})$. By learning $f_\lambda$ (typically a neural network), we can approximate the posterior cheaply for any $\mathbf{x}$. This is called amortized inference.
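A minimal sketch of amortization, assuming the toy model $p(z)=\mathcal{N}(0,1)$, $p(x|z)=\mathcal{N}(z,1)$ and a hypothetical linear encoder $f_\lambda(x)=(\mu,\log\sigma)=(ax+b,\ c)$ in place of a neural network. The true posterior in this model is $\mathcal{N}(x/2, 1/2)$, so a single shared $\lambda=(a,b,c)$ can represent the optimal $\phi^i$ for every data point.

```python
import math
import random

rng = random.Random(0)

# Synthetic dataset drawn from the assumed model: z ~ N(0, 1), x = z + N(0, 1) noise.
data = [rng.gauss(0.0, 1.0) + rng.gauss(0.0, 1.0) for _ in range(500)]

# Hypothetical encoder f_lambda(x) = (mu, log_sigma) = (a * x + b, c).
a, b, c = 0.0, 0.0, 0.0
lr, batch_size = 0.01, 32

for _ in range(3000):
    batch = rng.sample(data, batch_size)
    g_a = g_b = g_c = 0.0
    for x in batch:
        mu, sigma = a * x + b, math.exp(c)   # phi^i = f_lambda(x^i)
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps                 # reparameterized sample from q(z; f_lambda(x))
        dlogp_dz = -z + (x - z)              # d/dz [log p(z) + log p(x|z)]
        g_a += dlogp_dz * x                  # chain rule through mu = a * x + b
        g_b += dlogp_dz
        g_c += dlogp_dz * sigma * eps + 1.0  # through sigma = exp(c), plus the -log q term
    a += lr * g_a / batch_size               # gradient ascent on the ELBO
    b += lr * g_b / batch_size
    c += lr * g_c / batch_size

sigma = math.exp(c)
print(f"a = {a:.3f}, b = {b:.3f}, sigma = {sigma:.3f}")
```

Only three numbers are learned regardless of the dataset size, and a new $x$ gets its variational parameters from one evaluation of $f_\lambda$ with no per-point optimization.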

Autoencoder Perspective

Putting everything above together gives the following objective.

$\mathcal{L}(\mathbf{x};\theta,\phi)=E_{q_\phi(\mathbf{z|x})}[\log p(\mathbf{z,x};\theta) -\log q_\phi(\mathbf{z|x})]$

Adding and subtracting $\log p(\mathbf{z})$ inside the expectation gives the following form.

$E_{q_\phi(\mathbf{z|x})}[\log p(\mathbf{z,x};\theta)-\log p(\mathbf{z})+\log p(\mathbf{z}) -\log q_\phi(\mathbf{z|x})]$
$= E_{q_\phi(\mathbf{z|x})}[\log p(\mathbf{x|z};\theta)]-D_{KL}(q_\phi(\mathbf{z|x})||p(\mathbf{z}))$

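The first term pushes the $\mathbf{x}$ decoded from $\mathbf{z}\sim q_\phi(\mathbf{z|x})$ to match the actual $\mathbf{x}$ as closely as possible, so it can be viewed as a reconstruction loss. The second term pushes the $\mathbf{z}$ sampled from $q_\phi$ to resemble the prior $p(\mathbf{z})$ as closely as possible, so it can be viewed as a regularization loss.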
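As a sanity check on the decomposition, the sketch below evaluates both forms of the ELBO in closed form for an assumed toy model, $p(z)=\mathcal{N}(0,1)$, $p(x|z)=\mathcal{N}(z,1)$, $q=\mathcal{N}(\mu,\sigma^2)$, where $\mu$ and $\sigma$ stand in for the encoder output at one $x$. The KL term uses the standard closed form for a Gaussian against a standard normal prior.

```python
import math

def elbo_joint_form(x, mu, sigma):
    """E_q[log p(z, x)] + H(q), i.e. E_q[log p(z, x) - log q(z)]."""
    e_log_joint = (-math.log(2 * math.pi)
                   - 0.5 * (mu**2 + sigma**2)
                   - 0.5 * ((x - mu) ** 2 + sigma**2))
    entropy = 0.5 * math.log(2 * math.pi * math.e * sigma**2)
    return e_log_joint + entropy

def reconstruction(x, mu, sigma):
    """E_q[log p(x|z)] in closed form for p(x|z) = N(z, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * ((x - mu) ** 2 + sigma**2)

def kl_to_prior(mu, sigma):
    """D_KL(N(mu, sigma^2) || N(0, 1))."""
    return 0.5 * (mu**2 + sigma**2 - 1.0) - math.log(sigma)

x, mu, sigma = 1.0, 0.3, 0.8
lhs = elbo_joint_form(x, mu, sigma)
rhs = reconstruction(x, mu, sigma) - kl_to_prior(mu, sigma)

print(f"joint form: {lhs:.6f}")
print(f"reconstruction - KL: {rhs:.6f}")
```

With a neural-network decoder the reconstruction term has no closed form and is estimated with reparameterized samples, while the KL term can often still be computed analytically as it is here.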