CS236 Lecture 4

JInwoo · December 2, 2024

Learning A Generative Model

Setting

$P_{data}$: the underlying distribution over the domain. It is unknown.
$\mathcal{D}$: a dataset sampled from $P_{data}$.
IID: every sample in the dataset is independent and identically distributed.

Goal of Learning

The goal of learning is to use the given dataset to find a model $P_{\theta}$ that approximates $P_{data}$ well.

Finding a $P_{\theta}$ that is identical to $P_{data}$ is not realistic. The given dataset is only a finite sample from the underlying distribution, and each sample is usually high-dimensional, so the dataset always covers the space only sparsely.
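As a rough illustration of the sparse-coverage argument, the sketch below counts the possible binary images of a fixed size and compares that with a dataset size. The 28x28 image size and the 60,000-sample dataset are assumed numbers chosen for illustration, not values from the lecture notes.

```python
# Assumed setting: binary images with 28 x 28 pixels (illustrative only).
n_pixels = 28 * 28
n_configs = 2 ** n_pixels        # number of distinct binary images

# Assumed dataset size (illustrative only).
dataset_size = 60_000

# Fraction of all possible images that the dataset could cover at most.
fraction = dataset_size / n_configs

print("digits in n_configs:", len(str(n_configs)))
print("max coverage fraction:", fraction)
```

Even under these modest assumptions the dataset can touch only a vanishing fraction of the space, which is why the model has to generalize rather than memorize.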

What is Best?

To decide which $P_{\theta}$ is a good approximation, we need a criterion for evaluating the model. From the generative modeling (density estimation) point of view, a natural criterion is how similar the model distribution is to $P_{data}$. In other words, the model is evaluated by how small the distance between the two distributions is.

Minimize $d(P_{data}, P_{\theta})$

KL Divergence

KL divergence can be used as a measure of the distance between two distributions.

$D(p||q)=\underset{\mathbf{x}}{\sum}p(\mathbf{x})\log\frac{p(\mathbf{x})}{q(\mathbf{x})} \ge 0$

KL divergence has the following properties.

It is 0 if and only if $p$ and $q$ are identical.
Asymmetric: in general, $D(p||q)\ne D(q||p)$
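The properties above can be checked numerically. The following is a minimal sketch for discrete distributions; the function name `kl_divergence` and the two example distributions are made up for illustration, and the code assumes $q(\mathbf{x})>0$ wherever $p(\mathbf{x})>0$.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions.

    Assumes q(x) > 0 wherever p(x) > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two hypothetical distributions over three outcomes.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.8, 0.1, 0.1])

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
d_pp = kl_divergence(p, p)

print("D(p||q) =", d_pq)
print("D(q||p) =", d_qp)
print("D(p||p) =", d_pp)
```

Both directions are positive but differ from each other, and the divergence of a distribution from itself is zero.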

Maximum Likelihood

Using KL divergence to measure the distance between $P_{data}$ and $P_{\theta}$ gives the following expression.

$D(P_{data}||P_{\theta})=E_{\mathbf{x}\sim P_{data}}[\log(\frac{P_{data}(\mathbf{x})}{P_{\theta}(\mathbf{x})})]=E_{\mathbf{x}\sim P_{data}}[\log P_{data}(\mathbf{x})]-E_{\mathbf{x}\sim P_{data}}[\log P_{\theta}(\mathbf{x})]$

The first term of the final expression does not depend on $\theta$, so it plays no role in learning $\theta$ and can be ignored. What remains is the negative expected log-likelihood. Minimizing the KL divergence is therefore equivalent to maximizing the expected log-likelihood (maximum likelihood).

$\underset{P_{\theta}}{\argmin}D(P_{data}||P_{\theta})=\underset{P_{\theta}}{\argmin}-E_{\mathbf{x}\sim P_{data}}[\log P_{\theta}(\mathbf{x})]=\underset{P_{\theta}}{\argmax} E_{\mathbf{x}\sim P_{data}}[\log P_{\theta}(\mathbf{x})]$

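The biggest drawback of maximum likelihood is that it gives no way to measure how close the model has come to the true distribution $P_{data}$. Maximizing the likelihood tells us only that the model is moving closer to the true distribution. Because the dropped term $E_{\mathbf{x}\sim P_{data}}[\log P_{data}(\mathbf{x})]$ is unknown, the exact distance cannot be measured.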
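To make the equivalence concrete, here is a small sketch on a two-outcome distribution. It assumes $P_{data}$ is known, which is never the case in practice and is done here only so that both objectives can be computed. The values in `p_data` and the grid `thetas` are illustrative choices.

```python
import numpy as np

# Assumed known only for this demonstration: P_data(x=0) = 0.7, P_data(x=1) = 0.3.
p_data = np.array([0.7, 0.3])

# Candidate models P_theta with P_theta(x=1) = theta.
thetas = np.linspace(0.01, 0.99, 99)

def model(theta):
    return np.array([1.0 - theta, theta])

# KL divergence D(P_data || P_theta) for every candidate theta.
kl = np.array([np.sum(p_data * np.log(p_data / model(t))) for t in thetas])

# Expected log-likelihood E_{x ~ P_data}[log P_theta(x)] for every candidate theta.
exp_ll = np.array([np.sum(p_data * np.log(model(t))) for t in thetas])

best_by_kl = thetas[np.argmin(kl)]
best_by_ll = thetas[np.argmax(exp_ll)]

# kl + exp_ll equals E[log P_data(x)], which does not depend on theta.
gap = kl + exp_ll

print("argmin KL:", best_by_kl)
print("argmax expected log-likelihood:", best_by_ll)
print("constant offset:", gap[0])
```

Both criteria select the same $\theta$. The two curves differ only by the constant `gap`, and that constant is exactly the quantity that cannot be computed when $P_{data}$ is unknown.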

Approximation

Optimizing the objective above requires taking an expectation under $P_{data}$, but $P_{data}$ is generally unknown. The expected log-likelihood is therefore approximated by the empirical log-likelihood computed on the dataset.

$E_{\mathcal{D}}[\log P_{\theta}(\mathbf{x})]=\frac{1}{|\mathcal{D}|}\underset{\mathbf{x}\in\mathcal{D}}{\sum}\log P_{\theta}(\mathbf{x})$
maximum likelihood learning = $\underset{P_{\theta}}{\max}\frac{1}{|\mathcal{D}|}\underset{\mathbf{x}\in\mathcal{D}}{\sum}\log P_{\theta}(\mathbf{x})$
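The empirical objective can be sketched for the simplest possible model, a single Bernoulli parameter. The sampling probability 0.3, the dataset size, and the grid search are assumptions made for illustration. For this model the maximizer is also known in closed form (the sample mean), which gives a check on the grid search.

```python
import numpy as np

rng = np.random.default_rng(0)

# D: IID samples from a P_data that the learner does not get to see.
data = rng.binomial(1, 0.3, size=1000)

def avg_log_likelihood(theta, data):
    """(1 / |D|) * sum_{x in D} log P_theta(x) for a Bernoulli model."""
    return np.mean(data * np.log(theta) + (1 - data) * np.log(1 - theta))

# Maximize the empirical log-likelihood over a grid of candidate thetas.
thetas = np.linspace(0.001, 0.999, 999)
scores = np.array([avg_log_likelihood(t, data) for t in thetas])
theta_grid = thetas[np.argmax(scores)]

# For a Bernoulli model the maximizer is the sample mean.
theta_closed_form = data.mean()

print("grid-search MLE:", theta_grid)
print("closed-form MLE:", theta_closed_form)
```

The estimate depends only on the dataset, not on $P_{data}$ itself, and it approaches the true parameter as the dataset grows.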

Monte Carlo Estimation

The approximation above is based on Monte Carlo estimation: the true expectation is estimated by the average of $g$ over samples drawn from the distribution.

$E_{\mathbf{x}\sim P}[g(\mathbf{x})]\simeq\frac{1}{T}\underset{t=1}{\overset{T}{\sum}}g(\mathbf{x}^t)\overset{\mathrm{def}}{=}\hat{g}(\mathbf{x}^1,\cdots,\mathbf{x}^T)$

Because $\mathbf{x}^1,\cdots,\mathbf{x}^T$ are random variables, $\hat{g}(\mathbf{x}^1,\cdots,\mathbf{x}^T)$ is itself a random variable.

Monte Carlo estimation has the following properties.

Unbiased: $E_P[\hat{g}]=E_P[g(\mathbf{x})]$
Convergence: by the law of large numbers, $\hat{g}\rightarrow E_P[g(\mathbf{x})]\ \mathrm{for}\ T\rightarrow \infty$
Variance: $V_P[\hat{g}]=V_P[\frac{1}{T}\underset{t=1}{\overset{T}{\sum}}g(\mathbf{x}^t)]=\frac{V_P[g(\mathbf{x})]}{T}$
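These three properties can be observed in a small simulation. The choice $g(\mathbf{x})=x^2$ with $x\sim\mathcal{N}(0,1)$ is an assumption made for illustration, picked because $E[g]=1$ and $V[g]=2$ are known exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return x ** 2  # E[g(x)] = 1 and V[g(x)] = 2 when x ~ N(0, 1)

def mc_estimates(T, runs=5000):
    """Return `runs` independent Monte Carlo estimates, each averaging T samples."""
    x = rng.standard_normal((runs, T))
    return g(x).mean(axis=1)

est_small = mc_estimates(10)
est_large = mc_estimates(1000)

for T, est in ((10, est_small), (1000, est_large)):
    print(f"T={T}: mean of estimates={est.mean():.4f}, "
          f"variance of estimates={est.var():.5f}, V[g]/T={2.0 / T:.5f}")
```

The mean of the estimates stays near 1 for both values of $T$ (unbiasedness), while the variance of the estimates shrinks roughly in proportion to $1/T$.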

Extending the MLE Principle to Autoregressive Models

Applying MLE to an autoregressive model, for a dataset of $m$ samples with $n$ dimensions each, gives the following log-likelihood.

$\log L(\theta,\mathcal{D})=\underset{j=1}{\overset{m}{\sum}}\underset{i=1}{\overset{n}{\sum}}\log p_{neural}(\mathbf{x}_i^j|\mathbf{x}_{\lt i}^j, \theta_i)$

In general there is no closed-form solution for the maximizer of this expression, so the autoregressive model is trained with gradient descent, using the negative log-likelihood as the loss.

Empirical Risk and Overfitting
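The training procedure can be sketched with NumPy on a toy problem. Several things here are assumptions made for illustration: each conditional $p(\mathbf{x}_i|\mathbf{x}_{<i},\theta_i)$ is a logistic regression standing in for $p_{neural}$, the binary data is synthetic, and the learning rate and step count are arbitrary. Row $i$ of `W` together with `b[i]` plays the role of $\theta_i$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data (assumed): each bit copies the previous bit
# with probability 0.9, so x_i really does depend on x_{<i}.
m, n = 2000, 4
X = np.zeros((m, n))
X[:, 0] = rng.binomial(1, 0.5, size=m)
for i in range(1, n):
    flip = rng.binomial(1, 0.1, size=m)
    X[:, i] = np.abs(X[:, i - 1] - flip)

# Strictly lower-triangular mask: conditional i may only look at x_{<i}.
mask = np.tril(np.ones((n, n)), k=-1)
W = np.zeros((n, n))
b = np.zeros(n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_log_likelihood(W, b, X):
    """(1/m) * sum_j sum_i log p(x_i^j | x_{<i}^j, theta_i)."""
    P = sigmoid(X @ (W * mask).T + b)  # P[j, i] = p(x_i = 1 | x_{<i}) for sample j
    eps = 1e-12                        # guards against log(0)
    return np.mean(np.sum(X * np.log(P + eps) + (1 - X) * np.log(1 - P + eps), axis=1))

ll_before = avg_log_likelihood(W, b, X)

lr = 0.5
for step in range(500):
    P = sigmoid(X @ (W * mask).T + b)
    err = X - P                         # gradient of the log-likelihood w.r.t. the logits
    W += lr * ((err.T @ X) * mask) / m  # ascent on log-likelihood = descent on the NLL loss
    b += lr * err.mean(axis=0)

ll_after = avg_log_likelihood(W, b, X)
print("average log-likelihood before:", ll_before)
print("average log-likelihood after: ", ll_after)
```

Because the updates are masked, `W` stays strictly lower triangular throughout training, which is what keeps the model autoregressive. In a real setting the logistic conditionals would be replaced by neural networks and the full-batch update by minibatch stochastic gradient descent.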

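MLE overfits easily, so the hypothesis space is restricted to prevent overfitting (regularization). Regularization that is too strong limits the model's representational capacity and introduces bias (underfitting). Regularization that is too weak hurts generalization and introduces variance (overfitting). The two are in a trade-off, and striking the right balance is important.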
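Both failure modes can be seen in a small categorical example. Add-alpha smoothing is used here as a stand-in regularizer. It is an assumption chosen for illustration and is not a method named in these notes, as are the category count, the skewed true distribution, and the sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 50                                  # number of categories (assumed)
true_p = 1.0 / np.arange(1, K + 1)
true_p /= true_p.sum()                  # skewed "P_data", known only for this demo

train = rng.choice(K, size=20, p=true_p)      # small training set
heldout = rng.choice(K, size=5000, p=true_p)  # stands in for unseen data

def fit(train, alpha):
    """Add-alpha smoothed categorical estimate; alpha = 0 is plain MLE."""
    counts = np.bincount(train, minlength=K)
    return (counts + alpha) / (counts.sum() + alpha * K)

def avg_log_likelihood(p, data):
    with np.errstate(divide="ignore"):  # log(0) = -inf is the point of the demo
        return float(np.mean(np.log(p[data])))

results = {}
for alpha in (0.0, 1.0, 1e6):
    p = fit(train, alpha)
    results[alpha] = (avg_log_likelihood(p, train), avg_log_likelihood(p, heldout))
    print(f"alpha={alpha:g}: train={results[alpha][0]:.3f}, held-out={results[alpha][1]:.3f}")
```

With `alpha = 0` the model has the best training log-likelihood but assigns zero probability to categories it never saw, so the held-out log-likelihood is negative infinity. With a very large `alpha` the model collapses to the uniform distribution and ignores the data. A moderate `alpha` sits between the two extremes.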