CS236 Lecture 9

JInwoo · January 18, 2025

Towards Likelihood-free learning

Training by maximum likelihood is fast. However, a high likelihood does not always guarantee good sample quality. Conversely, good samples can come with a low likelihood (for example, when the model overfits).

Comparing Distributions via Samples

Let $S_1=\{\mathbf{x}\sim P\}$ and $S_2=\{\mathbf{x}\sim Q\}$ be samples drawn from two distributions $P$ and $Q$. A two-sample test can be used to compare the two distributions from the samples alone.

A two-sample test considers the following two hypotheses.

Null hypothesis: $H_0: P=Q$
Alternative hypothesis: $H_1: P\ne Q$

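The test computes a statistic $T$ from the two sample sets. If $T$ exceeds a threshold, $H_0$ is rejected; otherwise $H_0$ is retained as consistent with the observations. In other words, the test statistic $T$ lets us compare the two distributions using samples alone. The following example statistic compares the sample means of the two sets (a statistic on the variances can be defined in the same way).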

$T(S_1, S_2)=|\frac{1}{|S_1|}\sum_{\mathbf{x}\in S_1}\mathbf{x}-\frac{1}{|S_2|}\sum_{\mathbf{x}\in S_2}\mathbf{x}|$

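The key point about this test is that it is likelihood-free: it needs only samples from $P$ and $Q$, not their densities. Training the model to decrease the statistic $T$ between data samples and model samples therefore trains a generative model without ever evaluating a likelihood.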

In high dimensions (images, for example), however, a good two-sample objective is hard to find. The objective above, which compares only the mean and variance, is a poor choice there, because two distributions can be very different even when their means and variances match.
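The limitation already shows up in one dimension. The sketch below is only an illustration: the helper name `mean_difference`, the sample size, and the three comparison distributions are assumptions chosen for this example, not part of the lecture. It evaluates the mean-difference statistic on a Gaussian against (a) a second draw from the same Gaussian, (b) a shifted Gaussian, and (c) a fair coin on $\{-1,+1\}$, which has the same mean and variance as $N(0,1)$ but a completely different shape.

```python
import random
import statistics

def mean_difference(s1, s2):
    """Test statistic T(S1, S2) = |mean(S1) - mean(S2)| for scalar samples."""
    return abs(statistics.fmean(s1) - statistics.fmean(s2))

rng = random.Random(0)   # fixed seed so the run is reproducible
n = 20000

gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]         # S1 ~ P = N(0, 1)
gaussian_again = [rng.gauss(0.0, 1.0) for _ in range(n)]   # S2 ~ Q = P
shifted = [rng.gauss(1.0, 1.0) for _ in range(n)]          # S2 ~ Q = N(1, 1)
# Fair coin on {-1, +1}: mean 0 and variance 1, like N(0, 1), but not Gaussian at all.
two_point = [rng.choice((-1.0, 1.0)) for _ in range(n)]

t_same = mean_difference(gaussian, gaussian_again)   # small: P = Q
t_shifted = mean_difference(gaussian, shifted)       # large: the means differ
t_two_point = mean_difference(gaussian, two_point)   # small, although P != Q

print(f"T(same)      = {t_same:.3f}")
print(f"T(shifted)   = {t_shifted:.3f}")
print(f"T(two-point) = {t_two_point:.3f}")
```

The statistic separates the shifted Gaussian from the original but cannot tell the two-point distribution apart from it, which is the failure mode described above.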

As an alternative, we train a classifier (the discriminator) that tries to tell whether a sample came from $S_1$ or $S_2$.

Two-Sample Test via Discriminator

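With a discriminator, the two-sample test statistic can be set to the negative of the discriminator's loss. A low loss means the two sample sets are easy to tell apart, and a high loss means they are hard to tell apart. The discriminator is trained to maximize the test statistic, in support of the alternative hypothesis, which is the same as minimizing its classification loss. (The generative model wants the opposite: a statistic so low that the null hypothesis cannot be rejected.) The discriminator's objective can be written as follows.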

$\underset{D_\phi}{\max}V(p_\theta, D_\phi)=E_{\mathbf{x}\sim p_{data}}[\log D_\phi(\mathbf{x})]+E_{\mathbf{x}\sim p_\theta}[\log (1-D_\phi(\mathbf{x}))]\approx\underset{\mathbf{x}\in S_1}{\sum}\log D_\phi(\mathbf{x})+\underset{\mathbf{x}\in S_2}{\sum}\log (1-D_\phi(\mathbf{x}))$

In this expression, $p_\theta$ is the generative model (the source of $S_2$) and $p_{data}$ is the true data distribution (the source of $S_1$). The optimal discriminator is:

$D^*_\theta(\mathbf{x})=\frac{p_{data}(\mathbf{x})}{p_{data}(\mathbf{x})+p_\theta(\mathbf{x})}$
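
A short sketch of where this expression comes from, using a standard pointwise argument and assuming the discriminator may be any function with values in $(0,1)$; the symbols $a$, $b$, $y$ and $f$ are introduced here only for the derivation. Write the objective as an integral:

$V(p_\theta, D)=\int \left[p_{data}(\mathbf{x})\log D(\mathbf{x})+p_\theta(\mathbf{x})\log (1-D(\mathbf{x}))\right]d\mathbf{x}$

The integrand can be maximized separately at each $\mathbf{x}$. For fixed $a,b>0$, the function $f(y)=a\log y+b\log(1-y)$ is concave on $(0,1)$, and

$f'(y)=\frac{a}{y}-\frac{b}{1-y}=0 \iff y=\frac{a}{a+b}$

Taking $a=p_{data}(\mathbf{x})$ and $b=p_\theta(\mathbf{x})$ recovers the formula above.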

That is, in the best case $p_{data}=p_\theta$, the optimal discriminator outputs $1/2$ for every $\mathbf{x}$: it cannot do better than chance at telling the two sample sets apart.
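
A small numerical check can make this concrete. The sketch below is illustrative only: it assumes one-dimensional Gaussians whose densities are known in closed form (which is never the case for a real GAN), and the names `normal_pdf`, `make_optimal_discriminator` and `estimate_v` are hypothetical helpers. It estimates $V$ by averaging over samples, which differs from the plain sums in the objective only by a constant factor when both sample sets have the same size.

```python
import math
import random

def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def make_optimal_discriminator(p_data, p_model):
    """Return D*(x) = p_data(x) / (p_data(x) + p_model(x))."""
    def d_star(x):
        a, b = p_data(x), p_model(x)
        return a / (a + b)
    return d_star

def estimate_v(d, data_samples, model_samples):
    """Monte Carlo estimate of E_data[log D(x)] + E_model[log(1 - D(x))]."""
    real_term = sum(math.log(d(x)) for x in data_samples) / len(data_samples)
    fake_term = sum(math.log(1.0 - d(x)) for x in model_samples) / len(model_samples)
    return real_term + fake_term

rng = random.Random(0)
n = 5000
p_data = lambda x: normal_pdf(x, 0.0)
data_samples = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Case 1: the model N(2, 1) is far from the data N(0, 1).
p_far = lambda x: normal_pdf(x, 2.0)
far_samples = [rng.gauss(2.0, 1.0) for _ in range(n)]
d_star_far = make_optimal_discriminator(p_data, p_far)
v_star_far = estimate_v(d_star_far, data_samples, far_samples)
v_chance = estimate_v(lambda x: 0.5, data_samples, far_samples)  # always answers 1/2

# Case 2: the model matches the data exactly, so D* is 1/2 everywhere.
same_samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
d_star_same = make_optimal_discriminator(p_data, p_data)
v_star_same = estimate_v(d_star_same, data_samples, same_samples)

print(f"V(D*) when model differs : {v_star_far:.3f}")
print(f"V(D = 1/2)               : {v_chance:.3f}")
print(f"V(D*) when model = data  : {v_star_same:.3f}")
```

When the model differs from the data, $D^*$ scores clearly above the constant-$1/2$ discriminator. When the two distributions coincide, $D^*$ is itself the constant $1/2$ and $V$ equals $2\log\frac{1}{2}=-\log 4$.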

Generative Adversarial Networks

The $p_\theta$ above is the generative model; in a GAN it is called the generator ($G$). The generator takes a noise vector $\mathbf{z}$ as input and produces a data sample.

A GAN is a minimax game between the generator and the discriminator.

$\underset{G}{\min}\underset{D}{\max}V(G,D)=E_{\mathbf{x}\sim p_{data}}[\log D(\mathbf{x})]+E_{\mathbf{x}\sim p_G}[\log (1 - D(\mathbf{x}))]$

The generator wants the discriminator to be as unable as possible to tell the two sample sets apart, while the discriminator wants to distinguish generated samples from real ones as well as it can.

The GAN Training Algorithm

GAN training proceeds as follows.

sample a minibatch of $m$ training points $\mathbf{x}^{(1)},\mathbf{x}^{(2)},\cdots,\mathbf{x}^{(m)}$ from $\mathcal{D}$
sample a minibatch of $m$ noise vectors $\mathbf{z}^{(1)}, \mathbf{z}^{(2)},\cdots,\mathbf{z}^{(m)}$ from $p_\mathbf{z}$ (typically Gaussian)
update the discriminator parameters $\phi$ by gradient ascent
$\nabla_\phi V(G_\theta,D_\phi)=\frac{1}{m}\nabla_\phi\underset{i=1}{\overset{m}{\sum}}[\log D_\phi(\mathbf{x}^{(i)})+\log (1-D_\phi(G_\theta(\mathbf{z}^{(i)})))]$
update the generator parameters $\theta$ by gradient descent
$\nabla_\theta V(G_\theta,D_\phi)=\frac{1}{m}\nabla_\theta\underset{i=1}{\overset{m}{\sum}}\log (1-D_\phi(G_\theta(\mathbf{z}^{(i)})))$
repeat for a fixed number of epochs
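
The steps above can be sketched on a toy problem small enough to write the gradients by hand. Everything specific here is an assumption made for illustration: the data are one-dimensional $N(1,1)$, the generator is $G_\theta(z)=\theta+z$ with a single parameter, the discriminator is the logistic model $D_\phi(x)=\sigma(wx+b)$ with $\phi=(w,b)$, and `train_toy_gan` is a hypothetical name. A real GAN would use neural networks and automatic differentiation instead.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_toy_gan(data_mean=1.0, steps=2000, m=64, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = -1.0       # generator parameter: G_theta(z) = theta + z
    w, b = 0.0, 0.0    # discriminator parameters phi: D_phi(x) = sigmoid(w * x + b)
    for _ in range(steps):
        # Step 1: minibatch of m training points (toy data ~ N(data_mean, 1)).
        xs = [rng.gauss(data_mean, 1.0) for _ in range(m)]
        # Step 2: minibatch of m noise values z ~ N(0, 1), pushed through G.
        zs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        fakes = [theta + z for z in zs]

        # Step 3: gradient ascent on phi for
        # (1/m) * sum[log D(x_i) + log(1 - D(G(z_i)))].
        grad_w = grad_b = 0.0
        for x in xs:
            r = 1.0 - sigmoid(w * x + b)    # d/ds log sigmoid(s)
            grad_w += r * x
            grad_b += r
        for g in fakes:
            r = -sigmoid(w * g + b)         # d/ds log(1 - sigmoid(s))
            grad_w += r * g
            grad_b += r
        w += lr * grad_w / m
        b += lr * grad_b / m

        # Step 4: gradient descent on theta for (1/m) * sum log(1 - D(G(z_i))),
        # using d/dtheta log(1 - D(theta + z)) = -sigmoid(w * (theta + z) + b) * w.
        grad_theta = sum(-sigmoid(w * g + b) * w for g in fakes) / m
        theta -= lr * grad_theta
    return theta, w, b

theta, w, b = train_toy_gan()
print(f"theta = {theta:.3f}, w = {w:.3f}, b = {b:.3f}")
```

In this toy setting the generator's location parameter should drift toward the data mean while the discriminator's slope `w` shrinks toward zero, matching the picture of a discriminator that ends up unable to beat chance. The same alternating updates do not come with such a guarantee for general networks.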

In summary, training optimizes the following objective.

$\underset{\theta}{\min}\underset{\phi}{\max}V(G_\theta,D_\phi)=E_{\mathbf{x}\sim p_{data}}[\log D_\phi(\mathbf{x})]+E_{\mathbf{z}\sim p(\mathbf{z})}[\log (1 - D_\phi(G_\theta(\mathbf{z})))]$