WGAN

HeyHo · December 8, 2022


- WGAN

WGAN proposes using the Wasserstein distance to measure the distance between two distributions.

1. Why is Wasserstein better than JS or KL?

(1) Suppose that we have two probability distributions P and Q

Assume two joint probability distributions whose supports do not overlap, defined as follows.

$\forall(x, y) \in P, x=0$ and $y \sim U(0,1)$
$\forall(x, y) \in Q, x=\theta, 0 \leq \theta \leq 1$ , and $y \sim U(0,1)$

(2) when $\theta \neq 0$

Computing the KL divergence and the JS divergence under these conditions gives

$D_{K L}(P \| Q)=\sum_{x=0, y \sim U(0,1)} 1 \cdot \log \frac{1}{0}=+\infty$
$D_{K L}(Q \| P)=\sum_{x=\theta, y \sim U(0,1)} 1 \cdot \log \frac{1}{0}=+\infty$
$D_{J S}(P, Q)=\frac{1}{2}\left(\sum_{x=0, y \sim U(0,1)} 1 \cdot \log \frac{1}{\frac{1}{2}}+\sum_{x=\theta, y \sim U(0,1)} 1 \cdot \log \frac{1}{\frac{1}{2}}\right)=\log 2$
Both KL divergences are infinite and the JS divergence is the constant $\log 2$ regardless of $\theta$, so the gradient with respect to $\theta$ is 0.

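The Wasserstein distance, on the other hand, is the cost of the cheapest way to move one distribution onto the other. Here that means sliding the line along the $x$-axis until the two distributions overlap, so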

$W(P,Q) = |\theta|$

(3) when $\theta = 0$

$D_{K L}(P \| Q)=D_{K L}(Q \| P)=D_{J S}(P, Q)=0$
$W(P,Q) = 0 = |\theta|$

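Cases (2) and (3) together show that the Wasserstein distance is a much smoother distance measure than the KL and JS divergences: it shrinks continuously to 0 as $\theta \to 0$, whereas KL and JS jump discontinuously at $\theta = 0$.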
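The comparison above can be checked numerically. The sketch below is a simplification, not part of the original derivation: because both distributions share the same $y \sim U(0,1)$, it collapses the $y$ coordinate and treats P and Q as unit point masses at $x=0$ and $x=\theta$. The helper names (`kl`, `js`, `w1_point_masses`, `compare`) are made up for this illustration.

```python
import math

def kl(p, q):
    """KL divergence between discrete distributions given as {point: mass} dicts."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # p puts mass where q has none
        total += px * math.log(px / qx)
    return total

def js(p, q):
    """Jensen-Shannon divergence, using the mixture m = (p + q) / 2."""
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_point_masses(a, b):
    """W1 between unit point masses at a and b: all mass travels |a - b|."""
    return abs(a - b)

def compare(theta):
    p = {0.0: 1.0}      # P collapsed onto x = 0
    q = {theta: 1.0}    # Q collapsed onto x = theta
    return kl(p, q), js(p, q), w1_point_masses(0.0, theta)

for theta in (1.0, 0.5, 0.1, 0.0):
    print(theta, compare(theta))
```

For every $\theta \neq 0$ the KL and JS columns stay fixed while only the last column follows $\theta$, which is the property that gives a generator a usable gradient.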

2. Kantorovich-Rubinstein duality

So far we have seen that when two distributions do not overlap, the Wasserstein distance gives far more meaningful information about how far apart they are than the KL or JS divergence.
However, in $W\left(p_r, p_\theta\right)=\inf _{\gamma \in \Pi\left(p_r, p_\theta\right)} \mathbb{E}_{(x, y) \sim \gamma}[\|x-y\|]$,
$\Pi\left(p_r, p_\theta\right)$ is the set of all possible joint distributions whose marginals are $p_r$ and $p_\theta$,
so computing the Wasserstein distance by considering every one of them is practically impossible.
This is where Kantorovich-Rubinstein duality comes in: instead of solving the primal problem directly, it converts it into a dual problem.

(1) Highly intractable term in inf

$W\left(\mathbb{P}_r, \mathbb{P}_g\right)=\inf _{\gamma \in \Pi\left(\mathbb{P}_r, \mathbb{P}_g\right)} \mathbb{E}_{(x, y) \sim \gamma}[\|x-y\|]$ is highly intractable.
Kantorovich-Rubinstein duality therefore replaces it with the following dual problem, posed over functions $f: X \rightarrow \mathbb{R}$ that satisfy a Lipschitz continuity condition.

$W\left(p_r, p_\theta\right)=\frac{1}{K} \sup _{\|f\|_L \leq K} \left(\mathbb{E}_{x \sim p_r}[f(x)]-\mathbb{E}_{x \sim p_\theta}[f(x)]\right)$
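As a rough sanity check of the dual form with $K=1$, the sketch below compares the two sides on one-dimensional samples. It relies on two assumptions that are not in the original text: the samples are 1D and of equal size, in which case the primal $W_1$ between the empirical distributions is the mean absolute difference of the sorted samples, and the candidate functions are a hand-picked set of 1-Lipschitz functions rather than the true supremum. All names are illustrative.

```python
import math
import random

def w1_sorted(xs, ys):
    """Primal W1 for two equal-size 1D samples: the optimal coupling pairs sorted values."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def dual_value(f, xs, ys):
    """E_{x~p_r}[f(x)] - E_{x~p_theta}[f(x)], estimated on samples."""
    return sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys)

random.seed(0)
real = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for p_r
fake = [random.gauss(2.0, 1.0) for _ in range(1000)]  # stand-in for p_theta

# Every candidate is 1-Lipschitz, so each dual value is a lower bound on W1.
candidates = {
    "f(x) = -x": lambda x: -x,
    "f(x) = x": lambda x: x,
    "f(x) = sin(x)": math.sin,
    "f(x) = -|x - 1|": lambda x: -abs(x - 1.0),
}

primal = w1_sorted(real, fake)
for name, f in candidates.items():
    print(f"{name:16s} dual = {dual_value(f, real, fake):+.4f}   primal W1 = {primal:.4f}")
```

No candidate exceeds the primal value, and the better the choice of $f$, the closer the dual value gets to it. Finding that best $f$ is the job handed to a neural network.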

In this way, duality gives an alternative definition of the Wasserstein distance. So what is $f$?

(2) Find optimal $f(x)$

$f$ is simply whichever function maximizes $\mathbb{E}_{x \sim p_r}[f(x)]-\mathbb{E}_{x \sim p_\theta}[f(x)]$ among the functions allowed by the Lipschitz constraint.

To estimate this $f$, we use a neural network with parameters $w$ and look for the $f_w$ that satisfies the expression below. (A neural network is used because it is a universal function approximator.)
$\max _{w \in \mathcal{W}} \left(\mathbb{E}_{x \sim p_r}\left[f_w(x)\right]-\mathbb{E}_{x \sim p_\theta}\left[f_w(x)\right]\right) \leq \sup _{\|f\|_L \leq K} \left(\mathbb{E}_{x \sim p_r}[f(x)]-\mathbb{E}_{x \sim p_\theta}[f(x)]\right)=K \cdot W(p_r, p_\theta)$

All we need is a good estimate of the $f$ that attains the sup, and $K$ only rescales the distance by a constant factor, so we do not need to know $K$ in detail.

To find the $f_w$ that attains $\max _{w \in \mathcal{W}} \left(\mathbb{E}_{x \sim p_r}\left[f_w(x)\right]-\mathbb{E}_{x \sim p_\theta}\left[f_w(x)\right]\right)$, we take the gradient with respect to $w$.
Using $\nabla_w\left[f_w(x)-f_w(g_\theta(z))\right]$, we update the parameters $w$ of $f_w$ so that it approximates the solution $f$ of the sup problem.

(3) Generator update process

Once $f_w$ has been found with the neural network, we update the generator's parameters $\theta$.

$\begin{aligned} \nabla_\theta W\left(p_r, p_g\right) & =\nabla_\theta\left(\mathbb{E}_{x \sim p_r}\left[f_w(x)\right]-\mathbb{E}_{z \sim Z}\left[f_w\left(g_\theta(z)\right)\right]\right) \\ & =-\mathbb{E}_{z \sim Z}\left[\nabla_\theta f_w\left(g_\theta(z)\right)\right]\end{aligned}$
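The generator gradient above can be verified on a tiny example with a finite-difference check. Everything specific here is an assumption made for illustration: a scalar function $f_w(x)=\tanh(wx)$, a scalar generator $g_\theta(z)=\theta z$, and a fixed list of noise samples `Z`. The point is only that the real-data term contributes nothing to $\nabla_\theta$.

```python
import math

Z = [-1.5, -0.5, 0.2, 0.7, 1.3]  # fixed noise samples so the check is deterministic

def f_w(x, w):
    """Hypothetical scalar critic."""
    return math.tanh(w * x)

def g_theta(z, theta):
    """Hypothetical scalar generator."""
    return theta * z

def objective(theta, w, real):
    """E_{x~p_r}[f_w(x)] - E_z[f_w(g_theta(z))], estimated on samples."""
    real_term = sum(f_w(x, w) for x in real) / len(real)
    fake_term = sum(f_w(g_theta(z, theta), w) for z in Z) / len(Z)
    return real_term - fake_term

def analytic_grad(theta, w):
    """-E_z[d/dtheta f_w(g_theta(z))], using d/dtheta tanh(w*theta*z) = w*z*(1 - tanh^2)."""
    return -sum(w * z * (1.0 - math.tanh(w * theta * z) ** 2) for z in Z) / len(Z)

def numeric_grad(theta, w, real, h=1e-5):
    """Central finite difference of the full objective with respect to theta."""
    return (objective(theta + h, w, real) - objective(theta - h, w, real)) / (2 * h)

real = [0.3, 1.1, 2.0]
print(analytic_grad(1.2, 0.8), numeric_grad(1.2, 0.8, real))
```

The two printed values should agree to several decimal places, even though `analytic_grad` never looks at `real`.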

(4) Weight Clipping

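Recall the sup problem from earlier: Kantorovich-Rubinstein duality turns the primal problem into a dual problem, at the price of a constraint that $f$ must be Lipschitz continuous. The network $f_w$ that approximates $f$ must therefore be Lipschitz continuous as well. The gradient of $f_w$ with respect to its input is determined by the weights (for a single linear layer it is exactly the weight vector), so clamping every weight to a fixed range $[-c, c]$ bounds that gradient and keeps $f_w$ $K$-Lipschitz for some $K$ that depends on $c$ and the architecture. WGAN enforces the Lipschitz condition in this way, through weight clipping.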
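A minimal sketch of the idea, assuming the simplest possible network, a single linear layer $f_w(x)=\sum_i w_i x_i$ (the function names are illustrative). After clipping each weight to $[-c, c]$, the Cauchy-Schwarz inequality gives $|f_w(x)-f_w(y)| \leq \|w\|_2\|x-y\|_2 \leq c\sqrt{d}\,\|x-y\|_2$, so the clipped layer is Lipschitz with constant at most $c\sqrt{d}$.

```python
import math
import random

def clip_weights(weights, c):
    """Clamp every weight into [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]

def linear_critic(weights, x):
    """Single linear layer: f_w(x) = sum_i w_i * x_i."""
    return sum(w * xi for w, xi in zip(weights, x))

def lipschitz_ratio(weights, x, y):
    """|f_w(x) - f_w(y)| / ||x - y||_2 for one pair of points."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return abs(linear_critic(weights, x) - linear_critic(weights, y)) / dist

rng = random.Random(0)
d, c = 8, 0.01
raw = [rng.uniform(-5.0, 5.0) for _ in range(d)]
clipped = clip_weights(raw, c)

pairs = [([rng.gauss(0, 1) for _ in range(d)], [rng.gauss(0, 1) for _ in range(d)])
         for _ in range(200)]
worst_raw = max(lipschitz_ratio(raw, x, y) for x, y in pairs)
worst_clipped = max(lipschitz_ratio(clipped, x, y) for x, y in pairs)
print("bound c*sqrt(d):", c * math.sqrt(d))
print("worst ratio before clipping:", worst_raw)
print("worst ratio after clipping: ", worst_clipped)
```

For deeper networks the bound is a product over layers and depends on the activation functions, so this only illustrates the mechanism and is not a general proof.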

(5) Total training process

The overall process, in pseudocode terms, is as follows.
1. First, convert the Earth Mover's distance optimization problem into its dual. Then, with $\theta$ fixed, train the network to find $f_w(x)$, the approximate solution of the dual problem.
2. Backpropagate the estimated Wasserstein distance to update the generator's parameters.
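The two steps can be sketched end to end on a toy problem. This is an illustration under stated assumptions, not the reference algorithm: the data are one-dimensional, the function is a single scalar weight $f_w(x)=wx$, the generator is a pure shift $g_\theta(z)=z+\theta$, and plain gradient steps are used. The hyperparameter names and values (`n_critic`, `lr_critic`, `lr_gen`, `c`) are choices made for this sketch.

```python
import random

def clip(value, c):
    """Weight clipping: keep the scalar weight in [-c, c]."""
    return max(-c, min(c, value))

def mean(values):
    return sum(values) / len(values)

def train_toy_wgan(real_mean=3.0, iters=300, n_critic=5, batch=64,
                   lr_critic=0.5, lr_gen=0.05, c=1.0, seed=0):
    rng = random.Random(seed)
    w, theta = 0.0, 0.0
    for _ in range(iters):
        # Step 1: with theta fixed, ascend E[f_w(real)] - E[f_w(fake)] in w.
        for _ in range(n_critic):
            real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
            fake = [rng.gauss(0.0, 1.0) + theta for _ in range(batch)]
            grad_w = mean(real) - mean(fake)      # d/dw of w*mean(real) - w*mean(fake)
            w = clip(w + lr_critic * grad_w, c)   # gradient ascent, then clip
        # Step 2: update the generator by descending -E_z[f_w(g_theta(z))].
        grad_theta = -w                           # d/dtheta of -w * mean(z + theta)
        theta -= lr_gen * grad_theta
    return theta, w

theta, w = train_toy_wgan()
print("learned shift theta:", theta, "| final weight w:", w)
```

With these settings the learned shift should end up near the real mean of 3, typically oscillating slightly around it because $w$ flips sign whenever $\theta$ overshoots.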
