Today, I will introduce the well-known paper "Diffusion Models Beat GANs on Image Synthesis."
When this paper was introduced, many people were startled because it was the first time a diffusion model completely eclipsed the state-of-the-art GAN-based model, BigGAN.
I will not explain the rigorous formulation of the diffusion process in this post. If you are not familiar with it, check these posts.
They perform several architectural ablations to find the architecture that achieves the best performance. They explore several architectural changes, such as the number of attention heads, applying attention at multiple resolutions, employing BigGAN-style residual blocks, and so on.
In accordance with the above experimental results, they determine the final model: 2 residual blocks per resolution, multiple attention heads with 64 channels per head, attention at the 32, 16, and 8 resolutions, and BigGAN residual blocks for upsampling and downsampling; a concrete summary follows.
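To make the final configuration concrete, here is a small illustrative summary as a Python dictionary. The key names below are my own shorthand, not the authors' actual configuration flags.

```python
# Illustrative summary of the final ablated architecture (key names are hypothetical).
final_model_config = {
    "num_res_blocks": 2,                    # residual blocks per resolution
    "num_head_channels": 64,                # attention head width; heads = channels / 64
    "attention_resolutions": (32, 16, 8),   # feature-map sizes where attention is applied
    "resblock_updown": True,                # BigGAN residual blocks for up/downsampling
    "use_adagn": True,                      # adaptive group normalization (next section)
}
```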
In addition, they experiment with adaptive group normalization (AdaGN). This module incorporates the timestep and class embeddings into each residual block after a group normalization layer: $\mathrm{AdaGN}(h, y) = y_s \, \mathrm{GroupNorm}(h) + y_b$, where $y = [y_s, y_b]$ comes from a linear projection of the embeddings.
They find that the AdaGN module achieves better performance than the simpler conditioning method (addition followed by group normalization).
Consequently, they also employ the AdaGN module in the final model.
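Here is a minimal PyTorch sketch of AdaGN, assuming the scale $y_s$ and shift $y_b$ come from a single linear projection of the summed timestep and class embeddings; the module and argument names are mine, for illustration only.

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive group normalization (sketch): scale and shift the
    group-normalized activations with values predicted from an embedding."""

    def __init__(self, channels, emb_dim, num_groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, channels)  # num_groups must divide channels
        self.proj = nn.Linear(emb_dim, 2 * channels)    # predicts [y_s, y_b]

    def forward(self, h, emb):
        # emb: sum of timestep and class embeddings, shape (B, emb_dim)
        y_s, y_b = self.proj(emb).chunk(2, dim=1)
        y_s = y_s[:, :, None, None]  # broadcast over the spatial dimensions
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(h) + y_b
```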
According to the authors, the ability to trade off diversity for fidelity is one reason why GAN-based models achieve better performance than diffusion-based models: GANs can generate high-quality samples, but they do not cover the whole data distribution.
Therefore, the authors try to incorporate class information through a classifier.
First of all, they define a reverse process guided by a classifier (Proof 1):

$$p_{\theta,\phi}(x_t \mid x_{t+1}, y) = Z \, p_\theta(x_t \mid x_{t+1}) \, p_\phi(y \mid x_t)$$

where $Z$ is a normalizing constant. This multiplication can be split with a log transformation, so we can consider each term separately.
The first term is the unconditional reverse process, so it can be represented as a Gaussian distribution $p_\theta(x_t \mid x_{t+1}) = \mathcal{N}(\mu, \Sigma)$. Taking the log, this term is, up to a constant:

$$\log p_\theta(x_t \mid x_{t+1}) = -\frac{1}{2} (x_t - \mu)^\top \Sigma^{-1} (x_t - \mu) + C$$
The second term is the classifier, and we can approximate $\log p_\phi(y \mid x_t)$ using a Taylor expansion around $x_t = \mu$. This is because we can presume that $\log p_\phi(y \mid x_t)$ has relatively low curvature compared to $\Sigma^{-1}$. (As you know, the diffusion process assumes infinitely many diffusion steps, so $\|\Sigma\| \to 0$.) The expansion gives:

$$\log p_\phi(y \mid x_t) \approx (x_t - \mu)^\top g + C_1, \quad g = \nabla_{x_t} \log p_\phi(y \mid x_t) \big|_{x_t = \mu}$$

Summing the two log terms, we can derive:

$$\log \big( p_\theta(x_t \mid x_{t+1}) \, p_\phi(y \mid x_t) \big) \approx \log p(z) + C_2, \quad z \sim \mathcal{N}(\mu + \Sigma g, \Sigma)$$

In other words, classifier guidance simply shifts the mean of each Gaussian reverse step by $\Sigma g$.
As a result, we can conduct classifier-guided diffusion sampling by drawing from this shifted Gaussian at every step (Algorithm 1 in the paper); a sketch is shown below.
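To make the procedure concrete, here is a minimal PyTorch sketch of one guided sampling step. The interfaces are my assumptions, not the authors' actual code: `diffusion_model` is assumed to return the mean and diagonal variance of $p_\theta(x_{t-1} \mid x_t)$, and `classifier` to return class logits for a noisy input and timestep.

```python
import torch

def guided_ddpm_step(x_t, t, y, diffusion_model, classifier, scale=1.0):
    """One classifier-guided DDPM sampling step (sketch of Algorithm 1).

    Assumed interfaces (hypothetical):
      diffusion_model(x, t) -> (mean, var): mean / diagonal variance of p(x_{t-1} | x_t)
      classifier(x, t)      -> class logits for the noisy input x at timestep t
    """
    mean, var = diffusion_model(x_t, t)
    # Gradient of log p(y | x_t) with respect to x_t.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[range(len(y)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]
    # Shift the Gaussian mean by Sigma * g, scaled by the guidance factor s.
    guided_mean = mean + scale * var * grad
    noise = torch.randn_like(x_t)  # in a full sampler, omit the noise at t == 0
    return guided_mean + var.sqrt() * noise
```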
In DDIM, the sampling process is deterministic: there is no randomness between adjacent timestep states. Therefore, the stochastic sampling method proposed in the above section cannot be used.
To find a conditional sampling method in this setting, the authors use a score-based conditioning trick from [1], which formulates the diffusion process in continuous time with an SDE. According to that paper, the conditional score is defined as:

$$\nabla_{x_t} \log \big( p_\theta(x_t) \, p_\phi(y \mid x_t) \big) = \nabla_{x_t} \log p_\theta(x_t) + \nabla_{x_t} \log p_\phi(y \mid x_t)$$
Thus, the authors represent the noise prediction $\epsilon_\theta(x_t)$ as a score function (Proof 2):

$$\nabla_{x_t} \log p_\theta(x_t) = -\frac{1}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t)$$
Substituting this form into the conditional score, they define the conditional reverse process and derive the modified noise prediction:

$$\hat{\epsilon}(x_t) := \epsilon_\theta(x_t) - \sqrt{1 - \bar{\alpha}_t} \, \nabla_{x_t} \log p_\phi(y \mid x_t)$$

This $\hat{\epsilon}$ simply replaces $\epsilon_\theta$ in the usual deterministic DDIM update.
The overall algorithm corresponds to Algorithm 2 in the paper; a sketch follows.
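Here is a minimal PyTorch sketch of the modified noise prediction. As before, the `eps_model` and `classifier` interfaces are my assumptions for illustration only.

```python
import math
import torch

def guided_eps(x_t, t, y, eps_model, classifier, alpha_bar_t, scale=1.0):
    """Classifier-guided noise prediction for DDIM (sketch of Algorithm 2).

    Implements: eps_hat = eps - sqrt(1 - alpha_bar_t) * s * grad log p(y | x_t)
    Assumed interfaces (hypothetical):
      eps_model(x, t)  -> predicted noise eps_theta(x_t)
      classifier(x, t) -> class logits for the noisy input x at timestep t
      alpha_bar_t      -> float, cumulative product of alphas at timestep t
    """
    eps = eps_model(x_t, t)
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[range(len(y)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]
    return eps - math.sqrt(1.0 - alpha_bar_t) * scale * grad
```

The returned $\hat{\epsilon}$ is then fed into the standard DDIM update in place of the original noise prediction.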
The authors found it necessary to scale the classifier gradients by a constant factor larger than 1. Without scaling, the classifier assigned reasonable probabilities (around 50%) to the desired classes, yet the samples did not match those classes upon visual inspection. Scaling up the classifier gradients alleviated this problem.
As the figures illustrate, a larger scale value trades diversity for fidelity: precision (fidelity) rises while recall (diversity) falls.
STATE-OF-THE-ART IMAGE SYNTHESIS
With the improved architecture (dubbed ADM) and classifier guidance, the model achieves an FID of 4.59 on ImageNet 256x256, surpassing BigGAN-deep (FID 6.95) and setting a new state of the art.
[1] Song, Yang, et al. "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456 (2020).