Diffusion Models Beat GANs on Image Synthesis

Seow · June 26, 2024

Today, I will introduce the well-known paper "Diffusion Models Beat GANs on Image Synthesis."

When this paper came out, many people were startled because it was the first time a diffusion model clearly surpassed a strong GAN-based model, BigGAN.

I do not cover the rigorous formulation of the diffusion process in this post. If you are not familiar with it, check my earlier posts in this series.

Motivations & Contributions ⭐️

Motivations 📚

  • The authors investigate why diffusion models had not yet surpassed GAN-based models in image fidelity.
  • First, GAN architectures have been explored and refined extensively for years.
  • Second, GAN-based models are able to trade off diversity for fidelity.

Contributions ⭐️

  • The authors find a more effective model architecture through a series of ablations.
  • They also propose classifier guidance to give diffusion models the same ability to trade off diversity for fidelity.

Methods

Architecture Improvements

They perform several architectural ablations to find an architecture that achieves the strongest performance.

They explore several architectural changes, such as the number of attention heads, applying attention at multiple resolutions, the number of residual blocks, and so on.

In accordance with these experimental results, they settle on the final model: 2 residual blocks per resolution, multi-head attention with 64 channels per head, attention at the 32, 16, and 8 resolutions, and BigGAN-style residual blocks for upsampling and downsampling.

In addition, they experiment with adaptive group normalization (AdaGN), a module that incorporates the timestep and class embedding into each residual block after group normalization:

$$\mathrm{AdaGN}(h, y) = y_s\,\mathrm{GroupNorm}(h) + y_b$$

They find that the AdaGN module performs better than a simpler conditioning scheme (addition followed by GroupNorm).

As a result, they adopt AdaGN modules throughout the model.
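
For illustration, here is a minimal PyTorch sketch of an AdaGN block. The assumption that $y_s$ and $y_b$ come from a single linear projection of the combined timestep/class embedding is mine, not necessarily the authors' exact implementation:

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive group normalization: scale and shift the group-normalized
    activations with y_s, y_b projected from the timestep/class embedding."""
    def __init__(self, emb_dim: int, channels: int, groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        # One linear layer producing both the scale y_s and the shift y_b.
        self.proj = nn.Linear(emb_dim, 2 * channels)

    def forward(self, h: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        # emb: (B, emb_dim) -> y_s, y_b: (B, C), broadcast over H, W.
        y_s, y_b = self.proj(emb).chunk(2, dim=1)
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(h) + y_b
```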

Classifier Guidance

According to the authors, the ability to trade off diversity for fidelity is a key reason why GAN-based models outperform diffusion-based models: GANs can generate high-quality samples without covering the whole data distribution.

Therefore, they incorporate class information through the gradient of a classifier, $\nabla_{x_t}\log p_{\phi}(y|x_t, t)$.

First, they define a reverse process guided by a classifier (Proof 1):

$$p_{\theta,\phi}(x_t|x_{t+1}, y) = Z\,p_\theta(x_t|x_{t+1})\,p_\phi(y|x_t)$$

Taking the log turns this product into a sum, so we can consider each term separately.

The first term, $p_\theta(x_t|x_{t+1})$, is the unconditional reverse process, which is modeled as a Gaussian $\mathcal{N}(\mu, \Sigma)$. Taking its log gives:

$$\log p_\theta(x_t | x_{t+1}) = -\frac{1}{2}(x_t - \mu)^T\Sigma^{-1}(x_t-\mu) + C$$

The second term, $p_{\phi}(y|x_t)$, is a classifier, and we can approximate $\log p_{\phi}(y|x_t)$ using a Taylor expansion around $x_t = \mu$. This is justified because $\log p_\phi(y|x_t)$ can be assumed to have low curvature compared to $\Sigma^{-1}$, which is reasonable in the limit of infinitely many diffusion steps, where $\|\Sigma\| \rightarrow 0$.

Keeping the first-order term of the expansion, $\log p_\phi(y|x_t) \approx (x_t - \mu)g + C_1$ with $g = \nabla_{x_t}\log p_\phi(y|x_t)\big|_{x_t=\mu}$. Adding the two log terms shows that the guided reverse process is again approximately Gaussian, but with a shifted mean: $x_t \sim \mathcal{N}(\mu + \Sigma g, \Sigma)$.

As a result, classifier-guided diffusion sampling amounts to running the usual reverse process while shifting each step's mean by $\Sigma g$ (Algorithm 1 in the paper).
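
Below is a minimal sketch of one guided reverse step in PyTorch, assuming a diagonal covariance $\Sigma$ and a hypothetical `classifier` that takes a noisy image batch and returns class logits:

```python
import torch

def guided_mean(mu, sigma, x_t, y, classifier, scale=1.0):
    """Shift the reverse-step mean by scale * Sigma * g, where
    g = grad_x log p(y | x_t) comes from a noisy-image classifier.
    sigma holds the (diagonal) covariance of the reverse step."""
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_in), dim=-1)  # (B, num_classes)
    selected = log_probs[range(len(y)), y].sum()             # sum_b log p(y_b | x_b)
    g = torch.autograd.grad(selected, x_in)[0]               # same shape as x_t
    # The full sampler then draws x_{t-1} ~ N(mu + scale * sigma * g, sigma).
    return mu + scale * sigma * g
```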

Conditional Sampling for DDIM

DDIM defines a deterministic sampling procedure, with no randomness between adjacent timestep states, so the stochastic guidance method from the previous section cannot be applied directly.

To obtain a conditional sampling method in this setting, the authors use a score-based conditioning trick from [1], which formulates the diffusion process in continuous time using an SDE. In this view, the score of the conditional distribution decomposes as:

$$\nabla_{x_t}\log\big(p_\theta(x_t)\,p_\phi(y|x_t)\big) = \nabla_{x_t}\log p_\theta(x_t) + \nabla_{x_t}\log p_\phi(y|x_t)$$

Thus, the authors express the $\epsilon_\theta$ term as a score function $\nabla_{x}\log p_\theta(x)$, using the relation below (Proof 2):

$$\nabla_{x_t}\log p_\theta(x_t) = -\frac{1}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t)$$

Substituting this relation into the conditional score above and solving for a new noise prediction yields the guided $\epsilon$ term:

$$\hat\epsilon(x_t) := \epsilon_\theta(x_t) - \sqrt{1-\bar\alpha_t}\,\nabla_{x_t}\log p_\phi(y|x_t)$$

Sampling then proceeds with the regular DDIM update, using $\hat\epsilon$ in place of $\epsilon_\theta$ (Algorithm 2 in the paper).
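
As a minimal sketch of this substitution, reusing the same hypothetical `classifier` interface as above (not the paper's code):

```python
import torch

def guided_eps(eps, x_t, y, classifier, alpha_bar_t, scale=1.0):
    """eps_hat = eps - sqrt(1 - alpha_bar_t) * scale * grad_x log p(y | x_t)."""
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_in), dim=-1)
    selected = log_probs[range(len(y)), y].sum()
    g = torch.autograd.grad(selected, x_in)[0]
    return eps - (1.0 - alpha_bar_t) ** 0.5 * scale * g
```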

Scaling Classifier Gradients

The authors found it necessary to scale the classifier gradients by a constant factor larger than 1. With a scale of 1, the classifier assigned reasonable probabilities (around 50%) to the desired classes, yet upon visual inspection the samples did not match those classes; scaling up the classifier gradients alleviated this problem.
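
The paper offers a useful way to see why scaling works: multiplying the gradient by $s$ is equivalent to sampling from a sharpened classifier distribution, since

$$s\,\nabla_x \log p(y|x) = \nabla_x \log \frac{1}{Z}\,p(y|x)^s$$

and for $s > 1$ the renormalized $p(y|x)^s$ concentrates more of its mass on the most likely class.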

As the figures in the paper illustrate, a larger gradient scale increases fidelity (precision) at the cost of diversity (recall).

Results

State-of-the-Art Image Synthesis

With these architectural improvements and classifier guidance, the resulting models achieve state-of-the-art sample quality; for example, on ImageNet 256×256 the guided model (ADM-G) reaches an FID of 4.59, surpassing BigGAN-deep (6.95).

Reference

[1] Song, Yang, et al. "Score-based generative modeling through stochastic differential equations." arXiv preprint arXiv:2011.13456 (2020).

Proof / Appendix

1. Classifier-guided reverse process

$$p_{\theta,\phi}(x_t|x_{t+1}, y) = Z\,p_\theta(x_t|x_{t+1})\,p_\phi(y|x_t)$$
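
A short sketch of why this holds. By Bayes' rule,

$$p(x_t|x_{t+1}, y) = \frac{p(x_t|x_{t+1})\,p(y|x_t, x_{t+1})}{p(y|x_{t+1})}$$

Because the forward noising step $x_t \rightarrow x_{t+1}$ adds label-independent Gaussian noise, $y$ is conditionally independent of $x_{t+1}$ given $x_t$, so $p(y|x_t, x_{t+1}) = p(y|x_t)$. The denominator $p(y|x_{t+1})$ does not depend on $x_t$, so it is absorbed into the constant $Z$.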

2. Score function from $\epsilon_\theta$

$$\nabla_{x_t}\log p_\theta(x_t) = -\frac{1}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t)$$
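
A short sketch. The forward process gives $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, so $p(x_t|x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t)I)$ and

$$\nabla_{x_t}\log p(x_t|x_0) = -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t} = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}$$

Replacing $\epsilon$ with the network's prediction $\epsilon_\theta(x_t)$ gives the stated relation.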
