Classifier-Free Diffusion Guidance

Seow · June 27, 2024

Following the previous post, I will introduce the "Classifier-Free Diffusion Guidance" paper.

I assume you are already familiar with the previous paper, "Diffusion Models Beat GANs on Image Synthesis". If not, please refer to this post first: https://velog.io/@yhyj1001/Diffusion-Models-Beat-GANs-on-Image-Synthesis.

I will refer to the previous paper's method as "ADM-G," short for "Ablated Diffusion Model with Guidance."

Motivations & Contributions ⭐️

Motivations 📚

  • ADM-G requires an auxiliary classifier model $q_\phi(c|x_t)$, which has to be pre-trained on noised data $x_t$. As a result, the overall training pipeline becomes more complex.
  • Inception Score (IS) and Fréchet Inception Distance (FID) are computed with a classifier, namely the Inception model. The authors argue that classifier-guided sampling, which mixes a score estimate with the gradient of a classifier, can therefore be interpreted as confusing such a classifier with a gradient-based adversarial attack.

Contributions ⭐️

  • The authors propose classifier-free diffusion guidance.
  • They show that diffusion models can trade off diversity for fidelity without an auxiliary classifier.

Methods

Classifier Guidance

In the previous post, we defined a conditional reverse process with a scale factor $w$.
In this paper, the authors use the log-SNR parameter $\lambda$ instead of a discrete timestep, because they define a continuous-time variance-preserving diffusion process.
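For reference, in the paper's notation the variance-preserving forward process is parameterized by the log-SNR $\lambda$ roughly as follows (a brief summary of the notation only; the exact sampling distribution over $\lambda$ is specified in the paper):

$$z_\lambda = \alpha_\lambda\, x + \sigma_\lambda\, \epsilon, \qquad \alpha_\lambda^2 = \frac{1}{1+e^{-\lambda}}, \quad \sigma_\lambda^2 = 1-\alpha_\lambda^2, \quad \lambda = \log\frac{\alpha_\lambda^2}{\sigma_\lambda^2}, \qquad \epsilon \sim \mathcal{N}(0, I)$$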

$$\nabla_{z_\lambda}\log \tilde p_\theta(z_\lambda|c) = \nabla_{z_\lambda}\left[\log p(z_\lambda | c) + w \log p_\theta(c|z_\lambda)\right] = -\frac{1}{\sigma_\lambda}\left[\epsilon_\theta(z_\lambda, c) - w\,\sigma_\lambda \nabla_{z_\lambda}\log p_\theta(c|z_\lambda)\right]$$

First, they write the conditional reverse process as the equation above. As in the previous paper, the score estimate satisfies $\epsilon_\theta(z_\lambda, c) \approx -\sigma_\lambda \nabla_{z_\lambda}\log p(z_\lambda | c)$. The score is written with $p(z_\lambda | c)$ rather than $p(z_\lambda)$ because the ADM network already receives the class condition through its AdaGN modules.

$$\begin{aligned}
\tilde \epsilon_\theta(z_\lambda, c) &= \epsilon_\theta (z_\lambda) - (w+1)\,\sigma_\lambda \nabla_{z_\lambda}\log p_\theta(c|z_\lambda) \\
&= -\sigma_\lambda\nabla_{z_\lambda}\left[\log p(z_\lambda|c) + w\log p_\theta(c|z_\lambda)\right] \\
&= -\sigma_\lambda\nabla_{z_\lambda}\left[\log p(z_\lambda) + (1+w)\log p_\theta(c|z_\lambda)\right]
\end{aligned}$$

As a result, the authors can express the guided score estimate above as $\tilde \epsilon_\theta(z_\lambda, c) = -\sigma_\lambda \nabla_{z_\lambda}\log \tilde p_\theta(z_\lambda|c)$. (Proof 1)

Classifier-Free Guidance

In this section, the authors propose the classifier-free guidance method for a conditional reverse process.

They were inspired by the gradient of an implicit classifier: (Proof 2)

$$\nabla_{z_\lambda}\log p(c|z_\lambda) = -\frac{1}{\sigma_\lambda}\left[\epsilon^*(z_\lambda, c) - \epsilon^*(z_\lambda)\right]$$

By the above definition, they derive the linear combination of the conditional and unconditional score estimates:

$$\tilde\epsilon_\theta(z_\lambda, c) = (1+w)\,\epsilon_\theta(z_\lambda, c) - w\,\epsilon_\theta(z_\lambda)$$

This expression can be derived by combining the equations above. (Proof 3)

Strictly speaking, (Proof 2) assumes that the score estimates are optimal, i.e., $\epsilon^*$. However, the authors report that, empirically, guidance with the learned (non-optimal) score estimates still performs well, even with this naive substitution.

Based on the above definition, they present the classifier-free guidance algorithm: joint training with conditioning dropout, followed by guided sampling. A rough sketch is given below.
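Since the algorithm is easiest to read as code, here is a minimal PyTorch-style sketch of the two pieces: a joint training step that drops the class label with probability `p_uncond`, and the guided noise prediction used at sampling time. The names `diffusion_model` and `null_token`, and the uniform sampling of the log-SNR, are illustrative assumptions rather than the paper's actual implementation.

```python
import torch

def training_step(diffusion_model, x, c, p_uncond=0.1, null_token=0):
    """One joint training step: with probability p_uncond the class label is
    replaced by a null token, so a single network learns both the conditional
    and the unconditional score estimates."""
    drop = torch.rand(c.shape[0], device=c.device) < p_uncond
    c = torch.where(drop, torch.full_like(c, null_token), c)

    # Sample log-SNR values; the paper uses a specific distribution over
    # lambda, a uniform range is just a stand-in here.
    lam = torch.empty(x.shape[0], device=x.device).uniform_(-20.0, 20.0)
    alpha = torch.sigmoid(lam).sqrt().view(-1, 1, 1, 1)   # alpha^2 = 1 / (1 + e^{-lambda})
    sigma = torch.sigmoid(-lam).sqrt().view(-1, 1, 1, 1)  # sigma^2 = 1 - alpha^2

    # Variance-preserving forward process: z_lambda = alpha * x + sigma * eps.
    eps = torch.randn_like(x)
    z = alpha * x + sigma * eps

    # Standard epsilon-prediction loss.
    return ((diffusion_model(z, lam, c) - eps) ** 2).mean()

def guided_eps(diffusion_model, z, lam, c, w, null_token=0):
    """Classifier-free guidance: tilde_eps = (1 + w) * eps_cond - w * eps_uncond."""
    eps_cond = diffusion_model(z, lam, c)
    eps_uncond = diffusion_model(z, lam, torch.full_like(c, null_token))
    return (1 + w) * eps_cond - w * eps_uncond
```

At sampling time, each denoising step simply uses `guided_eps` in place of the usual conditional noise prediction; everything else in the sampler stays the same.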

Results

They show that the classifier-free guidance method can trade off diversity for fidelity through the guidance scale and the unconditional training probability. According to the results table, the best sample fidelity but the poorest diversity is obtained with the lowest unconditional training probability and the highest guidance scale, which appears as a high IS together with a high FID (the high FID here reflects the loss of diversity).

Proof / Appendix

1.
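A sketch of this derivation, assuming the conditional reverse process from the previous post is defined by $\tilde p_\theta(z_\lambda|c) \propto p(z_\lambda|c)\,p_\theta(c|z_\lambda)^w$:

$$\begin{aligned}
-\sigma_\lambda \nabla_{z_\lambda}\log \tilde p_\theta(z_\lambda|c)
&= -\sigma_\lambda \nabla_{z_\lambda}\left[\log p(z_\lambda|c) + w\log p_\theta(c|z_\lambda)\right] \\
&= \epsilon_\theta(z_\lambda, c) - w\,\sigma_\lambda \nabla_{z_\lambda}\log p_\theta(c|z_\lambda) \\
&= \epsilon_\theta(z_\lambda) - (w+1)\,\sigma_\lambda \nabla_{z_\lambda}\log p_\theta(c|z_\lambda) = \tilde\epsilon_\theta(z_\lambda, c)
\end{aligned}$$

The last step uses Bayes' rule, $\log p(z_\lambda|c) = \log p(z_\lambda) + \log p_\theta(c|z_\lambda) + \text{const}$, so that $\epsilon_\theta(z_\lambda, c) = \epsilon_\theta(z_\lambda) - \sigma_\lambda\nabla_{z_\lambda}\log p_\theta(c|z_\lambda)$.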

2.
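A sketch, assuming exact scores $\epsilon^*(z_\lambda, c) = -\sigma_\lambda\nabla_{z_\lambda}\log p(z_\lambda|c)$ and $\epsilon^*(z_\lambda) = -\sigma_\lambda\nabla_{z_\lambda}\log p(z_\lambda)$: Bayes' rule gives the implicit classifier $p^i(c|z_\lambda) \propto p(z_\lambda|c)/p(z_\lambda)$, so

$$\nabla_{z_\lambda}\log p^i(c|z_\lambda) = \nabla_{z_\lambda}\log p(z_\lambda|c) - \nabla_{z_\lambda}\log p(z_\lambda) = -\frac{1}{\sigma_\lambda}\left[\epsilon^*(z_\lambda, c) - \epsilon^*(z_\lambda)\right]$$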

3.
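A sketch: substituting the implicit-classifier gradient from (Proof 2) into the guided score of (Proof 1), with the learned $\epsilon_\theta$ in place of $\epsilon^*$:

$$\tilde\epsilon_\theta(z_\lambda, c) = \epsilon_\theta(z_\lambda, c) - w\,\sigma_\lambda\nabla_{z_\lambda}\log p(c|z_\lambda) = \epsilon_\theta(z_\lambda, c) + w\left[\epsilon_\theta(z_\lambda, c) - \epsilon_\theta(z_\lambda)\right] = (1+w)\,\epsilon_\theta(z_\lambda, c) - w\,\epsilon_\theta(z_\lambda)$$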
