Diffusion: Conditioned Generation

Junha Park · February 5, 2023

Advanced Computer Vision

This article is about conditioned generation with diffusion models.
Diffusion Models Beat GANs on Image Synthesis was the first work to guide image synthesis with classifier guidance, while more recent advances include other approaches such as classifier-free guidance.

1. Recap: Diffusion model

Generative modeling with a diffusion model can be understood as learning to reverse a gradual noising process, which intuitively corresponds to denoising. Ho et al. reduced the variational lower bound loss to a simple training objective: a mean-squared error between the true noise and the predicted noise.
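
As a rough sketch of this objective (assuming a PyTorch-style noise predictor `eps_model(x_t, t)` and a precomputed cumulative-product schedule `alpha_bar`, both hypothetical names), the simple loss can be written as:

```python
import torch
import torch.nn.functional as F

def simple_loss(eps_model, x0, alpha_bar, T):
    """L_simple: MSE between the true noise and the noise predicted from x_t."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)          # a random timestep per sample
    eps = torch.randn_like(x0)                                # the true noise
    a_bar = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))    # reshape for broadcasting
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps       # forward (noising) process
    return F.mse_loss(eps_model(x_t, t), eps)
```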

Remark:
Estimating the noise is also linked to the work of Song et al., which aims at estimating the gradients of the data distribution (the score) via score matching.

Sampling with a noise predictor $\epsilon_\theta(x^{(t)},t)$ amounts to learning a Gaussian transition kernel of the reverse diffusion process, where the mean $\mu_\theta(x^{(t)},t)$ can be computed as a function of $\epsilon_\theta(x^{(t)},t)$, while the variance $\Sigma_\theta(x^{(t)},t)$ can be fixed to a known constant or learned with a separate network head.
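
A minimal sketch of one unconditional reverse step with the fixed-variance choice $\Sigma = \beta_t I$ (the noise predictor and the $\beta_t$, $\alpha_t$, $\bar{\alpha}_t$ schedules are assumed to be given; the names are hypothetical):

```python
import torch

def p_sample(eps_model, x_t, t, beta, alpha, alpha_bar):
    """One reverse step x_t -> x_{t-1} with fixed variance Sigma = beta_t * I."""
    ts = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps_hat = eps_model(x_t, ts)
    # mu_theta(x_t, t) written as a function of the predicted noise
    mean = (x_t - beta[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alpha[t].sqrt()
    if t == 0:
        return mean                                    # no noise is added at the final step
    return mean + beta[t].sqrt() * torch.randn_like(x_t)
```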

There have been several improvements since Ho et al.; for instance, Nichol and Dhariwal found that fixing the variance $\Sigma_\theta(x^{(t)},t)$ is suboptimal when sampling with fewer diffusion steps. They proposed parametrizing $\Sigma_\theta(x^{(t)},t)$ by interpolating between $\beta_t$ and $\tilde{\beta}_t$ with a learned value $v$, i.e.

$$\Sigma_\theta(x^{(t)},t)=\exp\big(v\log\beta_t+(1-v)\log\tilde{\beta}_t\big)$$
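
Assuming the network additionally outputs an interpolation value `v` in $[0,1]$ alongside the noise prediction (a hypothetical tensor here), the learned variance could be computed as:

```python
import torch

def learned_variance(v, beta_t, beta_tilde_t):
    """Sigma_theta(x_t, t) = exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t))."""
    return torch.exp(v * torch.log(beta_t) + (1 - v) * torch.log(beta_tilde_t))
```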

2. Classifier guidance

GANs for conditional image synthesis make heavy use of class labels, both through class-conditional normalization statistics and through "discriminators" that explicitly behave like classifiers $p(y|x)$. Conditional diffusion models follow a similar philosophy: they incorporate class information into normalization layers and exploit a classifier to improve the diffusion generator.

Concretely, an adaptive group normalization (AdaGN) layer is proposed to incorporate class information into the normalization layers by explicitly applying the transformation

$$\text{AdaGN}(h,y)=y_s\,\text{GroupNorm}(h)+y_b, \\ \text{where } y=[y_s,y_b] \text{ is obtained from the timestep and class embedding}$$
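
A minimal AdaGN module sketch in PyTorch, assuming the combined timestep-and-class embedding `emb` is computed elsewhere:

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive GroupNorm: AdaGN(h, y) = y_s * GroupNorm(h) + y_b."""
    def __init__(self, emb_dim, channels, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.proj = nn.Linear(emb_dim, 2 * channels)   # produces [y_s, y_b]

    def forward(self, h, emb):
        y_s, y_b = self.proj(emb).chunk(2, dim=-1)     # split embedding into scale and shift
        y_s = y_s[:, :, None, None]                    # broadcast over spatial dimensions
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(h) + y_b
```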

In this paper, the authors introduce two methods of conditional sampling using classifiers, applied to DDPM and DDIM respectively.

Conditional Reverse Noising Process

$$x^{(t-1)} \sim \mathcal{N}\big(\mu+s\,\Sigma\,\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)}),\ \Sigma\big)$$

Classifier-guided diffusion sampling follows the equation above. In this section, we discuss the mathematical justification of this sampling process.

The unconditional reverse noising process of a conventional diffusion model, $p_\theta(x^{(t)}|x^{(t+1)})$, should be conditioned on a label $y$. We can use a classifier $p_\phi(y|x)$ to guide each transition as $p_{\theta,\phi}(x^{(t)}|x^{(t+1)},y)=Z\,p_\theta(x^{(t)}|x^{(t+1)})\,p_\phi(y|x^{(t)})$, where $Z$ is a normalization constant. Sampling from this distribution is intractable in most cases, but it can be approximated as a perturbed Gaussian with a shifted mean.

$$p_\theta(x^{(t)}|x^{(t+1)})=\mathcal{N}(\mu,\Sigma) \\ \log p_\theta(x^{(t)}|x^{(t+1)})=-\frac{1}{2}(x^{(t)}-\mu)^T\Sigma^{-1}(x^{(t)}-\mu)+C$$

We can approximate the log class probability $\log p_\phi(y|x^{(t)})$ with a Taylor expansion around $x^{(t)}=\mu$, under the assumption that $\log p_\phi(y|x^{(t)})$ has low curvature compared to $\Sigma^{-1}$.

$$\log p_\phi(y|x^{(t)})\approx\log p_\phi(y|x^{(t)})\big|_{x^{(t)}=\mu}+(x^{(t)}-\mu)\,\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)})\big|_{x^{(t)}=\mu}$$

Here, we substitute $g=\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)})\big|_{x^{(t)}=\mu}$ to obtain $\log p_\phi(y|x^{(t)})\approx(x^{(t)}-\mu)g+C$.

This gives

$$\log\big(p_\theta(x^{(t)}|x^{(t+1)})\,p_\phi(y|x^{(t)})\big) \approx -\frac{1}{2}(x^{(t)}-\mu)^T\Sigma^{-1}(x^{(t)}-\mu)+(x^{(t)}-\mu)g+C \\ =-\frac{1}{2}(x^{(t)}-\mu-\Sigma g)^T\Sigma^{-1}(x^{(t)}-\mu-\Sigma g)+C' \\ =\log p(z)+C'',\quad z \sim \mathcal{N}(\mu+\Sigma g,\Sigma)$$

In addition, the authors introduced a scale parameter $s$ to rescale the classifier gradients. Classifier-guided diffusion sampling can then be carried out as in the sketch below.
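
A sketch of the guided reverse step under these assumptions (`mu` and `sigma2` are the unconditional reverse mean and diagonal variance from the diffusion model, `classifier` is a separately trained noisy classifier, and `s` is the gradient scale; the names are hypothetical):

```python
import torch

def guided_p_sample(mu, sigma2, classifier, x_t, t, y, s=1.0):
    """x_{t-1} ~ N(mu + s * Sigma * grad_x log p(y|x_t), Sigma)."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_prob = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_prob[torch.arange(len(y)), y].sum()    # log p(y|x_t) for target labels
        grad = torch.autograd.grad(selected, x_in)[0]          # classifier gradient
    shifted_mean = mu + s * sigma2 * grad                      # shift mean by scaled gradient
    return shifted_mean + sigma2.sqrt() * torch.randn_like(x_t)
```

Note that the gradient is taken with respect to the noisy input $x^{(t)}$, which is why the classifier has to be trained on noised images.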

Conditional Sampling for DDIM

The conditional sampling described above is not valid for deterministic sampling methods such as DDIM. Instead, we can adopt a score-based conditioning trick to derive a new formulation of the estimated noise.

In score-based generative modeling, we have a noise predictor $\epsilon_\theta(x^{(t)},t)$, which can be used to derive a score function:

$$\nabla_{x^{(t)}}\log p_\theta(x^{(t)})=-\frac{1}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x^{(t)})$$

We can then derive a score function for $p_\theta(x^{(t)})\,p_\phi(y|x^{(t)})$:

$$\nabla_{x^{(t)}}\log\big(p_\theta(x^{(t)})\,p_\phi(y|x^{(t)})\big)=\nabla_{x^{(t)}}\log p_\theta(x^{(t)})+\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)})=-\frac{1}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x^{(t)})+\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)})$$

We obtain a new predictor $\hat{\epsilon}(x^{(t)})$ that corresponds to the score of the joint distribution: $\hat{\epsilon}(x^{(t)}) = \epsilon_\theta(x^{(t)})-\sqrt{1-\bar{\alpha}_t}\,\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)})$. Thus, conditional sampling for DDIM can be done with the process below.
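
A sketch of one guided DDIM step with $\eta=0$ using the modified prediction $\hat{\epsilon}$ (the classifier, noise model, and $\bar{\alpha}$ schedule are assumed given; the names are hypothetical):

```python
import torch

def guided_ddim_step(eps_model, classifier, x_t, t, t_prev, y, alpha_bar, s=1.0):
    """Deterministic DDIM step with eps_hat = eps - sqrt(1 - a_bar) * s * grad log p(y|x_t)."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_prob = torch.log_softmax(classifier(x_in, t), dim=-1)
        grad = torch.autograd.grad(log_prob[torch.arange(len(y)), y].sum(), x_in)[0]
    eps_hat = eps_model(x_t, t) - (1 - alpha_bar[t]).sqrt() * s * grad
    x0_pred = (x_t - (1 - alpha_bar[t]).sqrt() * eps_hat) / alpha_bar[t].sqrt()   # predicted x_0
    return alpha_bar[t_prev].sqrt() * x0_pred + (1 - alpha_bar[t_prev]).sqrt() * eps_hat
```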

3. Next article

In this article, we discussed conditional generation with diffusion models. DDPM was already covered in the previous article, but DDIM and the concept of score matching have only been briefly introduced.
So in the next article, we will reveal the relationship between DDPM and energy-based models, including the noise-conditioned score networks (NCSN) proposed by Song et al., and DDIM.
