Diffusion: Conditioned Generation

Junha Park · February 5, 2023

Advanced Computer Vision

This article is about conditioned generation with diffusion models.
Diffusion Models Beat GANs on Image Synthesis was the first work to guide image synthesis with classifier guidance, while more recent advances include other approaches such as classifier-free guidance.

1. Recap: Diffusion model

Generative modeling with a diffusion model can be understood as learning to reverse a gradual noising process, which intuitively corresponds to denoising. Ho et al. reduced the variational lower bound loss to a simple training objective: a mean-squared error between the true noise and the predicted noise.
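
As a rough sketch of this objective (assuming a PyTorch-style noise predictor `eps_model(x_t, t)` and a precomputed cumulative-product schedule `alpha_bar`, both hypothetical names), the simple loss can be written as:

```python
import torch
import torch.nn.functional as F

def simple_loss(eps_model, x0, alpha_bar, T):
    """L_simple: MSE between the true noise and the noise predicted from x_t."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)          # a random timestep per sample
    eps = torch.randn_like(x0)                                # the true noise
    a_bar = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))    # reshape for broadcasting
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps       # forward (noising) process
    return F.mse_loss(eps_model(x_t, t), eps)
```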

Remark:
Estimating the noise is also linked to the work of Song et al., which aims at estimating the gradients of the data distribution (the score) via score matching.

Sampling with a noise predictor $\epsilon_\theta(x^{(t)},t)$ amounts to learning a Gaussian transition kernel of the reverse diffusion process, where the mean $\mu_\theta(x^{(t)},t)$ can be computed as a function of $\epsilon_\theta(x^{(t)},t)$, while the variance $\Sigma_\theta(x^{(t)},t)$ can be fixed to a known constant or learned with a separate network head.
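
A minimal sketch of one unconditional reverse step with the fixed-variance choice $\Sigma = \beta_t I$ (the noise predictor and the $\beta_t$, $\alpha_t$, $\bar{\alpha}_t$ schedules are assumed to be given; the names are hypothetical):

```python
import torch

def p_sample(eps_model, x_t, t, beta, alpha, alpha_bar):
    """One reverse step x_t -> x_{t-1} with fixed variance Sigma = beta_t * I."""
    ts = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps_hat = eps_model(x_t, ts)
    # mu_theta(x_t, t) written as a function of the predicted noise
    mean = (x_t - beta[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alpha[t].sqrt()
    if t == 0:
        return mean                                    # no noise is added at the final step
    return mean + beta[t].sqrt() * torch.randn_like(x_t)
```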

There have been several improvements since Ho et al.; for instance, Nichol and Dhariwal found that fixing the variance $\Sigma_\theta(x^{(t)},t)$ is suboptimal when sampling with fewer diffusion steps. They proposed parametrizing $\Sigma_\theta(x^{(t)},t)$ by interpolating between $\beta_t$ and $\tilde{\beta}_t$ with a learned value $v$, i.e.

$$\Sigma_\theta(x^{(t)},t)=\exp\big(v\log\beta_t+(1-v)\log\tilde{\beta}_t\big)$$
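
Assuming the network additionally outputs an interpolation value `v` in $[0,1]$ alongside the noise prediction (a hypothetical tensor here), the learned variance could be computed as:

```python
import torch

def learned_variance(v, beta_t, beta_tilde_t):
    """Sigma_theta(x_t, t) = exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t))."""
    return torch.exp(v * torch.log(beta_t) + (1 - v) * torch.log(beta_tilde_t))
```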

2. Classifier guidance

GANs for conditional image synthesis make heavy use of class labels, both through class-conditional normalization statistics and through "discriminators" that explicitly behave like classifiers $p(y|x)$. Conditional diffusion models follow a similar philosophy: they incorporate class information into normalization layers and exploit a classifier to improve the diffusion generator.

Concretely, an adaptive group normalization (AdaGN) layer is proposed to incorporate class information into the normalization layers by explicitly applying the transformation

$$\text{AdaGN}(h,y)=y_s\,\text{GroupNorm}(h)+y_b, \\ \text{where } y=[y_s,y_b] \text{ is obtained from the timestep and class embedding}$$
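
A minimal AdaGN module sketch in PyTorch, assuming the combined timestep-and-class embedding `emb` is computed elsewhere:

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive GroupNorm: AdaGN(h, y) = y_s * GroupNorm(h) + y_b."""
    def __init__(self, emb_dim, channels, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.proj = nn.Linear(emb_dim, 2 * channels)   # produces [y_s, y_b]

    def forward(self, h, emb):
        y_s, y_b = self.proj(emb).chunk(2, dim=-1)     # split embedding into scale and shift
        y_s = y_s[:, :, None, None]                    # broadcast over spatial dimensions
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(h) + y_b
```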

In this paper, the authors introduce two methods of conditional sampling using classifiers, applied to DDPM and DDIM respectively.

Conditional Reverse Noising Process

$$x^{(t-1)} \sim \mathcal{N}\big(\mu+s\,\Sigma\,\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)}),\ \Sigma\big)$$

Classifier-guided diffusion sampling follows the equation above. In this section, we discuss the mathematical justification of this sampling process.

The unconditional reverse noising process of a conventional diffusion model, $p_\theta(x^{(t)}|x^{(t+1)})$, should be conditioned on a label $y$. We can use a classifier $p_\phi(y|x)$ to guide each transition as $p_{\theta,\phi}(x^{(t)}|x^{(t+1)},y)=Z\,p_\theta(x^{(t)}|x^{(t+1)})\,p_\phi(y|x^{(t)})$, where $Z$ is a normalization constant. Sampling from this distribution is intractable in most cases, but it can be approximated as a perturbed Gaussian with a shifted mean.

$$p_\theta(x^{(t)}|x^{(t+1)})=\mathcal{N}(\mu,\Sigma) \\ \log p_\theta(x^{(t)}|x^{(t+1)})=-\frac{1}{2}(x^{(t)}-\mu)^T\Sigma^{-1}(x^{(t)}-\mu)+C$$

We can approximate the log class probability $\log p_\phi(y|x^{(t)})$ with a Taylor expansion around $x^{(t)}=\mu$, under the assumption that $\log p_\phi(y|x^{(t)})$ has low curvature compared to $\Sigma^{-1}$.

$$\log p_\phi(y|x^{(t)})\approx\log p_\phi(y|x^{(t)})\big|_{x^{(t)}=\mu}+(x^{(t)}-\mu)\,\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)})\big|_{x^{(t)}=\mu}$$

Here, we substitute $g=\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)})\big|_{x^{(t)}=\mu}$ to obtain $\log p_\phi(y|x^{(t)})\approx(x^{(t)}-\mu)g+C$.

This gives

$$\log\big(p_\theta(x^{(t)}|x^{(t+1)})\,p_\phi(y|x^{(t)})\big) \approx -\frac{1}{2}(x^{(t)}-\mu)^T\Sigma^{-1}(x^{(t)}-\mu)+(x^{(t)}-\mu)g+C \\ =-\frac{1}{2}(x^{(t)}-\mu-\Sigma g)^T\Sigma^{-1}(x^{(t)}-\mu-\Sigma g)+C' \\ =\log p(z)+C'',\quad z \sim \mathcal{N}(\mu+\Sigma g,\Sigma)$$

In addition, the authors introduced a scale parameter $s$ to rescale the classifier gradients. Classifier-guided diffusion sampling can then be carried out as in the sketch below.
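
A sketch of the guided reverse step under these assumptions (`mu` and `sigma2` are the unconditional reverse mean and diagonal variance from the diffusion model, `classifier` is a separately trained noisy classifier, and `s` is the gradient scale; the names are hypothetical):

```python
import torch

def guided_p_sample(mu, sigma2, classifier, x_t, t, y, s=1.0):
    """x_{t-1} ~ N(mu + s * Sigma * grad_x log p(y|x_t), Sigma)."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_prob = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_prob[torch.arange(len(y)), y].sum()    # log p(y|x_t) for target labels
        grad = torch.autograd.grad(selected, x_in)[0]          # classifier gradient
    shifted_mean = mu + s * sigma2 * grad                      # shift mean by scaled gradient
    return shifted_mean + sigma2.sqrt() * torch.randn_like(x_t)
```

Note that the gradient is taken with respect to the noisy input $x^{(t)}$, which is why the classifier has to be trained on noised images.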

Conditional Sampling for DDIM

The conditional sampling described above is not valid for deterministic sampling methods such as DDIM. Instead, we can adopt a score-based conditioning trick to derive a new formulation of the estimated noise.

In score-based generative modeling, we have a noise predictor $\epsilon_\theta(x^{(t)},t)$, which can be used to derive a score function:

$$\nabla_{x^{(t)}}\log p_\theta(x^{(t)})=-\frac{1}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x^{(t)})$$

We can then derive a score function for $p_\theta(x^{(t)})\,p_\phi(y|x^{(t)})$:

$$\nabla_{x^{(t)}}\log\big(p_\theta(x^{(t)})\,p_\phi(y|x^{(t)})\big)=\nabla_{x^{(t)}}\log p_\theta(x^{(t)})+\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)})=-\frac{1}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x^{(t)})+\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)})$$

We obtain a new predictor $\hat{\epsilon}(x^{(t)})$ that corresponds to the score of the joint distribution: $\hat{\epsilon}(x^{(t)}) = \epsilon_\theta(x^{(t)})-\sqrt{1-\bar{\alpha}_t}\,\nabla_{x^{(t)}}\log p_\phi(y|x^{(t)})$. Thus, conditional sampling for DDIM can be done with the process below.
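
A sketch of one guided DDIM step with $\eta=0$ using the modified prediction $\hat{\epsilon}$ (the classifier, noise model, and $\bar{\alpha}$ schedule are assumed given; the names are hypothetical):

```python
import torch

def guided_ddim_step(eps_model, classifier, x_t, t, t_prev, y, alpha_bar, s=1.0):
    """Deterministic DDIM step with eps_hat = eps - sqrt(1 - a_bar) * s * grad log p(y|x_t)."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_prob = torch.log_softmax(classifier(x_in, t), dim=-1)
        grad = torch.autograd.grad(log_prob[torch.arange(len(y)), y].sum(), x_in)[0]
    eps_hat = eps_model(x_t, t) - (1 - alpha_bar[t]).sqrt() * s * grad
    x0_pred = (x_t - (1 - alpha_bar[t]).sqrt() * eps_hat) / alpha_bar[t].sqrt()   # predicted x_0
    return alpha_bar[t_prev].sqrt() * x0_pred + (1 - alpha_bar[t_prev]).sqrt() * eps_hat
```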

3. Next article

In this article, we discussed conditional generation with diffusion models. DDPM was already covered in the previous article, but DDIM and the concept of score matching have only been briefly introduced.
So in the next article, we will reveal the relationship between DDPM and energy-based models, including the noise-conditioned score networks (NCSN) proposed by Song et al., and DDIM.
