Diffusion : Road to DDPM

Junha Park · February 4, 2023

Series: Advanced Computer Vision (1/2)

This article covers the fundamentals of diffusion models.

1. Fundamentals of diffusion process

Reference: Sohl-Dickstein et al., "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (2015)

Tradeoff between tractability & flexibility

Tractable models can be analytically evaluated and easily fit to data, but they cannot describe the complex structure of rich datasets. On the other hand, flexible models can fit arbitrary data structures; however, when we represent a flexible distribution as $p(x)=\frac{\phi(x)}{Z}$, the normalization constant $Z$ is generally intractable: evaluating it typically requires a computationally expensive Monte Carlo procedure.
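To make the intractability of $Z$ concrete, here is a minimal sketch (my own illustration, not from the paper) of estimating the normalization constant of an unnormalized density by Monte Carlo; the density `phi` and the Gaussian proposal are arbitrary choices. In one dimension this is cheap, but the number of samples needed grows rapidly with dimension, which is exactly what makes flexible models expensive to normalize.

```python
import numpy as np

# Importance-sampling estimate of Z = ∫ phi(x) dx for an unnormalized density phi.
# Both phi and the standard-normal proposal are illustrative assumptions.
rng = np.random.default_rng(0)

def phi(x):
    return np.exp(-x**4 / 4.0)                       # unnormalized density phi(x)

def q_pdf(x):
    return np.exp(-x**2 / 2.0) / np.sqrt(2 * np.pi)  # proposal density q(x) = N(0, 1)

xs = rng.standard_normal(100_000)                    # samples from the proposal
Z_hat = np.mean(phi(xs) / q_pdf(xs))                 # Z = E_q[phi(x) / q(x)]
print(f"estimated Z ≈ {Z_hat:.4f}")
```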

Diffusion probabilistic model

A diffusion probabilistic model uses a Markov chain to gradually convert one distribution into another: the aim is to build a generative Markov chain that converts a simple distribution into the target data distribution.
Recall that a generative model aims at learning the data distribution $p(x)$; a Markov chain parameterized by $\theta$ is learned to convert the simple distribution $q(\cdot)$ into $p(x)$.

The diffusion process is described by a Markov diffusion kernel $T_\pi(y|y';\beta)$, where $\beta$ is the diffusion rate. Starting from the simple distribution, we can repeatedly apply the Markov diffusion kernel:

$$\pi(y)= \int T_\pi(y|y';\beta)\,\pi(y')\,dy', \qquad q(x^{(t)}|x^{(t-1)})=T_\pi(x^{(t)}|x^{(t-1)};\beta_t)$$

The forward trajectory performs $T$ steps of diffusion, transforming the data distribution $q(x^{(0)})$ into a simple distribution. The simple distribution can be a Gaussian with identity covariance for the Gaussian diffusion process, or an independent binomial distribution for the binomial diffusion process. The forward diffusion process is formulated as below.

$$q(x^{(0,...,T)}) =q(x^{(0)})\prod_{t=1}^T q(x^{(t)}|x^{(t-1)})$$
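As a concrete illustration, here is a minimal sketch (my own, not the paper's code) of a Gaussian forward trajectory: starting from a data sample, the Gaussian diffusion kernel is applied $T$ times. The linear $\beta_t$ schedule is an assumption for illustration only.

```python
import torch

# Forward trajectory: T applications of the Gaussian kernel
# q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # diffusion rates beta_t (assumed schedule)

def forward_trajectory(x0: torch.Tensor) -> list:
    xs = [x0]
    x = x0
    for t in range(T):
        noise = torch.randn_like(x)
        x = torch.sqrt(1.0 - betas[t]) * x + torch.sqrt(betas[t]) * noise
        xs.append(x)
    return xs                               # x_T is approximately N(0, I)
```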

The reverse trajectory, learned as the reverse of the forward diffusion process, enables transformation of the simple distribution back into the data-generating distribution. We want the final destination of the reverse trajectory, $p(x^{(T)})$, to be the tractable, simple distribution $\pi(x^{(T)})$. Then how can we find the reverse diffusion process? For both Gaussian and binomial diffusion, it is known that for sufficiently small step size $\beta$, the reversal of the diffusion process has the identical functional form as the forward process.

Since the form of the reverse process is fixed, it suffices to estimate the parameters of the diffusion kernel at every timestep. For the Gaussian diffusion kernel, $f_\mu(x^{(t)},t)$ and $f_\Sigma(x^{(t)},t)$ are estimated with a neural network; for the binomial diffusion kernel, $f_b(x^{(t)},t)$ is estimated. For all results in the paper, multi-layer perceptrons are used to define these functions.
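As a rough sketch of what estimating the kernel parameters with an MLP could look like (a minimal version of my own, not the paper's architecture), the network below takes $x^{(t)}$ together with an embedded timestep and predicts a mean and a diagonal log-variance:

```python
import torch
import torch.nn as nn

# Time-conditioned MLP predicting f_mu(x_t, t) and a diagonal log-variance for f_Sigma(x_t, t).
class ReverseKernelMLP(nn.Module):
    def __init__(self, dim: int, hidden: int = 128, T: int = 1000):
        super().__init__()
        self.t_embed = nn.Embedding(T, hidden)            # simple learned timestep embedding
        self.net = nn.Sequential(
            nn.Linear(dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * dim),                   # outputs [mu, log_var]
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor):
        h = torch.cat([x_t, self.t_embed(t)], dim=-1)
        mu, log_var = self.net(h).chunk(2, dim=-1)
        return mu, log_var
```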

Training: Maximizing the model log likelihood

The training objective of the diffusion probabilistic model is the log likelihood, described below.

$$L = \mathbb{E}_{x^{(0)}\sim q(\cdot)}[\log p(x^{(0)})] = \int dx^{(0)}\, q(x^{(0)})\log p(x^{(0)})$$

This objective involves an intractable integral, but the Jarzynski equality enables its computation by evaluating the relative probability of the forward and reverse trajectories.

Remark) The Jarzynski equality gives
$$p(x^{(0)})=\int dx^{(1,...,T)}\, q(x^{(1,...,T)}|x^{(0)})\frac{p(x^{(0,...,T)})}{q(x^{(1,...,T)}|x^{(0)})} = \int dx^{(1,...,T)}\, q(x^{(1,...,T)}|x^{(0)})\, p(x^{(T)})\prod_{t=1}^T\frac{p(x^{(t-1)}|x^{(t)})}{q(x^{(t)}|x^{(t-1)})}$$

Thus the objective function becomes

$$L=\int dx^{(0)}\, q(x^{(0)})\log p(x^{(0)}) = \int dx^{(0)}\, q(x^{(0)})\cdot \log\left[\int dx^{(1,...,T)}\, q(x^{(1,...,T)}|x^{(0)})\cdot p(x^{(T)})\prod_{t=1}^T\frac{p(x^{(t-1)}|x^{(t)})}{q(x^{(t)}|x^{(t-1)})}\right]$$

A lower bound of the objective function is obtained for variational inference; the lower bound is constructed with a Jensen-inequality-style derivation. For simplicity, assume $L\geq K$ holds, so that $K$ is the lower bound. Training then consists of finding the reverse Markov transition kernel that maximizes this lower bound of the log likelihood.

$$\hat{p}(x^{(t-1)}|x^{(t)})=\argmax_{p(x^{(t-1)}|x^{(t)})}K$$
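For reference, here is a sketch of the Jensen step that produces such a bound (the standard variational argument, not a line-by-line reproduction of the paper's appendix). Since $\log$ is concave and $q(x^{(1,...,T)}|x^{(0)})$ is a probability distribution,

$$L = \int dx^{(0)}\, q(x^{(0)})\log\left[\int dx^{(1,...,T)}\, q(x^{(1,...,T)}|x^{(0)})\, p(x^{(T)})\prod_{t=1}^T\frac{p(x^{(t-1)}|x^{(t)})}{q(x^{(t)}|x^{(t-1)})}\right] \\ \geq \int dx^{(0,...,T)}\, q(x^{(0,...,T)})\log\left[p(x^{(T)})\prod_{t=1}^T\frac{p(x^{(t-1)}|x^{(t)})}{q(x^{(t)}|x^{(t-1)})}\right] =: K$$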

The mathematical derivation of the lower bound is presented at the end of this article; this is the concept underlying the "denoising process" introduced in the paper by Ho et al. (2020).

2. Denoising diffusion probabilistic models: intuitive explanation

Ho et al. (2020) provide a much more intuitive explanation of diffusion probabilistic models by introducing the concept of denoising, which 'seems' visually analogous to existing deep generative models such as VAEs, GANs, or normalizing flows.

Key properties of diffusion process

Denoising diffusion probabilistic models (DDPM) adopt Gaussian noise for both the forward and reverse diffusion processes.

$$q(x^{(t)}|x^{(t-1)})=\mathcal{N}(x^{(t)};\sqrt{1-\beta_t}\,x^{(t-1)},\beta_t\mathbf{I}), \qquad p_\theta(x^{(t-1)}|x^{(t)})=\mathcal{N}(x^{(t-1)};\mu_\theta(x^{(t)},t),\Sigma_\theta(x^{(t)},t))$$
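As a rough sketch of how the learned reverse kernel would be used to generate samples (ancestral sampling; the `model` interface returning a mean and log-variance is my assumption, not the paper's exact parameterization):

```python
import torch

# Ancestral sampling with p_theta(x_{t-1} | x_t): start from x_T ~ N(0, I) and step backwards.
@torch.no_grad()
def reverse_sample(model, shape, T: int = 1000) -> torch.Tensor:
    x = torch.randn(shape)                                     # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        mu, log_var = model(x, t_batch)                        # parameters of the reverse Gaussian
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mu + torch.exp(0.5 * log_var) * noise
    return x                                                   # approximate sample from p(x_0)
```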

A key property of the forward process is that we can sample $x^{(t)}$ directly from a given $x^{(0)}$ without computing $t$ timesteps: since each Markov transition kernel is Gaussian, the composition of $t$ consecutive Gaussian transitions is also Gaussian. In particular, we can exploit Gaussian properties to prove that

$$q(x^{(t)}|x^{(0)})=\mathcal{N}(x^{(t)};\sqrt{\bar{\alpha}_t}\,x^{(0)},(1-\bar{\alpha}_t)\mathbf I)$$

using the notation $\alpha_t=1-\beta_t$, $\bar{\alpha}_t=\prod_{s=1}^t\alpha_s$.

The equation above is justified by the following derivation. First, we can reparametrize the Markov transition in closed form:
$$q(x^{(t)}|x^{(t-1)})=\mathcal{N}(x^{(t)};\sqrt{1-\beta_t}\,x^{(t-1)},\beta_t\mathbf{I}) \;\Longleftrightarrow\; x^{(t)}=\sqrt{1-\beta_t}\,x^{(t-1)}+\sqrt{\beta_t}\,\epsilon_{t-1}, \quad \epsilon_{t-1}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$$

Applying this recursively, and recalling that a sum of independent Gaussians is again Gaussian, we get

$$x^{(t)}=\sqrt{\alpha_t}\,x^{(t-1)}+\sqrt{1-\alpha_t}\,\epsilon_{t-1} \\ =\sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,x^{(t-2)}+\sqrt{1-\alpha_{t-1}}\,\epsilon_{t-2}\right)+\sqrt{1-\alpha_t}\,\epsilon_{t-1} \\ =\sqrt{\alpha_t\alpha_{t-1}}\,x^{(t-2)}+\sqrt{1-\alpha_t\alpha_{t-1}}\,\epsilon'_{t-2} = \cdots = \sqrt{\bar{\alpha}_t}\,x^{(0)}+\sqrt{1-\bar{\alpha}_t}\,\epsilon$$

Therefore, $q(x^{(t)}|x^{(0)})=\mathcal{N}(x^{(t)};\sqrt{\bar{\alpha}_t}\,x^{(0)},(1-\bar{\alpha}_t)\mathbf{I})$.
Let's call this the nice property.
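A minimal sketch of the nice property in code (the schedule and tensor shapes are illustrative assumptions): sampling $x^{(t)}$ in one shot from $x^{(0)}$.

```python
import torch

# Direct sampling x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)                  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    eps = torch.randn_like(x0)
    ab = alpha_bars.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch
    return torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps
```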

Another notable property of the forward process concerns the transition kernel conditioned on the initial point $x^{(0)}$. It can be shown that $q(x^{(t-1)}|x^{(t)},x^{(0)})$, the reverse-direction transition conditioned on $x^{(0)}$, is tractable:

$$q(x^{(t-1)}|x^{(t)},x^{(0)})=\mathcal{N}(x^{(t-1)};\tilde\mu_t(x^{(t)},x^{(0)}),\tilde{\beta}_t\mathbf I), \text{ where } \\ \tilde{\mu}_t(x^{(t)},x^{(0)})=\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}x^{(0)}+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x^{(t)}, \qquad \tilde{\beta}_t=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$$
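A sketch of this conditioned kernel in code (again with an assumed linear $\beta_t$ schedule), returning $\tilde{\mu}_t$ and $\tilde{\beta}_t$:

```python
import torch

# Posterior q(x_{t-1} | x_t, x_0): closed-form Gaussian mean and variance.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_posterior(x0: torch.Tensor, xt: torch.Tensor, t: int):
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
    mean = (torch.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)) * x0 \
         + (torch.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)) * xt
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]
    return mean, var                                       # \tilde{mu}_t and \tilde{beta}_t
```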

Why are these properties so important? Previously, we mentioned the lower bound $K$ of the log likelihood. Through a mathematical derivation, we can show that the expected negative log likelihood has an upper bound of

$$\mathbb{E}_{q(\cdot)}\left[D_{KL}(q(x^{(T)}|x^{(0)})\,\|\,p(x^{(T)})) + \sum_{t>1}D_{KL}(q(x^{(t-1)}|x^{(t)},x^{(0)})\,\|\,p_\theta(x^{(t-1)}|x^{(t)}))-\log p_\theta(x^{(0)}|x^{(1)})\right]$$

The first term, $D_{KL}(q(x^{(T)}|x^{(0)})\,\|\,p(x^{(T)}))$, is denoted $L_T$; it is constant ($x^{(T)}$ is Gaussian noise and $q(\cdot)$ has no learnable parameters) and can therefore be ignored during training.
The second term, a summation of $T-1$ elements, can be written as $L_{t-1}=D_{KL}(q(x^{(t-1)}|x^{(t)},x^{(0)})\,\|\,p_\theta(x^{(t-1)}|x^{(t)}))$ for $t=2, ..., T$. So the variational lower bound loss $L_{VLB}$ can be separated into

$$L_T+L_{T-1}+...+L_{0}, \text{ where} \\ L_T = D_{KL}(q(x^{(T)}|x^{(0)})\,\|\,p(x^{(T)})) \\ L_t = D_{KL}(q(x^{(t)}|x^{(t+1)},x^{(0)})\,\|\,p_\theta(x^{(t)}|x^{(t+1)})) \\ L_0 = -\log p_\theta(x^{(0)}|x^{(1)})$$

Parameterization of KL divergence term for training loss design

The main contribution of Ho et al. (2020) is the simple parameterization of the loss term expressed as a KL divergence. Recall that we would like to train a neural network to learn the reverse diffusion process, $p_\theta(x^{(t-1)}|x^{(t)})=\mathcal{N}(x^{(t-1)};\mu_\theta(x^{(t)},t),\Sigma_\theta(x^{(t)},t))$, and the KL divergence term $L_t$ is responsible for that purpose.
$L_t= D_{KL}(q(x^{(t)}|x^{(t+1)},x^{(0)})\,\|\,p_\theta(x^{(t)}|x^{(t+1)}))$, and the KL divergence evaluates the discrepancy between two distributions. The first distribution, $q(x^{(t)}|x^{(t+1)},x^{(0)})$, is Gaussian with mean $\tilde{\mu}_t$ and variance $\tilde{\beta}_t\mathbf{I}$, and its mean is simplified by the nice property.

$$\tilde{\mu}_t(x^{(t)},x^{(0)})=\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}x^{(0)}+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x^{(t)}, \qquad x^{(t)}=\sqrt{\bar{\alpha}_t}\,x^{(0)}+\sqrt{1-\bar{\alpha}_t}\,\epsilon \\ \Rightarrow\; \tilde{\mu}_t=\frac{1}{\sqrt{\alpha_t}}\left(x^{(t)}-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon(x^{(t)},t)\right)$$
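The arrow above hides one substitution; spelled out (my own intermediate step, following the standard derivation), the nice property gives $x^{(0)}=\frac{1}{\sqrt{\bar{\alpha}_t}}\left(x^{(t)}-\sqrt{1-\bar{\alpha}_t}\,\epsilon\right)$, and plugging this into $\tilde{\mu}_t$ yields

$$\tilde{\mu}_t=\left[\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{(1-\bar{\alpha}_t)\sqrt{\bar{\alpha}_t}}+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\right]x^{(t)}-\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t\sqrt{1-\bar{\alpha}_t}}{(1-\bar{\alpha}_t)\sqrt{\bar{\alpha}_t}}\,\epsilon =\frac{1}{\sqrt{\alpha_t}}x^{(t)}-\frac{1-\alpha_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}}\,\epsilon$$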

Thus the loss term $L_t$ simplifies to an MSE loss if we ignore the weighting constant. Intuitively, reducing the KL divergence between two Gaussians amounts to predicting the mean and variance of the reverse diffusion process, which leads to 'noise' estimation, formulated as $L_t=\mathbb{E}_{x_0,\epsilon_t}\left[\|\epsilon_t-\epsilon_\theta(\sqrt{\bar{\alpha}_t}\,x^{(0)}+\sqrt{1-\bar{\alpha}_t}\,\epsilon_t,\, t)\|^2\right]$. Additionally, the term $L_0$ is modeled by a separate discrete decoder so that the final step of the reverse process generates plausible RGB images.
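A minimal sketch of this simplified objective as a training step (the $\epsilon$-predicting `model` and the $\beta_t$ schedule are assumptions; Ho et al. use a U-Net, which is not reproduced here):

```python
import torch
import torch.nn.functional as F

# Simplified DDPM loss: noise x_0 with the nice property at a random timestep
# and regress the injected noise with an MSE.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)            # uniformly sampled timestep
    eps = torch.randn_like(x0)
    ab = alpha_bars.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps     # sample from q(x_t | x_0)
    return F.mse_loss(model(x_t, t), eps)                      # ||eps - eps_theta(x_t, t)||^2
```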

3. Next article

This article tries to explain the concept of DDPM, but the mathematical link with score-based generative modeling and Langevin dynamics has not been explained.

Later, we'll be able to reveal the relationship between DDPM and energy-based models, including the noise-conditioned score networks (NCSN) proposed by Song et al.

In the next article, we'll discuss conditional generation with diffusion models using classifier guidance.
