This article covers the fundamentals of diffusion models.
1. Fundamentals of diffusion process
Deep Unsupervised Learning using Nonequilibrium Thermodynamics: Sohl-Dickstein et al.
Tradeoff between tractability & flexibility
Tractable models can be analytically evaluated and easily fit to data, but they are unable to describe complex structure in rich datasets. On the other hand, flexible models can fit arbitrary data structures. However, when we represent a flexible distribution as $p(x) = \frac{\phi(x)}{Z}$, the normalization constant $Z$ is generally intractable: evaluating it typically requires a computationally expensive Monte Carlo process.
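To make the cost of $Z$ concrete, here is a minimal sketch (purely illustrative; the density `phi` and all names are assumptions, not from the paper) that estimates the normalization constant of an unnormalized 1-D density by Monte Carlo importance sampling. In high dimensions this kind of estimate quickly becomes prohibitively expensive, which is exactly the tradeoff described above.

```python
import numpy as np

# Hypothetical unnormalized density phi(x) of a "flexible" 1-D model.
def phi(x):
    return np.exp(-x**2 / 2) * (1.0 + 0.5 * np.sin(3.0 * x)) ** 2

# Monte Carlo estimate of Z = integral phi(x) dx via importance sampling
# with a standard normal proposal distribution.
rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)
proposal_pdf = np.exp(-samples**2 / 2) / np.sqrt(2.0 * np.pi)
Z_estimate = np.mean(phi(samples) / proposal_pdf)
print(f"Estimated normalization constant Z ~ {Z_estimate:.4f}")
```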
Diffusion probabilistic model
A diffusion probabilistic model uses a Markov chain to gradually convert one distribution into another: the aim is to build a generative Markov chain which converts a simple distribution into the target data distribution.
Recall that a generative model aims at learning the data distribution $p(x)$, and a Markov chain parameterized by $\theta$ is learned to convert a simple distribution $\pi(\cdot)$ into $p(x)$.
The diffusion process is described by a Markov diffusion kernel $T_\pi(y \mid y'; \beta)$, where $\beta$ is the diffusion rate. The simple distribution $\pi(\cdot)$ is the stationary distribution of this kernel, so starting from the simple distribution we can repeatedly apply the Markov diffusion kernel and remain at $\pi(\cdot)$:

$$\pi(y) = \int dy'\, T_\pi(y \mid y'; \beta)\, \pi(y')$$
The forward trajectory performs $T$ steps of diffusion, transforming the data distribution $q(x^{(0)})$ into a simple distribution $\pi(\cdot)$; the simple distribution can be a Gaussian with identity covariance for the Gaussian diffusion process, or an independent binomial distribution for the binomial diffusion process. The forward diffusion process is formulated as below.
$$q\!\left(x^{(0 \dots T)}\right) = q\!\left(x^{(0)}\right) \prod_{t=1}^{T} q\!\left(x^{(t)} \mid x^{(t-1)}\right)$$
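As a rough sketch (assuming a Gaussian kernel and a linear $\beta$ schedule, which are illustrative choices rather than details fixed by this section), the forward trajectory can be simulated as follows.

```python
import numpy as np

def forward_trajectory(x0, betas, rng):
    """Apply T Gaussian diffusion steps q(x^(t) | x^(t-1)) to a batch x0."""
    x, trajectory = x0, [x0]
    for beta in betas:
        noise = rng.normal(size=x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x)
    return trajectory  # [x^(0), x^(1), ..., x^(T)]

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear diffusion-rate schedule
x0 = rng.normal(size=(16, 2))           # toy 2-D "data" batch
xs = forward_trajectory(x0, betas, rng)
# After enough steps, x^(T) is close to N(0, I) regardless of x^(0).
```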
The reverse trajectory, which learns the reverse of the forward diffusion process, enables the transformation of the simple distribution into the data-generating distribution. We want the final destination of the reverse trajectory, $p(x^{(T)})$, to be the tractable, simple distribution $\pi(x^{(T)})$. Then how can we find the reverse diffusion process? For both Gaussian and binomial diffusion, it is known that for a sufficiently small step size $\beta$, the reversal of the diffusion process has the identical functional form as the forward process.
Since the form of the reverse process is confirmed, it is enough to estimate the parameters of the diffusion kernel for every timestep. For the Gaussian diffusion kernel, the mean and covariance functions $f_\mu(x^{(t)}, t)$ and $f_\Sigma(x^{(t)}, t)$ should be estimated with a neural network. For the binomial diffusion kernel, the bit-flip probability $f_b(x^{(t)}, t)$ is estimated by a neural network. For all results in the paper, multi-layer perceptrons are used to define these functions.
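A hedged sketch of such a parameterization for the Gaussian case, using a small multi-layer perceptron as the paper does (the layer sizes and the way $t$ is fed to the network are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ReverseKernelMLP(nn.Module):
    """Predicts f_mu(x^(t), t) and a per-dimension log-variance for the
    Gaussian reverse kernel p(x^(t-1) | x^(t)). Sizes are illustrative."""

    def __init__(self, data_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * data_dim),  # [mean, log-variance]
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor):
        # Feed the timestep as an extra input feature (a simple illustrative choice).
        h = torch.cat([x_t, t.float().unsqueeze(-1)], dim=-1)
        mean, log_var = self.net(h).chunk(2, dim=-1)
        return mean, log_var
```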
Training: Maximizing the model log likelihood
The training objective of the diffusion probabilistic model, the model log likelihood, is described below.
$$L = \mathbb{E}_{x^{(0)} \sim q(\cdot)}\left[\log p\!\left(x^{(0)}\right)\right] = \int dx^{(0)}\, q\!\left(x^{(0)}\right) \log p\!\left(x^{(0)}\right)$$
This objective contains an intractable integral, but the Jarzynski equality enables its computation by evaluating the relative probability of the forward and reverse trajectories.
A lower bound of the objective is obtained via variational inference, where the bound is constructed with a Jensen's-inequality-style derivation. For simplicity, let's write $L \geq K$, so that $K$ is the lower bound. Then training consists of finding the reverse Markov transition kernel that maximizes this lower bound of the log likelihood:
$$\hat{p}\!\left(x^{(t-1)} \mid x^{(t)}\right) = \underset{p(x^{(t-1)} \mid x^{(t)})}{\arg\max}\; K$$
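For concreteness, the lower bound in Sohl-Dickstein et al. takes the following form (stated here without proof; the derivation is deferred, as noted below):

$$K = \int dx^{(0 \dots T)}\, q\!\left(x^{(0 \dots T)}\right) \log \left[ p\!\left(x^{(T)}\right) \prod_{t=1}^{T} \frac{p\!\left(x^{(t-1)} \mid x^{(t)}\right)}{q\!\left(x^{(t)} \mid x^{(t-1)}\right)} \right]$$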
The mathematical derivation of the lower bound is presented at the end of this article, and it is the underlying concept of the "denoising process" introduced in Ho et al. (2020).
Ho et al. (2020) provide a much more intuitive explanation of diffusion probabilistic models by introducing the concept of denoising, which 'seems' visually analogous to existing deep generative models such as VAEs, GANs, or normalizing flows.
2. Key properties of the diffusion process
Denoising diffusion probabilistic models (DDPM) adopt Gaussian noise for both the forward and reverse diffusion processes.
A key property of the forward process is that we can directly sample $x^{(t)}$ given the starting point $x^{(0)}$ without computing $t$ intermediate timesteps: since each Markov transition kernel is Gaussian, the composition of $t$ consecutive Gaussian transitions is also Gaussian. In particular, we can exploit Gaussian properties to prove that
$$q\!\left(x^{(t)} \mid x^{(0)}\right) = \mathcal{N}\!\left(x^{(t)};\ \sqrt{\bar{\alpha}_t}\, x^{(0)},\ \left(1 - \bar{\alpha}_t\right) I\right)$$
using the notation $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
The equation above is justified by the following derivation. First, we can reparameterize the Markov process in a closed form:
$$q\!\left(x^{(t)} \mid x^{(t-1)}\right) = \mathcal{N}\!\left(x^{(t)};\ \sqrt{1 - \beta_t}\, x^{(t-1)},\ \beta_t I\right) \quad \Longleftrightarrow \quad x^{(t)} = \sqrt{1 - \beta_t}\, x^{(t-1)} + \sqrt{\beta_t}\, \epsilon_{t-1}, \quad \epsilon_{t-1} \sim \mathcal{N}(0, I)$$
This can be applied recursively: substituting the expression for $x^{(t-1)}$ gives $x^{(t)} = \sqrt{\alpha_t \alpha_{t-1}}\, x^{(t-2)} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar{\epsilon}$, where the two independent zero-mean Gaussian noise terms are merged into a single Gaussian $\bar{\epsilon} \sim \mathcal{N}(0, I)$ because their variances add. Repeating this all the way down to $x^{(0)}$ yields the closed form.
Therefore, $q\!\left(x^{(t)} \mid x^{(0)}\right) = \mathcal{N}\!\left(x^{(t)};\ \sqrt{\bar{\alpha}_t}\, x^{(0)},\ \left(1 - \bar{\alpha}_t\right) I\right)$.
Let's call this the nice property.
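A quick numerical sanity check of the nice property (a sketch with an assumed linear $\beta$ schedule): sampling $x^{(t)}$ step by step and sampling it directly from $q(x^{(t)} \mid x^{(0)})$ should give matching statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)      # alpha_bar_t for t = 1..T

t = 500
x0 = np.ones((50_000, 1))                 # many copies of the same x^(0)

# Iterative sampling: apply the Gaussian kernel t times.
x_iter = x0.copy()
for beta in betas[:t]:
    x_iter = np.sqrt(1.0 - beta) * x_iter + np.sqrt(beta) * rng.normal(size=x_iter.shape)

# Direct sampling via the nice property q(x^(t) | x^(0)).
a_bar = alpha_bars[t - 1]
x_direct = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * rng.normal(size=x0.shape)

print(x_iter.mean(), x_direct.mean())     # both ~ sqrt(alpha_bar_t)
print(x_iter.std(), x_direct.std())       # both ~ sqrt(1 - alpha_bar_t)
```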
Another notable property of the forward process concerns the transition kernel conditioned on the initial point $x^{(0)}$: it can be shown mathematically that $q(x^{(t-1)} \mid x^{(t)}, x^{(0)})$, the reverse-direction transition conditioned on $x^{(0)}$, is tractable (it is again a Gaussian with a closed-form mean and variance, shown below).
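For reference, the closed form derived in Ho et al. (2020) is

$$q\!\left(x^{(t-1)} \mid x^{(t)}, x^{(0)}\right) = \mathcal{N}\!\left(x^{(t-1)};\ \tilde{\mu}_t\!\left(x^{(t)}, x^{(0)}\right),\ \tilde{\beta}_t I\right),$$

$$\tilde{\mu}_t\!\left(x^{(t)}, x^{(0)}\right) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x^{(0)} + \frac{\sqrt{\alpha_t}\,\left(1 - \bar{\alpha}_{t-1}\right)}{1 - \bar{\alpha}_t}\, x^{(t)}, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t.$$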
Why are these properties so important? Previously, we mentioned the lower bound of the log likelihood, $K$. Through a mathematical derivation, we can show that the expected negative log likelihood has the upper bound

$$\mathbb{E}_q\left[-\log p\!\left(x^{(0)}\right)\right] \leq \mathbb{E}_q\left[ D_{KL}\!\left(q(x^{(T)} \mid x^{(0)}) \,\|\, p(x^{(T)})\right) + \sum_{t=2}^{T} D_{KL}\!\left(q(x^{(t-1)} \mid x^{(t)}, x^{(0)}) \,\|\, p(x^{(t-1)} \mid x^{(t)})\right) - \log p\!\left(x^{(0)} \mid x^{(1)}\right) \right] =: L_{VLB}$$
The first term, $D_{KL}\!\left(q(x^{(T)} \mid x^{(0)}) \,\|\, p(x^{(T)})\right)$, denoted $L_T$, is constant ($x^{(T)}$ is Gaussian noise and $q(\cdot)$ has no learnable parameters) and thus can be ignored during training.
The second term, a summation of $T-1$ elements, can be represented as $L_{t-1} = D_{KL}\!\left(q(x^{(t-1)} \mid x^{(t)}, x^{(0)}) \,\|\, p(x^{(t-1)} \mid x^{(t)})\right)$ for $t = 2, \dots, T$. So the variational lower bound loss $L_{VLB}$ can be separated into

$$L_{VLB} = L_T + L_{T-1} + \cdots + L_1 + L_0.$$
Parameterization of KL divergence term for training loss design
The main contribution of Ho et al. (2020) is the simple parameterization of the loss terms that take the form of a KL divergence. Recall that we would like to train a neural network to learn the reverse diffusion process, $p(x^{(t-1)} \mid x^{(t)}) = \mathcal{N}\!\left(x^{(t-1)};\ \mu_\theta(x^{(t)}, t),\ \Sigma_\theta(x^{(t)}, t)\right)$, and the KL divergence term $L_{t-1} = D_{KL}\!\left(q(x^{(t-1)} \mid x^{(t)}, x^{(0)}) \,\|\, p(x^{(t-1)} \mid x^{(t)})\right)$ serves exactly that purpose, since the KL divergence evaluates the discrepancy between the two distributions. The first distribution, $q(x^{(t-1)} \mid x^{(t)}, x^{(0)})$, is a Gaussian with mean $\tilde{\mu}_t$ and variance $\tilde{\beta}_t I$, and its mean can be simplified using the nice property.
Thus the loss term $L_{t-1}$ simplifies into an MSE loss if we ignore the weighting constant. Intuitively, reducing the KL divergence between two Gaussians amounts to matching their means (Ho et al. fix the reverse variance to untrained time-dependent constants), and this leads to 'noise' estimation, formulated as

$$L_{t-1}^{\text{simple}} = \mathbb{E}_{x^{(0)}, \epsilon_t}\left[\left\| \epsilon_t - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x^{(0)} + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t,\ t\right) \right\|^2\right].$$

Additionally, the loss term $L_0$ is modeled by a separate discrete decoder so that the final step of the reverse process generates plausible RGB images.
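As a rough sketch of how this simplified noise-prediction objective could be implemented (the schedule, the `eps_model` interface, and all names here are illustrative assumptions, not the paper's reference code):

```python
import torch
import torch.nn.functional as F

betas = torch.linspace(1e-4, 0.02, 1000)        # assumed linear beta schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # alpha_bar_t for t = 1..T

def simple_ddpm_loss(eps_model, x0):
    """Noise-prediction MSE: sample t and epsilon, form x^(t) via the nice
    property, and regress the injected noise. `eps_model(x_t, t)` is any
    network returning a tensor shaped like x_t (a hypothetical interface)."""
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bars.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return F.mse_loss(eps_model(x_t, t), eps)

# Usage (assumed): loss = simple_ddpm_loss(model, batch); loss.backward()
```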
3. Next article
This article tries to explain the concept of DDPM, but the mathematical linkage with score-based generative modeling and Langevin dynamics is not explained here.
Later, we'll reveal the relationship between DDPM and energy-based models, including the noise-conditioned score networks (NCSN) proposed by Song et al.
In the next article, we'll discuss conditional generation with diffusion models using classifier guidance.