Diffusion : Road to DDPM

Junha Park · February 4, 2023

Series: Advanced Computer Vision (1/2)

This article covers the fundamentals of diffusion models.

1. Fundamentals of diffusion process

Reference: Sohl-Dickstein et al., "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (2015)

Tradeoff between tractability & flexibility

Tractable models can be analytically evaluated and easily fit to data, but they cannot describe the complex structure of rich datasets. On the other hand, flexible models can fit arbitrary data structures; however, when we represent a flexible distribution as $p(x)=\frac{\phi(x)}{Z}$, the normalization constant $Z$ is generally intractable: evaluating it typically requires a computationally expensive Monte Carlo procedure.
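To make the intractability of $Z$ concrete, here is a minimal sketch (my own illustration, not from the paper) of estimating the normalization constant of an unnormalized density by Monte Carlo; the density `phi` and the Gaussian proposal are arbitrary choices. In one dimension this is cheap, but the number of samples needed grows rapidly with dimension, which is exactly what makes flexible models expensive to normalize.

```python
import numpy as np

# Importance-sampling estimate of Z = ∫ phi(x) dx for an unnormalized density phi.
# Both phi and the standard-normal proposal are illustrative assumptions.
rng = np.random.default_rng(0)

def phi(x):
    return np.exp(-x**4 / 4.0)                       # unnormalized density phi(x)

def q_pdf(x):
    return np.exp(-x**2 / 2.0) / np.sqrt(2 * np.pi)  # proposal density q(x) = N(0, 1)

xs = rng.standard_normal(100_000)                    # samples from the proposal
Z_hat = np.mean(phi(xs) / q_pdf(xs))                 # Z = E_q[phi(x) / q(x)]
print(f"estimated Z ≈ {Z_hat:.4f}")
```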

Diffusion probabilistic model

A diffusion probabilistic model uses a Markov chain to gradually convert one distribution into another: the aim is to build a generative Markov chain that converts a simple distribution into the target data distribution.
Recall that a generative model aims at learning the data distribution $p(x)$; a Markov chain parameterized by $\theta$ is learned to convert the simple distribution $q(\cdot)$ into $p(x)$.

The diffusion process is described by a Markov diffusion kernel $T_\pi(y|y';\beta)$, where $\beta$ is the diffusion rate. Starting from the simple distribution, we can repeatedly apply the Markov diffusion kernel:

$$\pi(y)= \int T_\pi(y|y';\beta)\,\pi(y')\,dy', \qquad q(x^{(t)}|x^{(t-1)})=T_\pi(x^{(t)}|x^{(t-1)};\beta_t)$$

The forward trajectory performs $T$ steps of diffusion, transforming the data distribution $q(x^{(0)})$ into a simple distribution. The simple distribution can be a Gaussian with identity covariance for the Gaussian diffusion process, or an independent binomial distribution for the binomial diffusion process. The forward diffusion process is formulated as below.

$$q(x^{(0,...,T)}) =q(x^{(0)})\prod_{t=1}^T q(x^{(t)}|x^{(t-1)})$$
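As a concrete illustration, here is a minimal sketch (my own, not the paper's code) of a Gaussian forward trajectory: starting from a data sample, the Gaussian diffusion kernel is applied $T$ times. The linear $\beta_t$ schedule is an assumption for illustration only.

```python
import torch

# Forward trajectory: T applications of the Gaussian kernel
# q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # diffusion rates beta_t (assumed schedule)

def forward_trajectory(x0: torch.Tensor) -> list:
    xs = [x0]
    x = x0
    for t in range(T):
        noise = torch.randn_like(x)
        x = torch.sqrt(1.0 - betas[t]) * x + torch.sqrt(betas[t]) * noise
        xs.append(x)
    return xs                               # x_T is approximately N(0, I)
```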

The reverse trajectory, learned as the reverse of the forward diffusion process, enables transformation of the simple distribution back into the data-generating distribution. We want the final destination of the reverse trajectory, $p(x^{(T)})$, to be the tractable, simple distribution $\pi(x^{(T)})$. Then how can we find the reverse diffusion process? For both Gaussian and binomial diffusion, it is known that for sufficiently small step size $\beta$, the reversal of the diffusion process has the identical functional form as the forward process.

Since the form of the reverse process is fixed, it suffices to estimate the parameters of the diffusion kernel at every timestep. For the Gaussian diffusion kernel, $f_\mu(x^{(t)},t)$ and $f_\Sigma(x^{(t)},t)$ are estimated with a neural network; for the binomial diffusion kernel, $f_b(x^{(t)},t)$ is estimated. For all results in the paper, multi-layer perceptrons are used to define these functions.
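As a rough sketch of what estimating the kernel parameters with an MLP could look like (a minimal version of my own, not the paper's architecture), the network below takes $x^{(t)}$ together with an embedded timestep and predicts a mean and a diagonal log-variance:

```python
import torch
import torch.nn as nn

# Time-conditioned MLP predicting f_mu(x_t, t) and a diagonal log-variance for f_Sigma(x_t, t).
class ReverseKernelMLP(nn.Module):
    def __init__(self, dim: int, hidden: int = 128, T: int = 1000):
        super().__init__()
        self.t_embed = nn.Embedding(T, hidden)            # simple learned timestep embedding
        self.net = nn.Sequential(
            nn.Linear(dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * dim),                   # outputs [mu, log_var]
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor):
        h = torch.cat([x_t, self.t_embed(t)], dim=-1)
        mu, log_var = self.net(h).chunk(2, dim=-1)
        return mu, log_var
```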

Training: Maximizing the model log likelihood

The training objective of the diffusion probabilistic model is the log likelihood, described below.

$$L = \mathbb{E}_{x^{(0)}\sim q(\cdot)}[\log p(x^{(0)})] = \int dx^{(0)}\, q(x^{(0)})\log p(x^{(0)})$$

This objective involves an intractable integral, but the Jarzynski equality enables its computation by evaluating the relative probability of the forward and reverse trajectories.

Remark) The Jarzynski equality gives
$$p(x^{(0)})=\int dx^{(1,...,T)}\, q(x^{(1,...,T)}|x^{(0)})\frac{p(x^{(0,...,T)})}{q(x^{(1,...,T)}|x^{(0)})} = \int dx^{(1,...,T)}\, q(x^{(1,...,T)}|x^{(0)})\, p(x^{(T)})\prod_{t=1}^T\frac{p(x^{(t-1)}|x^{(t)})}{q(x^{(t)}|x^{(t-1)})}$$

Thus the objective function becomes

$$L=\int dx^{(0)}\, q(x^{(0)})\log p(x^{(0)}) = \int dx^{(0)}\, q(x^{(0)})\cdot \log\left[\int dx^{(1,...,T)}\, q(x^{(1,...,T)}|x^{(0)})\cdot p(x^{(T)})\prod_{t=1}^T\frac{p(x^{(t-1)}|x^{(t)})}{q(x^{(t)}|x^{(t-1)})}\right]$$

A lower bound of the objective function is obtained for variational inference; the lower bound is constructed with a Jensen-inequality-style derivation. For simplicity, assume $L\geq K$ holds, so that $K$ is the lower bound. Training then consists of finding the reverse Markov transition kernel that maximizes this lower bound of the log likelihood.

$$\hat{p}(x^{(t-1)}|x^{(t)})=\argmax_{p(x^{(t-1)}|x^{(t)})}K$$
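For reference, here is a sketch of the Jensen step that produces such a bound (the standard variational argument, not a line-by-line reproduction of the paper's appendix). Since $\log$ is concave and $q(x^{(1,...,T)}|x^{(0)})$ is a probability distribution,

$$L = \int dx^{(0)}\, q(x^{(0)})\log\left[\int dx^{(1,...,T)}\, q(x^{(1,...,T)}|x^{(0)})\, p(x^{(T)})\prod_{t=1}^T\frac{p(x^{(t-1)}|x^{(t)})}{q(x^{(t)}|x^{(t-1)})}\right] \\ \geq \int dx^{(0,...,T)}\, q(x^{(0,...,T)})\log\left[p(x^{(T)})\prod_{t=1}^T\frac{p(x^{(t-1)}|x^{(t)})}{q(x^{(t)}|x^{(t-1)})}\right] =: K$$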

The mathematical derivation of the lower bound is presented at the end of this article; this is the concept underlying the "denoising process" introduced in the paper by Ho et al. (2020).

2. Denoising diffusion probabilistic models: intuitive explanation

Ho et al. (2020) provide a much more intuitive explanation of diffusion probabilistic models by introducing the concept of denoising, which 'seems' visually analogous to existing deep generative models such as VAEs, GANs, or normalizing flows.

Key properties of diffusion process

Denoising diffusion probabilistic models (DDPM) adopt Gaussian noise for both the forward and reverse diffusion processes.

$$q(x^{(t)}|x^{(t-1)})=\mathcal{N}(x^{(t)};\sqrt{1-\beta_t}\,x^{(t-1)},\beta_t\mathbf{I}), \qquad p_\theta(x^{(t-1)}|x^{(t)})=\mathcal{N}(x^{(t-1)};\mu_\theta(x^{(t)},t),\Sigma_\theta(x^{(t)},t))$$
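As a rough sketch of how the learned reverse kernel would be used to generate samples (ancestral sampling; the `model` interface returning a mean and log-variance is my assumption, not the paper's exact parameterization):

```python
import torch

# Ancestral sampling with p_theta(x_{t-1} | x_t): start from x_T ~ N(0, I) and step backwards.
@torch.no_grad()
def reverse_sample(model, shape, T: int = 1000) -> torch.Tensor:
    x = torch.randn(shape)                                     # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        mu, log_var = model(x, t_batch)                        # parameters of the reverse Gaussian
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mu + torch.exp(0.5 * log_var) * noise
    return x                                                   # approximate sample from p(x_0)
```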

A key property of the forward process is that we can sample $x^{(t)}$ directly from a given $x^{(0)}$ without computing $t$ timesteps: since each Markov transition kernel is Gaussian, the composition of $t$ consecutive Gaussian transitions is also Gaussian. In particular, we can exploit Gaussian properties to prove that

$$q(x^{(t)}|x^{(0)})=\mathcal{N}(x^{(t)};\sqrt{\bar{\alpha}_t}\,x^{(0)},(1-\bar{\alpha}_t)\mathbf I)$$

using the notation $\alpha_t=1-\beta_t$, $\bar{\alpha}_t=\prod_{s=1}^t\alpha_s$.

The equation above is justified by the following derivation. First, we can reparametrize the Markov transition in closed form:
$$q(x^{(t)}|x^{(t-1)})=\mathcal{N}(x^{(t)};\sqrt{1-\beta_t}\,x^{(t-1)},\beta_t\mathbf{I}) \;\Longleftrightarrow\; x^{(t)}=\sqrt{1-\beta_t}\,x^{(t-1)}+\sqrt{\beta_t}\,\epsilon_{t-1}, \quad \epsilon_{t-1}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$$

Applying this recursively, and recalling that a sum of independent Gaussians is again Gaussian, we get

$$x^{(t)}=\sqrt{\alpha_t}\,x^{(t-1)}+\sqrt{1-\alpha_t}\,\epsilon_{t-1} \\ =\sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,x^{(t-2)}+\sqrt{1-\alpha_{t-1}}\,\epsilon_{t-2}\right)+\sqrt{1-\alpha_t}\,\epsilon_{t-1} \\ =\sqrt{\alpha_t\alpha_{t-1}}\,x^{(t-2)}+\sqrt{1-\alpha_t\alpha_{t-1}}\,\epsilon'_{t-2} = \cdots = \sqrt{\bar{\alpha}_t}\,x^{(0)}+\sqrt{1-\bar{\alpha}_t}\,\epsilon$$

Therefore, $q(x^{(t)}|x^{(0)})=\mathcal{N}(x^{(t)};\sqrt{\bar{\alpha}_t}\,x^{(0)},(1-\bar{\alpha}_t)\mathbf{I})$.
Let's call this the nice property.
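A minimal sketch of the nice property in code (the schedule and tensor shapes are illustrative assumptions): sampling $x^{(t)}$ in one shot from $x^{(0)}$.

```python
import torch

# Direct sampling x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)                  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    eps = torch.randn_like(x0)
    ab = alpha_bars.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch
    return torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps
```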

Another notable property of the forward process concerns the transition kernel conditioned on the initial point $x^{(0)}$. It can be shown that $q(x^{(t-1)}|x^{(t)},x^{(0)})$, the reverse-direction transition conditioned on $x^{(0)}$, is tractable:

$$q(x^{(t-1)}|x^{(t)},x^{(0)})=\mathcal{N}(x^{(t-1)};\tilde\mu_t(x^{(t)},x^{(0)}),\tilde{\beta}_t\mathbf I), \text{ where } \\ \tilde{\mu}_t(x^{(t)},x^{(0)})=\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}x^{(0)}+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x^{(t)}, \qquad \tilde{\beta}_t=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$$
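A sketch of this conditioned kernel in code (again with an assumed linear $\beta_t$ schedule), returning $\tilde{\mu}_t$ and $\tilde{\beta}_t$:

```python
import torch

# Posterior q(x_{t-1} | x_t, x_0): closed-form Gaussian mean and variance.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_posterior(x0: torch.Tensor, xt: torch.Tensor, t: int):
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
    mean = (torch.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)) * x0 \
         + (torch.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)) * xt
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]
    return mean, var                                       # \tilde{mu}_t and \tilde{beta}_t
```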

Why are these properties so important? Previously, we mentioned the lower bound $K$ of the log likelihood. Through a mathematical derivation, we can show that the expected negative log likelihood has an upper bound of

$$\mathbb{E}_{q(\cdot)}\left[D_{KL}(q(x^{(T)}|x^{(0)})\,\|\,p(x^{(T)})) + \sum_{t>1}D_{KL}(q(x^{(t-1)}|x^{(t)},x^{(0)})\,\|\,p_\theta(x^{(t-1)}|x^{(t)}))-\log p_\theta(x^{(0)}|x^{(1)})\right]$$

The first term, $D_{KL}(q(x^{(T)}|x^{(0)})\,\|\,p(x^{(T)}))$, is denoted $L_T$; it is constant ($x^{(T)}$ is Gaussian noise and $q(\cdot)$ has no learnable parameters) and can therefore be ignored during training.
The second term, a summation of $T-1$ elements, can be written as $L_{t-1}=D_{KL}(q(x^{(t-1)}|x^{(t)},x^{(0)})\,\|\,p_\theta(x^{(t-1)}|x^{(t)}))$ for $t=2, ..., T$. So the variational lower bound loss $L_{VLB}$ can be separated into

$$L_T+L_{T-1}+...+L_{0}, \text{ where} \\ L_T = D_{KL}(q(x^{(T)}|x^{(0)})\,\|\,p(x^{(T)})) \\ L_t = D_{KL}(q(x^{(t)}|x^{(t+1)},x^{(0)})\,\|\,p_\theta(x^{(t)}|x^{(t+1)})) \\ L_0 = -\log p_\theta(x^{(0)}|x^{(1)})$$

Parameterization of KL divergence term for training loss design

The main contribution of Ho et al. (2020) is the simple parameterization of the loss term expressed as a KL divergence. Recall that we would like to train a neural network to learn the reverse diffusion process, $p_\theta(x^{(t-1)}|x^{(t)})=\mathcal{N}(x^{(t-1)};\mu_\theta(x^{(t)},t),\Sigma_\theta(x^{(t)},t))$, and the KL divergence term $L_t$ is responsible for that purpose.
$L_t= D_{KL}(q(x^{(t)}|x^{(t+1)},x^{(0)})\,\|\,p_\theta(x^{(t)}|x^{(t+1)}))$, and the KL divergence evaluates the discrepancy between two distributions. The first distribution, $q(x^{(t)}|x^{(t+1)},x^{(0)})$, is Gaussian with mean $\tilde{\mu}_t$ and variance $\tilde{\beta}_t\mathbf{I}$, and its mean is simplified by the nice property.

$$\tilde{\mu}_t(x^{(t)},x^{(0)})=\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}x^{(0)}+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x^{(t)}, \qquad x^{(t)}=\sqrt{\bar{\alpha}_t}\,x^{(0)}+\sqrt{1-\bar{\alpha}_t}\,\epsilon \\ \Rightarrow\; \tilde{\mu}_t=\frac{1}{\sqrt{\alpha_t}}\left(x^{(t)}-\frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon(x^{(t)},t)\right)$$
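The arrow above hides one substitution; spelled out (my own intermediate step, following the standard derivation), the nice property gives $x^{(0)}=\frac{1}{\sqrt{\bar{\alpha}_t}}\left(x^{(t)}-\sqrt{1-\bar{\alpha}_t}\,\epsilon\right)$, and plugging this into $\tilde{\mu}_t$ yields

$$\tilde{\mu}_t=\left[\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{(1-\bar{\alpha}_t)\sqrt{\bar{\alpha}_t}}+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\right]x^{(t)}-\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t\sqrt{1-\bar{\alpha}_t}}{(1-\bar{\alpha}_t)\sqrt{\bar{\alpha}_t}}\,\epsilon =\frac{1}{\sqrt{\alpha_t}}x^{(t)}-\frac{1-\alpha_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}}\,\epsilon$$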

Thus the loss term $L_t$ simplifies to an MSE loss if we ignore the weighting constant. Intuitively, reducing the KL divergence between two Gaussians amounts to predicting the mean and variance of the reverse diffusion process, which leads to 'noise' estimation, formulated as $L_t=\mathbb{E}_{x_0,\epsilon_t}\left[\|\epsilon_t-\epsilon_\theta(\sqrt{\bar{\alpha}_t}\,x^{(0)}+\sqrt{1-\bar{\alpha}_t}\,\epsilon_t,\, t)\|^2\right]$. Additionally, the term $L_0$ is modeled by a separate discrete decoder so that the final step of the reverse process generates plausible RGB images.
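A minimal sketch of this simplified objective as a training step (the $\epsilon$-predicting `model` and the $\beta_t$ schedule are assumptions; Ho et al. use a U-Net, which is not reproduced here):

```python
import torch
import torch.nn.functional as F

# Simplified DDPM loss: noise x_0 with the nice property at a random timestep
# and regress the injected noise with an MSE.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)            # uniformly sampled timestep
    eps = torch.randn_like(x0)
    ab = alpha_bars.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps     # sample from q(x_t | x_0)
    return F.mse_loss(model(x_t, t), eps)                      # ||eps - eps_theta(x_t, t)||^2
```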

3. Next article

This article tries to explain the concept of DDPM, but the mathematical link with score-based generative modeling and Langevin dynamics has not been explained.

Later, we'll be able to reveal the relationship between DDPM and energy-based models, including the noise-conditioned score networks (NCSN) proposed by Song et al.

In the next article, we'll discuss conditional generation with diffusion models using classifier guidance.
