[Paper Review] Deep Unsupervised Learning using Nonequilibrium Thermodynamics (DPM, 2015)

김민서 · July 7, 2024

0. Abstract

  • "Diffusion Model"을 최초로 제안한 논문
  • non-equilibrium statistical physics에 착안함

1. Introduction

tractability & flexibility tradeoff

  • tradeoff in probabilistic models
  • models that are tractable
    • e.g. Gaussian or Laplace
    • can be analytically evaluated and easily fit to data
    • but struggle to model very large or richly structured datasets
  • models that are flexible
    • can be molded to fit structure in arbitrary data
    • we can define models in terms of any (non-negative) function $\phi(\mathrm{x})$ yielding the flexible distribution $p(\mathrm{x})=\frac{\phi(\mathrm{x})}{Z}$
    • because computing the normalization constant $Z$ is intractable, training, evaluating, or running inference with flexible models requires a very expensive Monte Carlo process (a rough cost illustration follows this list)
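Below is a minimal sketch, not from the paper, of why the normalizing constant $Z$ is the bottleneck: it estimates $Z=\int \phi(\mathrm{x})\,d\mathrm{x}$ for a made-up 1-D unnormalized density by naive Monte Carlo. The function `phi` and the uniform proposal are assumptions for illustration only.

```python
# Hypothetical example: estimating Z = ∫ phi(x) dx for an arbitrary non-negative phi(x)
# by Monte Carlo with a uniform proposal. Cost grows rapidly with dimension, which is
# the tractability/flexibility tradeoff this section describes.
import numpy as np

def phi(x):
    # made-up unnormalized 1-D density (any non-negative function works)
    return np.exp(-x ** 2 / 2) * (1.0 + np.sin(3 * x) ** 2)

def estimate_Z(num_samples=100_000, lo=-10.0, hi=10.0, seed=0):
    # Z ≈ (hi - lo) * E[phi(u)] with u ~ Uniform(lo, hi)
    rng = np.random.default_rng(seed)
    u = rng.uniform(lo, hi, size=num_samples)
    return (hi - lo) * phi(u).mean()

print(estimate_Z())  # a noisy estimate of Z; every likelihood evaluation needs it
```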

1.1. Diffusion probabilistic models (DPM)

  • presents a DPM that allows
    1. extreme flexibility in model structure
    2. exact sampling
    3. easy multiplication with other distributions (e.g. to compute a posterior)
    4. cheap evaluation of the log likelihood and of the probability of individual states

2. Algorithm

2.0. Overview

  • First define a forward diffusion process that maps the target (data) distribution to a simple, known (Gaussian) distribution
  • Then learn a finite-time reversal of this diffusion process
  • also derive entropy bounds

2.1. Forward Trajectory

  • data dist. $q(\mathrm{x}^{(0)})$ --> tractable dist. $\pi(\mathrm{y})$ by repeatedly applying a Markov diffusion kernel $T_{\pi}(\mathrm{y}|\mathrm{y}';\beta)$ for $\pi(\mathrm{y})$, where $\beta$ is the diffusion rate
    $$\pi(\mathrm{y})=\int d\mathrm{y}'\ T_{\pi}(\mathrm{y}|\mathrm{y}';\beta)\ \pi(\mathrm{y}')$$
    $$\begin{aligned} \mathrm{Forward\ Diffusion\ kernel:}\ q(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)})&=T_{\pi}(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)};\beta_{t}) \\ &= \mathcal{N}(\mathrm{x}^{(t)};\mathrm{x}^{(t-1)}\sqrt{1-\beta_{t}},\ \mathrm{I}\beta_{t}) \ \mathrm{if\ Gaussian} \end{aligned}$$
    $$\mathrm{Forward\ Trajectory:}\ q(\mathrm{x}^{(0 \dots T)})=q(\mathrm{x}^{(0)})\ \prod_{t=1}^{T}\ q(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)})$$
  • $q(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)})$ is either a Gaussian diffusion kernel (into a Gaussian with identity covariance) or a binomial diffusion kernel (into an independent binomial distribution); a sampling sketch follows this list
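A minimal sketch of the Gaussian forward trajectory, $\mathrm{x}^{(t)} \sim \mathcal{N}(\mathrm{x}^{(t-1)}\sqrt{1-\beta_t},\ \mathrm{I}\beta_t)$. The toy data point and the constant $\beta$ schedule are assumptions for illustration, not values from the paper.

```python
# Sample a forward trajectory x^(0) -> x^(1) -> ... -> x^(T) with the Gaussian kernel.
import numpy as np

def forward_trajectory(x0, betas, seed=0):
    """Returns an array of shape (T+1, *x0.shape) holding x^(0), ..., x^(T)."""
    rng = np.random.default_rng(seed)
    xs = [np.asarray(x0, dtype=float)]
    for beta_t in betas:
        x_prev = xs[-1]
        noise = rng.standard_normal(x_prev.shape)
        xs.append(np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise)
    return np.stack(xs)

T = 40
betas = np.full(T, 0.05)        # assumed constant diffusion rate, for illustration only
x0 = np.array([1.5, -0.5])      # a single toy 2-D data point
traj = forward_trajectory(x0, betas)
print(traj[0], traj[-1])        # x^(T) is close to a standard normal sample for large T
```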

2.2. Reverse Trajectory

$$p(\mathrm{x}^{(T)})=\pi(\mathrm{x}^{(T)})$$
$$\mathrm{Reverse\ Diffusion\ kernel:}\ p(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)}) = \mathcal{N}(\mathrm{x}^{(t-1)};\mathrm{f}_{\mu}(\mathrm{x}^{(t)},t),\ \mathrm{f}_{\Sigma}(\mathrm{x}^{(t)},t)) \ \mathrm{if\ Gaussian}$$
$$\mathrm{Reverse\ Trajectory:}\ p(\mathrm{x}^{(0 \dots T)})=p(\mathrm{x}^{(T)})\ \prod_{t=1}^{T}\ p(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)})$$
  • For Gaussian or binomial diffusion in the continuous-diffusion limit (i.e. sufficiently small step size $\beta$), the reverse process has the same functional form as the forward process (Feller, 1949)
  • During training we therefore learn (see the sketch after this list)
    • Gaussian: the mean $\mathrm{f}_{\mu}(\mathrm{x}^{(t)},t)$ and covariance $\mathrm{f}_{\Sigma}(\mathrm{x}^{(t)},t)$ of each kernel
    • Binomial: the bit-flip probability $\mathrm{f}_{b}(\mathrm{x}^{(t)},t)$ of each kernel
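A minimal sketch of sampling the reverse trajectory in the Gaussian case, starting from $\mathrm{x}^{(T)} \sim \pi = \mathcal{N}(0, \mathrm{I})$. The functions `f_mu` and `f_sigma` below are dummy stand-ins for the learned per-step mean and (diagonal) covariance, not the paper's trained networks.

```python
# Sample x^(T) -> x^(T-1) -> ... -> x^(0) from the reverse diffusion kernels.
import numpy as np

def f_mu(x, t):
    # placeholder for the learned mean f_mu(x^(t), t); here an arbitrary shrink toward 0
    return 0.95 * x

def f_sigma(x, t):
    # placeholder for the learned (diagonal) standard deviation f_Sigma(x^(t), t)
    return 0.05 * np.ones_like(x)

def reverse_trajectory(shape, T, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)              # x^(T) ~ pi(x) = N(0, I)
    for t in range(T, 0, -1):
        x = f_mu(x, t) + f_sigma(x, t) * rng.standard_normal(shape)
    return x                                     # a sample from the model's p(x^(0))

print(reverse_trajectory(shape=(2,), T=40))
```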

2.3. Model Probability

  • The probability the generative model assigns to the data is
    $$p(\mathrm{x}^{(0)})=\int d\mathrm{x}^{(1 \dots T)}\, p(\mathrm{x}^{(0 \dots T)})$$
  • This integral is intractable as written, but it can be rewritten as follows
    - based on annealed importance sampling & Jarzynski equality
    - instead evaluate the relative probability of forward & reverse trajectories, averaged over forward trajectories
    $$\begin{aligned} p(\mathrm{x}^{(0)})&=\int d\mathrm{x}^{(1 \dots T)}\, p(\mathrm{x}^{(0 \dots T)})\, \frac{q(\mathrm{x}^{(1 \dots T)}|\mathrm{x}^{(0)})}{q(\mathrm{x}^{(1 \dots T)}|\mathrm{x}^{(0)})}\\ &=\int d\mathrm{x}^{(1 \dots T)}\, q(\mathrm{x}^{(1 \dots T)}|\mathrm{x}^{(0)})\, \frac{p(\mathrm{x}^{(0 \dots T)})}{q(\mathrm{x}^{(1 \dots T)}|\mathrm{x}^{(0)})}\\ &=\int d\mathrm{x}^{(1 \dots T)}\, q(\mathrm{x}^{(1 \dots T)}|\mathrm{x}^{(0)})\ \cdot\ p(\mathrm{x}^{(T)})\prod_{t=1}^{T} \frac{p(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)})}{q(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)})} \end{aligned}$$
  • This can be evaluated rapidly by averaging over samples from the forward trajectory q(x(1T)x(0))q(\mathrm{x}^{(1⋯T)}|\mathrm{x}^{(0)})
  • For infinitesimal $\beta$,
    • the forward and reverse distributions over trajectories become identical
    • then only a single sample from $q(\mathrm{x}^{(1 \dots T)}|\mathrm{x}^{(0)})$ is required to evaluate the above integral (a numerical sketch follows this list)
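A minimal sketch of the estimator above for a 1-D toy case: sample forward trajectories from $q$, compute the ratio $p(\mathrm{x}^{(T)})\prod_t p(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)})/q(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)})$ for each, and average. The reverse kernel `log_p_rev` is an assumed stand-in for a learned model, not the paper's.

```python
# Importance-sampling estimate of log p(x^(0)), averaging over forward trajectories.
import numpy as np
from scipy.stats import norm

T, beta = 40, 0.05

def log_q(x_t, x_prev):                  # forward kernel N(sqrt(1-b) x_prev, b)
    return norm.logpdf(x_t, loc=np.sqrt(1 - beta) * x_prev, scale=np.sqrt(beta))

def log_p_rev(x_prev, x_t):              # assumed reverse kernel (stand-in for a learned one)
    return norm.logpdf(x_prev, loc=np.sqrt(1 - beta) * x_t, scale=np.sqrt(beta))

def log_p_estimate(x0, n_samples=256, seed=0):
    rng = np.random.default_rng(seed)
    log_ratios = []
    for _ in range(n_samples):
        xs = [x0]
        for _ in range(T):               # sample a forward trajectory x^(1..T)
            xs.append(np.sqrt(1 - beta) * xs[-1] + np.sqrt(beta) * rng.standard_normal())
        lr = norm.logpdf(xs[-1])         # log p(x^(T)) with p(x^(T)) = N(0, 1)
        for t in range(1, T + 1):
            lr += log_p_rev(xs[t - 1], xs[t]) - log_q(xs[t], xs[t - 1])
        log_ratios.append(lr)
    # log of the average trajectory ratio over forward samples
    return np.logaddexp.reduce(log_ratios) - np.log(n_samples)

print(log_p_estimate(0.3))
```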

2.4. Training

  • Training amounts to maximizing the model log likelihood
    $$\begin{aligned} L&=\int d\mathrm{x}^{(0)}\, q(\mathrm{x}^{(0)})\ \log p(\mathrm{x}^{(0)})\\ &= \int d\mathrm{x}^{(0)}\, q(\mathrm{x}^{(0)})\ \cdot \log\left[\int d\mathrm{x}^{(1 \dots T)}\, q(\mathrm{x}^{(1 \dots T)}|\mathrm{x}^{(0)})\ \cdot p(\mathrm{x}^{(T)})\prod_{t=1}^{T} \frac{p(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)})}{q(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)})}\right] \end{aligned}$$
  • which has a lower bound provided by Jensen's inequality,
    $$\begin{aligned} L\geq &\int d\mathrm{x}^{(0 \dots T)}\, q(\mathrm{x}^{(0 \dots T)})\ \cdot \log\left[p(\mathrm{x}^{(T)})\prod_{t=1}^{T} \frac{p(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)})}{q(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)})}\right]=K \end{aligned}$$
  • As shown in Appendix B, $K$ can be rearranged into entropies and KL divergences that can be computed analytically
    $$\begin{aligned} K&=-\sum_{t=1}^{T} \int d\mathrm{x}^{(0)} d\mathrm{x}^{(t)}\, q(\mathrm{x}^{(0)},\mathrm{x}^{(t)})\ \cdot D_{KL}\!\left(q(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)}, \mathrm{x}^{(0)})\ \|\ p(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)})\right)\\ &+ H_{q}(\mathrm{X}^{(T)}|\mathrm{X}^{(0)}) - H_{q}(\mathrm{X}^{(1)}|\mathrm{X}^{(0)}) - H_{p}(\mathrm{X}^{(T)}) \end{aligned}$$
  • When $\beta$ is small enough that the forward and reverse processes become essentially identical, the bound is tight: $L = K$
  • Training consists of finding the reverse Markov transitions that maximize $K$
    $$\hat{p}(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)})=\underset{p(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)})}{\mathrm{argmax}}\ K$$
  • Probability distribution estimation is thus reduced to a regression task on the functions defining a sequence of Gaussians (a sketch of the per-step KL term follows this list)
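A minimal sketch, my own illustration rather than the paper's code, of the per-step term in $K$: for the Gaussian case each term is a KL divergence between two Gaussians, which has a closed form for diagonal covariances. The toy means and variances below are placeholders.

```python
# Closed-form KL divergence between diagonal Gaussians, the building block of K.
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# toy stand-ins for q(x^(t-1) | x^(t), x^(0)) and the model kernel p(x^(t-1) | x^(t))
mu_q, var_q = np.array([0.20, -0.10]), np.array([0.05, 0.05])
mu_p, var_p = np.array([0.25, -0.05]), np.array([0.06, 0.06])
print(kl_diag_gaussians(mu_q, var_q, mu_p, var_p))
# K sums such terms (with a minus sign) over t, averaged over q(x^(0), x^(t)),
# plus the entropy terms above.
```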

2.4.1. Setting the Diffusion Rate $\beta_{t}$

  • The $\beta$ schedule has a large effect on model performance
  • For Gaussian diffusion, $\beta_{2 \dots T}$ are learned via gradient ascent on $K$
    - however, $\beta_{1}$ is fixed to a small constant to prevent overfitting
  • For binomial diffusion, the schedule is frozen to $\beta_{t}=(T-t+1)^{-1}$
  • Follow-up work suggests that a fixed schedule works better for the Gaussian case as well (a schedule sketch follows this list)
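A small sketch of two schedules: the paper's frozen binomial schedule $\beta_t = (T-t+1)^{-1}$, and, as an assumption borrowed from follow-up work, a fixed linear schedule for the Gaussian case (the endpoint values are illustrative only).

```python
# Two beta schedules over T steps.
import numpy as np

T = 1000

# binomial schedule from the paper, t = 1..T
betas_binomial = 1.0 / (T - np.arange(1, T + 1) + 1)

# fixed linear schedule (illustrative values, not from the 2015 paper)
betas_linear = np.linspace(1e-4, 2e-2, T)

print(betas_binomial[:3], betas_binomial[-3:])   # starts near 1/T, ends at 1
print(betas_linear[:3], betas_linear[-3:])
```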

2.5. Multiplying Distributions, and Computing Posteriors

2.5.1. Modified Marginal Distributions

2.5.2. Modified Diffusion Steps

2.5.3. Applying $r(\mathrm{x}^{(t)})$

2.5.4. Choosing $r(\mathrm{x}^{(t)})$

2.6. Entropy of Reverse Process

3. Experiments

3.1. Toy Problems

  • 3.1.1. Swiss Roll

    • Learns a 2-D Swiss roll distribution

    • An RBF network produces the mean $\mathrm{f}_{\mu}(\mathrm{x}^{(t)},t)$ and covariance $\mathrm{f}_{\Sigma}(\mathrm{x}^{(t)},t)$

    • The third row of the figure corresponds to the drift term

  • 3.1.2. Binary Heartbeat Distribution

    • Trained on simple binary sequences of length 20
      • a 1 occurs every 5th time bin (the remaining bins are 0)
    • An MLP produces the Bernoulli rates $\mathrm{f}_{b}(\mathrm{x}^{(t)},t)$ (a data-generation sketch follows this list)
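A minimal sketch, using a hypothetical helper of my own, of the "binary heartbeat" training data: length-20 binary sequences in which a 1 occurs every 5th time bin and the remaining bins are 0, assuming the phase is chosen uniformly at random.

```python
# Generate a batch of binary heartbeat sequences.
import numpy as np

def heartbeat_batch(batch_size=8, length=20, period=5, seed=0):
    rng = np.random.default_rng(seed)
    offsets = rng.integers(0, period, size=batch_size)   # random phase per sequence
    t = np.arange(length)
    return (t[None, :] % period == offsets[:, None]).astype(np.int8)

print(heartbeat_batch(batch_size=2))
```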

3.2. Images

  • multi-scale convolutional architectures are used
  • MNIST, CIFAR-10, Dead Leaf Images, Bark Texture Images
  • generation & inpainting tasks

Appendix

A. Conditional Entropy Bounds Derivation

B. Log Likelihood Lower Bound

  • initial lower bound of the log likelihood

    $$K=\int d\mathrm{x}^{(0 \dots T)}\, q(\mathrm{x}^{(0 \dots T)})\ \log\left[p(\mathrm{x}^{(T)})\prod_{t=1}^{T} \frac{p(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)})}{q(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)})}\right]$$
  • B.1. Entropy of $p(\mathrm{X}^{(T)})$

  • B.2. Remove the edge effect at $t=0$

  • B.3. Rewrite in terms of the posterior $q(\mathrm{x}^{(t-1)}|\mathrm{x}^{(0)})$ (the identity used is sketched after this list)
  • B.4. Rewrite in terms of KL divergences and entropies
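The key manipulation behind B.3 is a standard application of the Markov property of the forward trajectory followed by Bayes' rule; a sketch of the identity, written in the document's notation:

$$q(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)}) = q(\mathrm{x}^{(t)}|\mathrm{x}^{(t-1)},\mathrm{x}^{(0)}) = \frac{q(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)},\mathrm{x}^{(0)})\ q(\mathrm{x}^{(t)}|\mathrm{x}^{(0)})}{q(\mathrm{x}^{(t-1)}|\mathrm{x}^{(0)})}$$

Substituting this into the ratio inside $K$ is what produces the per-step KL divergences against $q(\mathrm{x}^{(t-1)}|\mathrm{x}^{(t)},\mathrm{x}^{(0)})$ shown in Section 2.4.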

C. Perturbed Gaussian Transition
