[Paper Review] High-Resolution Image Synthesis with Latent Diffusion Models (LDM, 2022)

김민서 · July 11, 2024


Diffusion Models
- Address the weaknesses of earlier models (autoregressive models, GANs)
  - no mode collapse or training instabilities, unlike GANs
- SOTA performance
- Operate directly on RGB images, so all computation happens in a high-dimensional pixel space
  - both training and inference are slow and waste compute
departure to latent space
- Learning can be roughly divided into two stages
  1) perceptual compression: the stage that removes high-frequency details while learning almost no semantic variation
  2) semantic compression: the stage that learns the semantic and conceptual composition of the data, i.e. what we usually think of as 'learning'
- basic idea: let's find a perceptually equivalent, but computationally more suitable space
Latent Diffusion Models
- two distinct phases
  1) train an autoencoder that maps to a low-dimensional representational space that is perceptually equivalent to the data space
  2) train a diffusion model in the learned latent space
- The universal autoencoder only has to be trained once and can then be reused across multiple DMs

An encoder/decoder pair maps an image $x\ \in\ \mathbb{R}^{H\times W\times 3}$ to a latent $z\ \in\ \mathbb{R}^{h\times w\times c}$ and back
- $z=\mathcal{E}(x)$
- $\tilde{x}=\mathcal{D}(z)=\mathcal{D}(\mathcal{E}(x))$
- The encoder can be seen as downsampling the image by a factor of $f= \frac{H}{h}= \frac{W}{w}$
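As a quick worked example of the downsampling factor (the numbers are illustrative assumptions, not values taken from this review): for a $256\times 256$ RGB input with $f=8$ and $c=4$ latent channels,

$$
h = w = \frac{256}{8} = 32, \qquad
\frac{H \cdot W \cdot 3}{h \cdot w \cdot c} = \frac{256 \cdot 256 \cdot 3}{32 \cdot 32 \cdot 4} = \frac{196608}{4096} = 48
$$

so under these assumptions the diffusion model would process 48 times fewer values per image than it would in pixel space.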
To keep the latent space from having arbitrarily high variance, the authors experiment with two kinds of regularization
- KL-reg: impose a slight KL penalty that pulls the learned latent toward a standard normal
- VQ-reg: use a vector quantization layer within the decoder
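To make the KL-reg option concrete, here is a minimal sketch of the penalty. It assumes the encoder outputs a diagonal Gaussian (a mean and a log-variance per latent element), as in a standard VAE. The function name `kl_to_standard_normal`, the toy encoder outputs, and the weight `kl_weight = 1e-6` are assumptions for illustration, not settings taken from this review.

```python
import math

def kl_to_standard_normal(mean, logvar):
    """KL( N(mean, exp(logvar)) || N(0, 1) ), summed over latent elements."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, logvar))

# Hypothetical encoder outputs for a 4-element latent.
mean = [0.5, -1.0, 0.0, 2.0]
logvar = [0.0, 0.2, -0.3, 0.1]

kl = kl_to_standard_normal(mean, logvar)
kl_weight = 1e-6  # assumed small weight: the penalty only nudges the latent scale
penalty = kl_weight * kl
```

The penalty is zero exactly when the latent already matches a standard normal (zero mean, unit variance), and a small weight keeps it from overriding the reconstruction objective.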
autoencoder training
- See Appendix C for details
- Adversarial training: a discriminator $D_{\psi}$ learns to tell $x$ apart from the reconstruction $\mathcal{D}(\mathcal{E}(x))$

Standard DDPM loss: $L_{DM}=\mathbb{E}_{x,\epsilon \sim \mathcal{N}(0,1),t} \left[||\epsilon - \epsilon_{\theta}(x_{t},t) ||_{2}^{2} \right]$
LDM replaces $x_{t}$ with $z_{t}$, which lives in the latent space: $L_{LDM}=\mathbb{E}_{\mathcal{E}(x),\epsilon \sim \mathcal{N}(0,1),t} \left[||\epsilon - \epsilon_{\theta}(z_{t},t) ||_{2}^{2} \right]$
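The sketch below walks through one Monte Carlo sample of $L_{LDM}$ in plain Python. It assumes the standard DDPM forward process $z_{t}=\sqrt{\bar{\alpha}_{t}}\,z_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$ with a linear beta schedule, which this review does not spell out. `eps_model` is a hypothetical stand-in for $\epsilon_{\theta}$ (it always predicts zero), not a trained UNet, and the schedule constants are assumed defaults.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(z0, eps, t):
    """Forward process: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    abar = alpha_bar(t)
    return [math.sqrt(abar) * z + math.sqrt(1.0 - abar) * e for z, e in zip(z0, eps)]

def eps_model(z_t, t):
    """Hypothetical placeholder for the denoising UNet: always predicts zero noise."""
    return [0.0 for _ in z_t]

def ldm_loss_sample(z0, rng):
    """One sample of || eps - eps_theta(z_t, t) ||^2 for a flattened latent z0."""
    t = rng.randint(1, 1000)                  # t drawn uniformly from {1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in z0]   # eps ~ N(0, I)
    z_t = add_noise(z0, eps, t)
    pred = eps_model(z_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

rng = random.Random(0)
z0 = [0.3, -1.2, 0.8, 0.1]  # stand-in for a flattened latent z = E(x)
loss = ldm_loss_sample(z0, rng)
```

The only difference from the pixel-space loss is what gets noised: here `z0` plays the role of $\mathcal{E}(x)$ rather than the image itself.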
advantages
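1) The model can focus on the important, semantic bits of the data
2) Computation happens in a lower-dimensional space, so it is more efficient
Unlike earlier work that ran autoregressive, attention-based transformers in a highly compressed, discrete latent space, LDM can exploit image-specific inductive biases: the latent keeps its 2D spatial structure, so the UNet can be built mostly from 2D convolutional layers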

conditioning
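- As with other generative models, the model can learn $p(x|y)$ with a condition $y$ instead of the data distribution $p(x)$
- Use $\epsilon_{\theta}(z_{t},t,y)$ instead of $\epsilon_{\theta}(z_{t},t)$
- So far, though, conditioning DMs on anything beyond class labels or blurred versions of the input image has been under-explored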
cross-attention mechanism
- Train a domain-specific encoder $\tau_{\theta}$ so that the model works whatever the modality of $y$ is
  - project $y$ to an intermediate representation $\tau_{\theta}(y) \in \mathbb{R}^{M \times d_{\tau}}$
- $\tau_{\theta}(y)$ is then mapped into the intermediate layers of the UNet through cross-attention layers
  $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}} \right) \cdot V$
  $Q=W_{Q}^{(i)}\cdot \phi_{i}(z_{t}),\ \ \ K=W_{K}^{(i)}\cdot \tau_{\theta}(y),\ \ \ V=W_{V}^{(i)}\cdot \tau_{\theta}(y)$
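  - Here $\phi_{i}(z_{t}) \in \mathbb{R}^{N\times d_{\epsilon}^{i}}$ is a (flattened) intermediate representation of the UNet that implements $\epsilon_{\theta}$
  - $W_{Q}^{(i)} \in \mathbb{R}^{d\times d_{\epsilon}^{i}}$ , $W_{K}^{(i)} \in \mathbb{R}^{d\times d_{\tau}}$ , $W_{V}^{(i)} \in \mathbb{R}^{d\times d_{\tau}}$ are all learnable projection matrices (as printed in the paper, the shapes of $W_{Q}$ and $W_{V}$ are swapped; since $Q$ projects $\phi_{i}(z_{t})$ and $V$ projects $\tau_{\theta}(y)$, these are the shapes that make the products well defined)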
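A minimal sketch of one cross-attention layer using nested lists, to make the shapes concrete. All values are toy numbers, and the helper names (`matmul`, `transpose`, `softmax`, `cross_attention`) are assumptions for illustration. Rows are treated as tokens, so each projection is applied as `features @ W^T`; the shapes are chosen so the products are defined: `w_q` is $d \times d_{\epsilon}$, while `w_k` and `w_v` are $d \times d_{\tau}$.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(phi, tau, w_q, w_k, w_v):
    """phi: (N x d_eps) UNet features, tau: (M x d_tau) condition tokens."""
    q = matmul(phi, transpose(w_q))   # (N x d), queries come from the UNet
    k = matmul(tau, transpose(w_k))   # (M x d), keys come from the condition
    v = matmul(tau, transpose(w_v))   # (M x d), values come from the condition
    d = len(q[0])
    scores = matmul(q, transpose(k))  # (N x M)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights  # output is (N x d)

# Toy sizes: N=3 spatial positions, d_eps=2, M=2 condition tokens, d_tau=3, d=2.
phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tau = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]            # (d x d_eps)
w_k = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # (d x d_tau)
w_v = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]  # (d x d_tau)

out, weights = cross_attention(phi, tau, w_q, w_k, w_v)
```

Each of the N UNet positions ends up with a weighted mix of the M condition tokens, which is how a condition of any length can be injected into a feature map of fixed spatial size.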
Objective
$L_{LDM}=\mathbb{E}_{\mathcal{E}(x),y,\epsilon \sim \mathcal{N}(0,1),t} \left[||\epsilon - \epsilon_{\theta}(z_{t},t,\tau_{\theta}(y)) ||_{2}^{2} \right]$
- $\epsilon_{\theta}$ and $\tau_{\theta}$ are optimized jointly
- The design of $\tau_{\theta}$ is left open (e.g. a transformer when $y$ is a text prompt)

Experiments compare downsampling factors $f \in \{1,2,4,8,16,32 \}$
- LDM-1 = a conventional pixel-space DM
Results
- LDM-{1,2} trained slowly
- LDM-32 compressed so aggressively that the resulting information loss caused a large drop in quality
- LDM-{4,8} was the sweet spot
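A rough way to see the trade-off is to count how many values the diffusion model has to process per image at each $f$. The sketch assumes a $256\times 256$ input and a fixed $c=4$ latent channels for every $f>1$ (LDM-1 works on the 3 RGB channels directly). The channel counts in the actual experiments may differ, so treat the printed numbers as illustrative only; `latent_shape` is a hypothetical helper.

```python
def latent_shape(height, width, f, channels):
    """Latent size after downsampling by f; f must divide both sides."""
    if height % f or width % f:
        raise ValueError("f must divide the image height and width")
    return height // f, width // f, channels

H = W = 256
pixel_values = H * W * 3
for f in (1, 2, 4, 8, 16, 32):
    c = 3 if f == 1 else 4  # assumed channel counts, for illustration only
    h, w, c = latent_shape(H, W, f, c)
    n = h * w * c
    print(f"LDM-{f}: {h}x{w}x{c} = {n} values ({pixel_values / n:.1f}x fewer than pixels)")
```

Small $f$ leaves the model with nearly as many values as the raw image, while large $f$ shrinks the latent so much that the autoencoder has little room left to preserve detail.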