[Multimodal_02] Zero-Shot Text-to-Image Generation (2021, OpenAI)

fla1512 · February 27, 2023

Abstract

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset
- These assumptions might involve complex architectures, auxiliary losses, or side information
This paper describes a simple approach to the task based on a transformer
- It autoregressively models the text and image tokens as a single stream of data
With sufficient data and scale, the approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion

1 Introduction

Background

This part covers machine learning approaches to text-to-image synthesis

Mansimov et al. (2015)

Showed that the DRAW generative model, when extended to condition on image captions, could also generate novel visual scenes

Reed et al. (2016)

Demonstrated that using a generative adversarial network, rather than a recurrent variational auto-encoder, improved image fidelity
Showed that the system could not only generate objects with recognizable properties, but also generalize zero-shot to held-out categories

Research continued using a combination of methods

These include improving the generative model architecture with modifications like multi-scale generators, integrating attention and auxiliary losses, and leveraging additional sources of conditioning information beyond just text

Nguyen et al. (2017)

Proposed an energy-based framework for conditional image generation
- Obtained a large improvement in sample quality relative to contemporary methods
- Can incorporate pretrained discriminative models
- Shown to be capable of text-to-image generation when applied to a captioning model pretrained on MS-COCO

Cho et al. (2020)

Proposed a method that involves optimizing the input to a pretrained cross-modal masked language model
Visual fidelity has improved significantly as a result of the work since Mansimov et al. (2015), but samples can still suffer from severe artifacts such as object distortion, illogical object placement, or unnatural blending

Method

Recent advances fueled by large-scale generative models suggest a possible route for further improvement
- Specifically, when compute, model size, and data are scaled carefully, autoregressive transformers have achieved impressive results in several domains such as text, images, and audio

Experiment & Result

By comparison, text-to-image generation has typically been evaluated on relatively small datasets such as MS-COCO and CUB-200
- Could dataset size and model size be the limiting factor of current approaches?
- This paper demonstrates that training a 12-billion parameter autoregressive transformer on 250 million image-text pairs collected from the internet results in a flexible, high fidelity generative model of images controllable through natural language
The resulting system achieves high quality image generation on the MS-COCO dataset zero-shot, without using any of the training labels
- It is preferred over prior work trained on the dataset by human evaluators 90% of the time.
It is also able to perform complex tasks such as image-to-image translation at a rudimentary level
- This previously required custom approaches (Isola et al., 2017), rather than emerging as a capability of a single, large generative model.

2 Method

Goal: train a transformer to autoregressively model the text and image tokens as a single stream of data
However, using pixels directly as image tokens would require an inordinate amount of memory for high-resolution images
Likelihood objectives tend to prioritize modeling short-range dependencies between pixels, so much of the modeling capacity would be spent capturing high-frequency details instead of low-frequency structure
- The low-frequency structure is what makes objects visually recognizable to us

These issues are addressed with a two-stage training procedure

Stage 1

Train a discrete variational autoencoder (dVAE) to compress each 256×256 RGB image into a 32 × 32 grid of image tokens
- Each element can assume 8192 possible values
- This reduces the context size of the transformer by a factor of 192 without a large degradation in visual quality (Fig 1)

  Reading Fig 1
- Comparison of original images (top) and reconstructions from the discrete VAE (bottom)
  - The encoder downsamples the spatial resolution by a factor of 8
  - Details (the texture of the cat's fur, the writing on the storefront, the thin lines in the illustration) are sometimes lost, but the main features of the image are still recognizable
  - The large vocabulary size of 8192 is used to mitigate the loss of information
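The factor of 192 and the downsampling factor of 8 follow directly from the sizes quoted above. A small arithmetic check, assuming the pixel-level context would hold one entry per RGB subpixel (the variable names are only illustrative):

```python
# Sizes quoted in the notes above.
IMAGE_SIDE = 256        # input image is 256x256
CHANNELS = 3            # RGB
GRID_SIDE = 32          # the dVAE encoder outputs a 32x32 token grid

# Assumption: a pixel-level context holds one entry per RGB subpixel.
pixel_positions = IMAGE_SIDE * IMAGE_SIDE * CHANNELS
token_positions = GRID_SIDE * GRID_SIDE

context_reduction = pixel_positions // token_positions
spatial_downsample = IMAGE_SIDE // GRID_SIDE

print(pixel_positions, token_positions)  # 196608 1024
print(context_reduction)                 # 192
print(spatial_downsample)                # 8
```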

Stage 2

Concatenate up to 256 BPE-encoded text tokens
- with the 32×32 = 1024 image tokens
- and train an autoregressive transformer to model the joint distribution over the text and image tokens
The overall procedure can be viewed as maximizing the evidence lower bound (ELB)
- On what?
  - on the joint likelihood of the model distribution
  - over images x, captions y, and the tokens z for the encoded RGB image
This distribution is modeled using the factorization pθ,ψ(x, y, z) = pθ(x | y, z) pψ(y, z),
- which yields a lower bound on the joint likelihood
Note that the bound only holds for β = 1, while in practice we find it helpful to use larger values (Higgins et al., 2016).
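For reference, the factorization and the bound mentioned above can be written as follows (Eq. 1 in the paper). Here qφ is the distribution over the 32×32 image tokens produced by the dVAE encoder, pθ is the distribution over RGB images produced by the dVAE decoder, and pψ is the joint distribution over text and image tokens modeled by the transformer:

```latex
p_{\theta,\psi}(x, y, z) = p_\theta(x \mid y, z)\, p_\psi(y, z)

\ln p_{\theta,\psi}(x, y) \;\geq\;
  \mathbb{E}_{z \sim q_\phi(z \mid x)}
  \Big( \ln p_\theta(x \mid y, z)
        - \beta\, D_{\mathrm{KL}}\big( q_\phi(y, z \mid x),\; p_\psi(y, z) \big) \Big)
```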

2.1 Stage one: Learning the Visual Codebook

In the first stage of training, the ELB is maximized with respect to φ and θ
- This corresponds to training a dVAE on the images alone
The initial prior pψ is set to the uniform categorical distribution over the K = 8192 codebook vectors, and qφ to be categorical distributions parameterized by the 8192 logits at the same spatial position in the 32×32 grid output by the encoder

The ELB now becomes difficult to optimize
- qφ is a discrete distribution, so the reparameterization gradient cannot be used to maximize it
- Oord et al. (2017) and Razavi et al. (2019) address this using an online cluster assignment procedure coupled with the straight-through estimator
- This paper instead uses the gumbel-softmax relaxation
  - The expectation over qφ is replaced with one over qτφ
    - The relaxation becomes tight as the temperature τ → 0
  - The likelihood for pθ is evaluated using the log-laplace distribution (Appendix A.3)
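To make the relaxation concrete, here is a minimal pure-Python sketch of drawing one gumbel-softmax sample from a vector of logits. It is not the authors' implementation: the function name `gumbel_softmax`, the four toy logits (standing in for the 8192 logits at one grid position), and the clamping constant are all assumptions made for illustration.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed one-hot sample from categorical logits at temperature tau."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u uniform in (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 0.0]   # toy stand-in for the logits at one grid position

for tau in (1.0, 1.0 / 16.0):
    sample = gumbel_softmax(logits, tau, rng)
    print(tau, [round(p, 3) for p in sample])
```

Each sample is a probability vector; lowering τ pushes it toward a one-hot vector, which is the sense in which the relaxation tightens as τ → 0.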

The relaxed ELB is maximized using Adam
- with exponentially weighted iterate averaging
Appendix A.2 gives the full hyperparameter details; the following were found to be especially important for stable training
1. Specific annealing schedules for the relaxation temperature and step size
- Annealing τ to 1/16 was sufficient to close the gap between the relaxed validation ELB and the true validation ELB (with qφ instead of qτφ)
2. The use of 1×1 convolutions at the end of the encoder and the beginning of the decoder
- Reducing the receptive field size for the convolutions around the relaxation led to it generalizing better to the true ELB.
3. Multiplication of the outgoing activations from the encoder and decoder resblocks by a small constant, to ensure stable training at initialization.

Increasing the KL weight to β = 6.6 was also found to promote better codebook usage and ultimately lead to a smaller reconstruction error at the end of training

2.2 Stage two: Learning the Prior

In the second stage, φ and θ are fixed, and the prior distribution over the text and image tokens is learned by maximizing the ELB with respect to ψ
- Here pψ is represented by a 12-billion parameter sparse transformer

Given a text-image pair, the lowercased caption is BPE-encoded
- using at most 256 tokens with vocabulary size 16,384, and the image is encoded using 32×32 = 1024 tokens with vocabulary size 8192
The image tokens are obtained using argmax sampling
- from the dVAE encoder logits, without adding any gumbel noise
Finally, the text and image tokens are concatenated and modeled autoregressively as a single stream of data
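The token pipeline described above can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, the text ids are placeholders (no real BPE encoder is shown), and the toy encoder output uses 4 codes per position instead of 8192 to keep the example small.

```python
TEXT_LEN = 256
GRID_SIDE = 32

def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def image_tokens_from_logits(logits_grid):
    """logits_grid[r][c] is a list of per-code logits; returns row-major token ids."""
    return [argmax(cell) for row in logits_grid for cell in row]

def build_stream(text_tokens, image_tokens):
    """Concatenate text tokens followed by image tokens into one sequence."""
    assert len(text_tokens) == TEXT_LEN
    assert len(image_tokens) == GRID_SIDE * GRID_SIDE
    return list(text_tokens) + list(image_tokens)

# Toy encoder output: 4 codes per position instead of 8192 (assumption for the demo).
toy_logits = [[[0.1, 0.9, -0.3, 0.0] for _ in range(GRID_SIDE)] for _ in range(GRID_SIDE)]
image_tokens = image_tokens_from_logits(toy_logits)   # argmax, no gumbel noise added
text_tokens = [0] * TEXT_LEN                          # placeholder text ids
stream = build_stream(text_tokens, image_tokens)
print(len(stream))  # 1280
```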

The transformer is a decoder-only model
- Each image token can attend to all text tokens in any one of its 64 self-attention layers
The full architecture is described in Appendix B.1
Three kinds of self-attention masks are used in the model
- The part of the attention masks corresponding to text-to-text attention is the standard causal mask, and the part for image-to-image attention uses either a row, column, or convolutional attention mask

The length of a text caption is limited to 256 tokens
- though it is not totally clear what to do for the "padding" positions in between the last text token and the start-of-image token
One option is to set the logits for these tokens to −∞ in the self-attention operations
Instead, the authors opt to learn a special padding token
- separately for each of the 256 text positions
- This token is used only when no text token is available
- In preliminary experiments on Conceptual Captions, this resulted in higher validation loss, but better performance on out-of-distribution captions
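One way to picture position-specific padding tokens is to give each of the 256 text slots its own padding id. The sketch below assumes a BPE vocabulary of 16,384 ids as reported in the paper, and the id layout (padding ids placed just past the BPE vocabulary) is purely hypothetical; the notes above do not say how the ids are assigned.

```python
TEXT_LEN = 256
TEXT_VOCAB = 16384   # BPE vocabulary size as reported in the paper

def pad_caption(token_ids):
    """Fill unused caption slots with a padding id that depends on the slot index."""
    if len(token_ids) > TEXT_LEN:
        raise ValueError("caption longer than 256 tokens")
    padded = list(token_ids)
    for position in range(len(token_ids), TEXT_LEN):
        # Hypothetical id layout: one learned padding id per text position.
        padded.append(TEXT_VOCAB + position)
    return padded

caption = [17, 934, 2051]            # made-up BPE ids for a short caption
padded = pad_caption(caption)
print(len(padded))                   # 256
print(padded[:5])                    # [17, 934, 2051, 16387, 16388]
```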

The cross-entropy losses for the text and image tokens are normalized
- by the total number of each kind in a batch of data
Since the primary interest is in image modeling, the two cross-entropy losses are weighted
- cross-entropy loss for the text: multiplied by 1/8
- cross-entropy loss for the image: multiplied by 7/8
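The normalization and weighting above amount to a weighted sum of two per-kind means. A minimal sketch, where the function name and the toy per-token loss values are illustrative assumptions:

```python
def weighted_loss(text_token_losses, image_token_losses):
    """Combine per-token cross-entropy values into one scalar objective."""
    # Normalize each kind by its own token count in the batch.
    text_mean = sum(text_token_losses) / len(text_token_losses)
    image_mean = sum(image_token_losses) / len(image_token_losses)
    # Weight text by 1/8 and image by 7/8.
    return (1.0 / 8.0) * text_mean + (7.0 / 8.0) * image_mean

text_losses = [2.0, 4.0]              # toy per-token values
image_losses = [1.0, 1.0, 1.0, 1.0]
print(weighted_loss(text_losses, image_losses))  # 1.25
```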
The objective is optimized using Adam
- with exponentially weighted iterate averaging
- Appendix B.2 describes the training procedure
About 606,000 images were reserved for validation, and no signs of overfitting were found at convergence

2.3 Data Collection

Preliminary experiments for models up to 1.2 billion parameters were carried out on Conceptual Captions
- a dataset of 3.3 million text-image pairs
- developed as an extension to MS-COCO
To scale up to 12-billion parameters, a dataset of a similar scale to JFT-300M was created
- by collecting 250 million text-image pairs from the internet
- It does not include MS-COCO, but does include Conceptual Captions and a filtered subset of YFCC100M
- Because MS-COCO was created from YFCC100M, the training data includes a fraction of the MS-COCO validation images (but none of the captions)
- This is controlled for in the quantitative results of Section 3, where it is found to have no appreciable bearing on the results
- Further details are in Appendix C

2.4. Mixed-Precision Training

To save GPU memory and increase throughput, most parameters, Adam moments, and activations are stored in 16-bit precision
Activation checkpointing is also used,
and the activations within the resblocks are recomputed during the backward pass
Getting the model to train in 16-bit precision past one billion parameters, without diverging, was the most challenging part of the project

We believe the root cause of this instability to be underflow in the 16-bit gradients. Appendix D presents a set of guidelines we developed to avoid underflow when training large-scale generative models. Here, we describe one of these guidelines: per-resblock gradient scaling.

Similar to prior work (Liu et al., 2020), the norms of the activation gradients from the resblocks were found to decrease monotonically
- as we move from the earlier resblocks to the later ones
As the model is made deeper and wider, the true exponents of the activation gradients for later resblocks can fall below the minimum exponent of the 16-bit format
As a result, they get rounded to zero, a phenomenon called underflow
Eliminating underflow was found to allow for stable training to convergence

Standard loss scaling is able to avoid underflow
- when the range spanned by the smallest and largest activation gradients (in absolute value) fits within the exponent range of the 16-bit format
On NVIDIA V100 GPUs, this exponent range is specified by five bits
This is sufficient for training vanilla language models of the same size, but the range was found to be too small for the text-to-image model

The fix, shown in Fig 4, involves using a separate "gradient scale"
- for each resblock in the model
This can be seen as a practical alternative to a more general framework for mixed-precision training called Flexpoint
- with the advantage that specialized GPU kernels are not required.
We found that Sun et al. (2020) had independently developed a similar procedure for training convolutional networks in 4-bit precision
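The underflow problem and the effect of a per-block scale can be reproduced with Python's `struct` module, whose `e` format rounds a value to IEEE 754 half precision. This is only a toy illustration: the gradient magnitudes and the static power-of-two scales are assumptions, and it does not reproduce the scale-update scheme from the paper.

```python
import struct

def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def scaled_roundtrip(grad, scale):
    """Store grad * scale in 16 bits, then undo the scale in full precision."""
    return to_fp16(grad * scale) / scale

# Toy activation-gradient magnitudes, shrinking from earlier to later resblocks.
grads = [1e-3, 1e-6, 1e-9]
# Hypothetical per-resblock scales: larger where the gradients are smaller.
scales = [2.0 ** 0, 2.0 ** 10, 2.0 ** 20]

for g, s in zip(grads, scales):
    print(g, to_fp16(g), scaled_roundtrip(g, s))
```

Without scaling, the smallest gradient rounds to zero in 16 bits; with its own scale it survives the round trip. A single global scale large enough for the last block could push the largest gradients out of range, which is the motivation for one scale per resblock.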

2.5. Distributed Optimization

The 12-billion parameter model consumes about 24 GB of memory
- when stored in 16-bit precision
  * This exceeds the memory of a 16 GB NVIDIA V100 GPU
This is addressed using parameter sharding
- As shown in Fig 5, parameter sharding makes it possible to almost completely hide the latency of the intra-machine communication by overlapping it with compute-intensive operations
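The memory figure is simple arithmetic. The check below uses decimal gigabytes and assumes, for illustration, that the parameters are split evenly across the eight GPUs of one machine; optimizer state and activations are not counted.

```python
PARAMS = 12_000_000_000
BYTES_PER_PARAM = 2          # 16-bit precision
GPU_MEMORY_GB = 16
GPUS_PER_MACHINE = 8         # assumption: even sharding across one machine's GPUs

total_gb = PARAMS * BYTES_PER_PARAM / 1e9
per_gpu_gb = total_gb / GPUS_PER_MACHINE

print(total_gb)                      # 24.0
print(total_gb > GPU_MEMORY_GB)      # True
print(per_gpu_gb)                    # 3.0
```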

On the cluster used to train the model, the bandwidth between machines is much lower than the bandwidth among GPUs on the same machine
- This makes the cost of the operation used to average the gradient among the machines (all-reduce) the main bottleneck during training
  - This cost was drastically reduced by compressing the gradients using PowerSGD

In this implementation, each GPU in a machine computes the low-rank factors
- for its parameter shard gradients independently of its neighboring GPUs.
Once the low-rank factors are computed,
- each machine sets its error buffer to the residual between the uncompressed gradient averaged over its eight GPUs (obtained from reduce-scatter), and the decompressed gradient obtained from the low-rank factors.
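The low-rank factors and the error buffer can be illustrated with a single-process sketch of one PowerSGD-style compression step. This is a simplification under stated assumptions: there is no communication between GPUs or machines, the matrices are tiny Python lists, the helper names are made up, and orthogonalization uses Gram-Schmidt for brevity.

```python
import math
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def orthonormalize_columns(m):
    """Gram-Schmidt on the columns of m (chosen here only for brevity)."""
    basis = []
    for col in transpose(m):
        v = list(col)
        for q in basis:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return transpose(basis)

def powersgd_step(grad, error, q):
    """One compression step: returns (decompressed gradient, new error buffer)."""
    m = [[g + e for g, e in zip(gr, er)] for gr, er in zip(grad, error)]  # error feedback
    p = orthonormalize_columns(matmul(m, q))      # first low-rank factor
    q_new = matmul(transpose(m), p)               # second low-rank factor
    approx = matmul(p, transpose(q_new))          # decompressed gradient
    residual = [[a - b for a, b in zip(mr, ar)] for mr, ar in zip(m, approx)]
    return approx, residual

rng = random.Random(0)
u, v = [1.0, 2.0, 3.0], [4.0, 5.0]
grad = [[a * b for b in v] for a in u]            # exactly rank 1, shape 3x2
error = [[0.0, 0.0] for _ in u]
q = [[rng.gauss(0.0, 1.0)] for _ in v]            # random Gaussian Q, rank r = 1

approx, error = powersgd_step(grad, error, q)
print(max(abs(x) for row in error for x in row))  # close to 0 for a rank-1 gradient
```

For a gradient that is not low rank, the residual is nonzero and is carried into the next step through the error buffer, so the information dropped by compression is delayed rather than lost.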

PowerSGD replaces the large communication operation for an uncompressed parameter gradient with two, much smaller communication operations for its low-rank factors
For a given compression rank r and transformer activation size dmodel, the compression rate is given by 1 − 5r/(8dmodel) (see Appendix E.1)
Table 1 shows that we can achieve a compression rate of about 85%, independent of model size.
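The formula is easy to evaluate directly. In the sketch below, the (dmodel, r) pairs are chosen for illustration only and should not be read as the exact rows of Table 1:

```python
def compression_rate(rank, d_model):
    """Fraction of gradient communication saved: 1 - 5r / (8 * d_model)."""
    return 1.0 - (5.0 * rank) / (8.0 * d_model)

# (d_model, rank) pairs chosen for illustration only.
for d_model, rank in [(1920, 512), (2688, 640), (3968, 896)]:
    print(d_model, rank, round(compression_rate(rank, d_model), 3))
```

Because the rate depends only on the ratio r/dmodel, growing the rank in proportion to the model width keeps the rate roughly constant.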

Appendix E.2 lists the steps needed to make PowerSGD perform well at scale
- Saving memory by accumulating the gradient into the error buffers during backpropagation, rather than allocating separate buffers.
- Minimizing instances in which we zero out the error buffers (e.g., due to nonfinite values encountered during mixed-precision backpropagation, or when resuming training from a checkpoint).
- Improving numerical stability by using Householder orthogonalization instead of Gram-Schmidt, together with the addition of a small multiple of the identity matrix to the input.
- Avoiding underflow by using a custom 16-bit floating point format for the error buffers, their low-rank factors, and the all-reduce communication operations involving them.

The warm-start procedure for the Q matrix was found to be unnecessary
- we were able to get equivalent results by fixing Q to a random gaussian matrix at the start of training, and never updating it.

2.6. Sample Generation

Samples drawn from the transformer are reranked
- using a pretrained contrastive model
Given a caption and a candidate image, the contrastive model assigns a score based on how well the image matches the caption
Fig 6 shows the effect of increasing the number of samples N from which the top k images are selected
- This process can be seen as a kind of language-guided search
- and is also similar to the auxiliary text-image matching loss
- Unless otherwise stated, all samples used for both qualitative and quantitative results are obtained without temperature reduction (i.e., using t = 1)
  - and use reranking with N = 512
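Reranking reduces to scoring every candidate against the caption and keeping the best k. A minimal sketch, where `toy_score` is a hypothetical stand-in for the pretrained contrastive model and the candidates are tagged dictionaries rather than images:

```python
def rerank(candidates, score_fn, caption, k):
    """Return the k candidates with the highest caption-match score."""
    ranked = sorted(candidates, key=lambda image: score_fn(caption, image), reverse=True)
    return ranked[:k]

def toy_score(caption, image):
    """Hypothetical stand-in for a contrastive model: counts shared words."""
    return len(set(caption.split()) & set(image["tags"]))

caption = "a red cube on a table"
candidates = [
    {"id": 0, "tags": ["blue", "sphere"]},
    {"id": 1, "tags": ["red", "cube", "table"]},
    {"id": 2, "tags": ["red", "table"]},
]
top = rerank(candidates, toy_score, caption, k=2)
print([image["id"] for image in top])  # [1, 2]
```

In the setting described above, N = 512 candidates would be sampled from the transformer and the top k kept.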

3. Experiments

3.1. Quantitative Result

Our model is evaluated zero-shot by comparing it to three prior approaches
- AttnGAN (Xu et al., 2018), DM-GAN (Zhu et al., 2019), and DF-GAN, the last of which reports the best Inception Score (Salimans et al., 2016) and Fréchet Inception Distance (Heusel et al., 2017) on MS-COCO.

A human evaluation is conducted
- to compare our approach to DF-GAN, the results of which are shown in Figure 7
Given a caption, the sample from our model receives the majority vote
- for better matching the caption 93% of the time
- It also receives the majority vote for being more realistic 90% of the time

Result: our model also obtains an FID score on MS-COCO within 2 points of the best prior approach, despite having never been trained on the captions
The training data incorporates a filtered subset of YFCC100M,
- which was found to include about 21% of the images in the MS-COCO validation set, using a de-duplication procedure described in the next section
To isolate this effect, we compute the FID statistics for the validation set both with these images (solid lines) and without them (dashed lines), finding no significant change in the results

3.2. Data Overlap Analysis

3.3. Qualitative Findings

4 Conclusion

We investigated a simple approach for text-to-image generation
- based on an autoregressive transformer
- when it is executed at scale
We found that scale can lead to improved generalization
- both in terms of zero-shot performance relative to previous domain-specific approaches
- and in terms of the range of capabilities
  - that emerge from a single generative model
Our findings suggest that improving generalization as a function of scale may be a useful driver for progress on this task

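Paper Review

Good Point

Shortcomings

There is no related work section -> probably because the intro already walks through the prior research in five detailed steps
Compared to other papers, I found it relatively hard to pick out the Background, the limitations of prior work, this paper's Approach to overcoming them, the Method, the Experiment & Result, and the Contribution from the intro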
