Text-To-Image Tutorial (CVPR2020)

murlocKing · January 30, 2022

Computer Vision Multimodal NLP

category
- text-to-image synthesis
  - StackGAN, AttnGAN, TAGAN, ObjGAN
- text-to-video synthesis
  - GAN-based methods, VAE-based methods, StoryGAN
- Dialogue-based Image Synthesis
  - ChatPainter, CoDraw, SeqAttnGAN
generative models
- mostly GANs or VAEs are used
GAN
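- an input z drawn from a normal distribution passes through the generator G to produce a sample G(z), which should be as similar as possible to the real data x
- in other words, training pushes the distribution of generated samples toward the distribution of the real data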
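For reference, the usual way to write this down is the standard GAN minimax objective. It is not spelled out in these notes; it is the common textbook formulation, where $p_{\text{data}}$ is the real data distribution, $p_z$ is the normal prior on z, and D is the discriminator.

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

D tries to score real x high and generated G(z) low, while G tries to make G(z) indistinguishable from x.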
VAE
- a VAE is an autoencoder whose encoding distribution is regularised during training, so that its latent space has good properties for generating new data
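The regularisation mentioned above is commonly implemented as a KL term that pulls the encoder's Gaussian toward a standard normal. The sketch below assumes a diagonal Gaussian encoder that outputs `mu` and `log_var`; the function names are illustrative and not from the tutorial.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a single latent vector."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


# The KL term is zero only when the encoder output is exactly N(0, I).
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```

In a full VAE this term is added to the reconstruction loss, which is what keeps the latent space smooth enough to sample new data from.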
text-to-image synthesis
- the text data is concatenated with z, drawn from a normal distribution, and fed in as input; the generative model turns this into an image, and the generated image then passes through the discriminator network. Inside the discriminator, the text data that was given as input is concatenated with the feature vector of the image, and the result is discriminated against real images
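The two concatenation steps described above can be sketched with plain lists. This only shows how the inputs are assembled; the generator and discriminator networks themselves are omitted, and the helper names `generator_input` and `discriminator_input` are hypothetical.

```python
import random


def generator_input(text_embedding, noise_dim, rng=random):
    """Concatenate noise z ~ N(0, I) with the text embedding."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + list(text_embedding)


def discriminator_input(image_features, text_embedding):
    """Concatenate the image feature vector with the same text embedding."""
    return list(image_features) + list(text_embedding)


text = [0.1, 0.2, 0.3]            # assumed output of some text encoder
g_in = generator_input(text, noise_dim=4)
d_in = discriminator_input([1.0, 2.0], text)
print(len(g_in), len(d_in))
```

Feeding the text to both networks is what lets the discriminator judge whether an image matches its description, not just whether it looks real.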
StackGAN
- stage 1
  - generates a 64x64 image
  - structural information
  - low detail
- stage 2
  - requires the output of stage 1
  - upsamples to 256x256
  - high detail, photo-realistic appearance
- both stages must receive the same text input
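The data flow between the two stages can be sketched at the level of shapes. The real stages are learned networks; here `stage1` and `stage2` are hypothetical placeholders, and nearest-neighbour upsampling stands in for the learned refinement, purely to show that stage 2 consumes both the stage 1 output and the same text embedding.

```python
def stage1(text_embedding, noise, size=64):
    """Placeholder for the stage 1 generator: returns a coarse size x size image."""
    base = sum(text_embedding) + sum(noise)  # stand-in for "coarse structure"
    return [[base for _ in range(size)] for _ in range(size)]


def upsample_nearest(image, factor):
    """Repeat every pixel `factor` times along both axes."""
    return [
        [px for px in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def stage2(low_res, text_embedding, factor=4):
    """Placeholder for stage 2: takes stage 1's output AND the same text input."""
    up = upsample_nearest(low_res, factor)
    detail = sum(text_embedding) / max(len(text_embedding), 1)
    return [[px + detail for px in row] for row in up]


text = [0.1, 0.2, 0.3]
low = stage1(text, noise=[0.0, 0.5])
high = stage2(low, text)
print(len(low), len(low[0]), len(high), len(high[0]))
```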
AttnGAN
- given the words of a natural language description, it looks more closely at the relevant words (attention)
- captures both sentence-level global information and word-level information
- AttnGAN is better at generating detailed object information
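The word-level attention can be illustrated with a small dot-product attention sketch: for one image region, score every word, normalise the scores with a softmax, and build a context vector as the weighted sum of word vectors. This assumes region and word features already share a dimension (the actual model learns a projection for that), and the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_context(region, words):
    """Attend over word vectors for a single image-region feature."""
    scores = [sum(r * w for r, w in zip(region, word)) for word in words]
    weights = softmax(scores)
    dim = len(words[0])
    context = [
        sum(a * word[d] for a, word in zip(weights, words)) for d in range(dim)
    ]
    return weights, context


# The region feature aligns with the first word, so that word gets more weight.
weights, context = word_context([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0]])
print(weights, context)
```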
MirrorGAN
- using a semantic-preserving text-to-image-to-text framework
text-to-image synthesis
- the current trend follows StackGAN and AttnGAN
  - generation quality is very good on the CUB and flowers datasets
  - but performance is not yet good on complex datasets such as COCO
- what evaluation?
  - IS, FID and human evaluation
- technical challenges
  - how to handle a very large vocabulary
  - how to generate diverse objects and model the relations between them
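Of the metrics listed above, the Inception Score (IS) is simple enough to sketch: it is the exponential of the average KL divergence between each image's predicted class distribution p(y|x) and the marginal p(y). In practice the probabilities come from an Inception classifier run on generated images; here they are passed in directly, which is an assumption made to keep the example self-contained.

```python
import math


def inception_score(probs):
    """IS = exp( mean_x KL( p(y|x) || p(y) ) ) for a list of class distributions."""
    n = len(probs)
    k = len(probs[0])
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    kl_sum = 0.0
    for p in probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        kl_sum += sum(
            pc * math.log(pc / marginal[c]) for c, pc in enumerate(p) if pc > 0
        )
    return math.exp(kl_sum / n)


confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(confident_and_diverse))
print(inception_score(uninformative))
```

A higher score means predictions are confident per image and varied across images. FID instead compares feature statistics of real and generated images and is not shown here.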
ObjGAN
- Object-centered text-to-image synthesis for complex scenes
Object Pathways
- using a separate net to model the objects/relations
Text-Adaptive GAN (TAGAN)
- manipulates an image using a natural language description
ManiGAN
- consists of the ACM (text-image affine combination module) and the DCM (detail correction module)
Text-to-video synthesis
- the task is to generate a sequence of images given a text description
- text input, generated gist, generated video
T2V
- VAE framework combining the text and gist information
TFGAN
- a GAN with a multi-scale text-conditioning scheme based on convolutional filter generation
StoryGAN
- short story (sequence of sentences) → sequence of images
- given several sentences, each sentence is generated as one frame image
- just as several sentences together form a story, several images together form a video
Dialogue-based Image synthesis
- the image is modified using dialogue text rather than just a single line of text
Chat-crowd
- dialog-based platform for visual layout composition
Neural Painter
- at each time step, a sentence is sampled at random and backprop goes only through the GAN
ChatPainter
- a new dataset for the image generation task based on multi-turn dialogues
CoDraw
- a teller and a drawer talk back and forth, and the drawer generates an image based on the messages from the teller
SeqAttnGAN
- an extension of AttnGAN that uses sequential attention
- two new datasets → Zap-Seq and DeepFashion-Seq
Text(dialogue)-to-video synthesis
- several attempts have been made recently
- some preliminary results have been shown
- good benchmarks exist
- new evaluation
- consistent generation, disentangled learning, compositional generation
