[Paper Review-story, video] StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

kiteday · May 10, 2024


https://storydiffusion.github.io/

This work comes from ByteDance, so I expected only a paper, but they also released code and a Hugging Face demo (no telling how long it will stay up).

Introduction

The model focuses on consistent image generation and long-sequence video generation.

It is a training-free, zero-shot approach that proposes Consistent Self-Attention, a modification of the self-attention in the UNet of existing diffusion-based generation models.

Method

The method is split into an image part and a video part.

Training-free image generation with consistent SA

The standard attention is:

$$O_i = \mathrm{Attention}(Q_i, K_i, V_i)$$

The query, key, and value of each image all come from that image itself, so self-attention is computed within each image.

In contrast, the Consistent Self-Attention proposed in the paper attaches a RandSample module to this.
RandSample samples tokens from the other image frames in the batch $I = [I_1, I_2, ...]$, excluding $I_i$ itself. The result is written $S_i$.

$$S_i = \mathrm{RandSample}(I_1, I_2, \ldots, I_{i-1}, I_{i+1}, \ldots, I_{B-1}, I_B)$$

The sampled tokens $S_i$ are combined with the tokens of the input image $I_i$ to form the set $P_i$ (the paired token set).

The key idea of Consistent SA is to use $P_i$ in the attention: the query still comes from $I_i$, while the key and value are computed from $P_i$.

$$O_i = \mathrm{Attention}(Q_i, K_{P_i}, V_{P_i})$$

This attention is applied to every image in the batch.
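To make the equations above concrete, here is a minimal NumPy sketch of how I read them. It is not the authors' code: the function names, the `sample_rate` parameter, and the projection matrices `w_q`, `w_k`, `w_v` are all assumptions for illustration. Since the method is training-free, the sketch reuses the same projections for the original and the sampled tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q: (N_q, d), k and v: (N_k, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def consistent_self_attention(tokens, w_q, w_k, w_v, sample_rate=0.5, rng=None):
    """tokens: (B, N, C) image tokens for a batch of B >= 2 images.

    sample_rate (hypothetical parameter) sets how many tokens are drawn
    from the other images, as a fraction of N.
    """
    rng = np.random.default_rng() if rng is None else rng
    B, N, _ = tokens.shape
    if B < 2:
        raise ValueError("need at least two images in the batch")
    n_sample = int(sample_rate * N)
    outputs = []
    for i in range(B):
        # Pool the tokens of every image except I_i, then sample S_i from the pool.
        others = np.concatenate([tokens[j] for j in range(B) if j != i], axis=0)
        idx = rng.choice(others.shape[0], size=n_sample, replace=False)
        s_i = others[idx]
        # P_i: the tokens of I_i paired with the sampled tokens S_i.
        p_i = np.concatenate([tokens[i], s_i], axis=0)
        # Query from I_i only; key and value from P_i.
        q = tokens[i] @ w_q
        k = p_i @ w_k
        v = p_i @ w_v
        outputs.append(attention(q, k, v))
    return np.stack(outputs)
```

With `sample_rate=0.0` nothing is sampled, so $P_i$ is just the tokens of $I_i$ and the function reduces to the standard self-attention shown earlier.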

Algorithm

Semantic motion predictor for video generation

The Semantic Motion Predictor is the structure proposed for generating video while preserving spatial information in the image semantic space.

It works in three steps: 1) image encoding → 2) semantic space prediction → 3) video generation.

image encoding

The encoder $E$, which maps an RGB image to a vector in the image semantic space, is a pretrained CLIP model.

$$K_s, K_e = E(F_s, F_e)$$

The start frame $F_s$ and the end frame $F_e$ are encoded with the pretrained CLIP.

semantic space prediction

$$P_1, P_2, \ldots, P_L = B(K_1, K_2, \ldots, K_L)$$

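The encoded $K_s$ and $K_e$ are first linearly interpolated into a length-$L$ sequence $K_1, \ldots, K_L$. Transformer blocks $B$ then model the relationship between frames, and the outputs $P_i$ are used as image prompts.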
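The equation above takes a length-$L$ sequence as input, while the encoder only produces two embeddings. One way to bridge the two, as I understand the paper, is linear interpolation between $K_s$ and $K_e$. Below is a rough sketch of that step only. The function name is hypothetical, and the transformer blocks $B$ are not sketched here.

```python
import numpy as np

def interpolate_embeddings(k_s, k_e, length):
    """Linearly interpolate from k_s to k_e.

    Returns an array of shape (length, *k_s.shape) whose first entry is
    k_s and whose last entry is k_e (for length >= 2).
    """
    k_s = np.asarray(k_s, dtype=np.float64)
    k_e = np.asarray(k_e, dtype=np.float64)
    # One weight per frame, shaped to broadcast over the embedding dimensions.
    weights = np.linspace(0.0, 1.0, length).reshape(-1, *([1] * k_s.ndim))
    return (1.0 - weights) * k_s + weights * k_e
```

The resulting sequence would then be passed through $B$ to get $P_1, \ldots, P_L$.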

video generation

$$V_i = \mathrm{CrossAttention}(V_i, \mathrm{concat}(T, P_i), \mathrm{concat}(T, P_i))$$

The image prompt embeddings $P$ from the previous step and the text prompt embedding $T$ are concatenated and fed to cross attention as key and value, which predicts each video frame feature $V_i$.
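A small NumPy sketch of this cross attention, to show where the concatenation happens. The projection matrices `w_q`, `w_k`, `w_v` and all shapes are assumptions of mine. The equation only states that `concat(T, P_i)` serves as both key and value.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(v_i, text_emb, image_prompt, w_q, w_k, w_v):
    """v_i: (N, C) frame features, text_emb: (M, D), image_prompt: (R, D)."""
    # concat(T, P_i): text tokens followed by the image prompt tokens of frame i.
    context = np.concatenate([text_emb, image_prompt], axis=0)
    q = v_i @ w_q        # query from the video frame feature
    k = context @ w_k    # key from concat(T, P_i)
    v = context @ w_v    # value from concat(T, P_i)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v
```

Each frame $i$ gets its own $P_i$, while the text embedding $T$ is shared across frames.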

Loss

For consistency across the video, the model is optimized the same way as previous approaches, with an MSE loss.

$$\mathrm{Loss} = \mathrm{MSE}(G, O)$$

Here $G = (G_1, G_2, \ldots, G_L)$ is the ground truth and $O = (O_1, O_2, \ldots, O_L)$ is the predicted transition video, each with $L$ frames.
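For reference, the loss is the plain mean squared error over all frames. A minimal sketch, assuming $G$ and $O$ are arrays of the same shape (for example `(L, H, W, C)`). The function name is mine.

```python
import numpy as np

def mse_loss(ground_truth, predicted):
    """Mean squared error between ground-truth frames G and predicted frames O."""
    g = np.asarray(ground_truth, dtype=np.float64)
    o = np.asarray(predicted, dtype=np.float64)
    if g.shape != o.shape:
        raise ValueError("G and O must have the same shape")
    return float(np.mean((g - o) ** 2))
```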

Experiments

Implementation

8 A100 GPUs
Learning rate : 1e-4
Image semantic space model : OpenCLIP ViT-H-14

Comparison

For image generation, the comparison is against IP-Adapter and PhotoMaker, focusing on how well the character's identity is preserved.

Both text-image similarity and character similarity are measured with CLIP scores.
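A CLIP-based similarity like this is typically a cosine similarity between two CLIP embeddings (text vs. image, or character image vs. generated image). Here is a sketch of just that last step. The embeddings are assumed to be given, and obtaining them from a real CLIP model is outside this snippet.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Higher values mean the two embeddings point in more similar directions. The exact scaling used in the paper's tables is not covered here.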

Next are the video results, compared against SEINE and SparseCtrl.

The authors claim that overall consistency holds up well even with dynamic motion.

Demo

A Hugging Face demo is provided as well, so I ran inference once.

I gave a base character description and then added a description for each scene, as follows.

Comic Description (each line corresponds to a frame)

University is a private comprehensive university in South Korea.

founded in 1937 as Girls' High School of Fine Arts, it established Sangmyung Women's Teachers' College in 1965.

[NC] walks campus with her friends

[NC] in the class

[NC] has a major of computer engineering

It seems to hallucinate more strongly than I expected. The images themselves are clean, but I'm not sure about controllability.
