AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

์ด์€์ƒยท2024๋…„ 4์›” 29์ผ

๋…ผ๋ฌธ๋ฆฌ๋ทฐ

๋ชฉ๋ก ๋ณด๊ธฐ
9/23

๐Ÿ“„AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

written by Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai


Introduction

T2I (Text-to-Image) diffusion models have had a major impact on how artists and amateurs create visual content from text prompts

์ง€๊ธˆ๊นŒ์ง€ ๋‹ค์–‘ํ•œ ๋ชจ๋ธ๋“ค์ด ๊ฐœ๋ฐœ๋˜์—ˆ์ง€๋งŒ, ๊ทธ๋“ค์€ ์ •์  ์ด๋ฏธ์ง€๋งŒ ์ƒ์„ฑํ•ด๋‚ด๊ธฐ ๋•Œ๋ฌธ์— ์• ๋‹ˆ๋ฉ”์ด์…˜๊ณผ ๊ฐ™์€ ๋™์  ์ฝ˜ํ…์ธ  ์ƒ์„ฑ์—๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š๊ณ  ๋”ํ•˜์—ฌ ๋น„์šฉ๊ณผ ๊ณ„์‚ฐ์  ๋น„ํšจ์œจ๋กœ ์ธํ•ด ์‹ค์šฉ์ ์ด์ง€ ์•Š์Œ

๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•˜๋Š” AnimateDiff๋Š” ๊ธฐ์กด์˜ ๊ณ ํ’ˆ์งˆ ๊ฐœ์ธํ™”๋œ T2I ๋ชจ๋ธ์„ ์• ๋‹ˆ๋ฉ”์ด์…˜ ์ƒ์„ฑ๊ธฐ๋กœ ์ง์ ‘ ๋ณ€ํ™˜ ๊ฐ€๋Šฅํ•จ

ํ•ด๋‹น ๋ชจ๋ธ์˜ ํ•ต์‹ฌ์€ ๋น„๋””์˜ค ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ๋ถ€ํ„ฐ ํ•ฉ๋ฆฌ์ ์ธ ๋ชจ์…˜์„ ํ•™์Šตํ•˜๋Š” ํ”Œ๋Ÿฌ๊ทธ ์•ค ํ”Œ๋ ˆ์ด ๋ชจ์…˜ ๋ชจ๋“ˆ์„ ํ›ˆ๋ จํ•˜๋Š” ์ ‘๊ทผ๋ฒ•

AnimateDiff์˜ ํ›ˆ๋ จ ๋‹จ๊ณ„

  1. ๊ธฐ๋ณธ T2I์— ๋„๋ฉ”์ธ ์–ด๋Œ‘ํ„ฐ๋ฅผ ๋ฏธ์„ธ์กฐ์ •ํ•˜์—ฌ ๋Œ€์‚ฐ ๋น„๋””์˜ค ๋ฐ์ดํ„ฐ์…‹์˜ ์‹œ๊ฐ์  ๋ถ„ํฌ์™€ ์ผ์น˜์‹œํ‚ด
  2. ๊ธฐ๋ณธ T2I๋ฅผ ํ•จ๊ป˜ ํ™•์žฅํ•˜๊ณ , ์ƒˆ๋กœ์šด ์ดˆ๊ธฐํ™”๋œ ๋ชจ์…˜ ๋ชจ๋“ˆ์„ ์†Œ๊ฐœํ•˜์—ฌ ๋น„๋””์˜ค์—์„œ ๋ชจ์…˜ ๋ชจ๋ธ๋ง ์ตœ์ ํ™”
  3. ๋ฏธ์„ธ ์กฐ์ • ๊ธฐ์ˆ (LoRA) ์‚ฌ์šฉํ•˜์—ฌ ์‚ฌ์ „ ํ›ˆ๋ จ๋œ ๋ชจ์…˜ ๋ชจ๋“ˆ์„ ํŠน์ • ๋ชจ์…˜ ํŒจํ„ด์— ์ ์‘์‹œํ‚ด

์ด๋Ÿฌํ•œ ์ ‘๊ทผ์„ ํ†ตํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š”,

  1. ํŠน์ • ๋ฏธ์„ธ ์กฐ์ • ์—†์ด ๊ฐœ์ธํ™”๋œ T2I์˜ ์• ๋‹ˆ๋ฉ”์ด์…˜ ์ƒ์„ฑ ๋Šฅ๋ ฅ ํ™œ์„ฑํ™”์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์‹ค์šฉ์  ํŒŒ์ดํ”„๋ผ์ธ ์ œ์‹œ
  2. Transformer architecture๊ฐ€ ๋ชจ์…˜ ์‚ฌ์ „์„ ๋ชจ๋ธ๋งํ•˜๋Š” ๋ฐ ์ถฉ๋ถ„ํ•จ์„ ๊ฒ€์ฆํ•˜๊ณ , ๋น„๋””์˜ค ์ƒ์„ฑ์— ๋Œ€ํ•œ ์ค‘์š”ํ•œ ํ†ต์ฐฐ๋ ฅ ์ œ๊ณต
  3. ์ƒˆ๋กœ์šด ๋ชจ์…˜ ํŒจํ„ด์— ์‚ฌ์ „ ํ›ˆ๋ จ๋œ ๋ชจ์…˜ ๋ชจ๋“ˆ์„ ์ ์‘์‹œํ‚ค๊ธฐ ์œ„ํ•œ ๊ฐ€๋ฒผ์šด ๋ฏธ์„ธ ์กฐ์ • ๊ธฐ์ˆ ์ธ MotionLoRA ์ œ์•ˆ
  4. ์ ‘๊ทผ๋ฐฉ์‹์„ ๋Œ€ํ‘œ์ ์ธ ์ปค๋ฎค๋‹ˆํ‹ฐ ๋ชจ๋ฐ๋กœ๊ฐ€ ๋น„๊ตํ•˜์—ฌ ํฌ๊ด„์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๊ณ  ๋‹ค๋ฅธ ์ƒ์—…์  ๋„๊ตฌ์™€ ๋น„๊ต, ํ˜ธํ™˜์„ฑ ๋ณด์—ฌ์คŒ

์— ๋Œ€ํ•˜์—ฌ ์ด์•ผ๊ธฐ ํ•จ


Text-to-image diffusion models

Text-to-image ์ƒ์„ฑ์„ ์œ„ํ•œ diffusion models๋Š” ์ตœ๊ทผ ๋งŽ์€ ์ฃผ๋ชฉ์„ ๋ฐ›๊ณ  ์žˆ์Œ

  • GLIDE
    introduced text conditioning and demonstrated that incorporating classifier guidance yields more satisfying results
  • DALL-E 2
    improves text-image alignment by leveraging the CLIP joint feature space
  • Imagen
    achieves photorealistic results by incorporating large language models and a cascade architecture
  • Latent Diffusion Model (= Stable Diffusion)
    improves efficiency by moving the diffusion process into the latent space of an auto-encoder
  • eDiff-I
    uses an ensemble of diffusion models specialized for different generation stages

Personalizing T2I models

์‚ฌ์ „ ํ›ˆ๋ จ๋œ T2I๋กœ ์ฐฝ์ž‘ ์šฉ์ดํ•˜๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด ๋งŽ์€ ์ž‘์—…์ด ํšจ์œจ์ ์ธ ๋ชจ๋ธ ๊ฐœ์ธํ™”์— ์ดˆ์  ๋งž์ถ”๊ณ  ์žˆ์Œ

  • DreamBooth
    fine-tunes the entire network with a preservation loss, using only a few images
  • Textual Inversion
    optimizes a token embedding for each new concept
  • Low-Rank Adaptation (LoRA)
    introduces additional LoRA layers into the existing T2I and optimizes only the weight residuals

Animating personalized T2Is

๊ธฐ์กด ์ž‘์—… ๋งŽ์ง€ ์•Š์Œ

  • Tune-a-Video
    fine-tunes a small number of parameters on a single video
  • Text2Video-Zero
    introduces a training-free method that animates a pre-trained T2I based on pre-defined affine matrices

Preliminary

๋ณธ ๋…ผ๋ฌธ์—์„œ ์†Œ๊ฐœํ•˜๋Š” AnimateDiff์˜ base T2I model์ธ Stable Diffusion๊ณผ LoRA์— ๋Œ€ํ•˜์—ฌ ์†Œ๊ฐœํ•จ

Stable Diffusion (SD)

SD is chosen as the base T2I model because it is open-sourced and has a well-developed community with many high-quality personalized T2I models for evaluation

The forward diffusion process gradually adds noise to the encoded image latent $z_0 = \mathcal{E}(x_0)$:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

The denoising network $\epsilon_\theta(\cdot)$ learns to predict the added noise and is trained with an MSE loss:

$$\mathcal{L} = \mathbb{E}_{\mathcal{E}(x_0),\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[ \lVert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \rVert_2^2 \right]$$

where $y$ is the text prompt and $\tau_\theta(\cdot)$ is the text encoder that maps it to an embedding

  • ฯตฮธ(ยท) is implemented as a UNet (Ronneberger et al., 2015) consisting of pairs of down/up sample blocks at four resolution levels, as well as a middle block
  • Each network block consists of ResNet spatial self-attention layers, and cross-attention layers that introduce text conditions
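To make the objective concrete, here is a minimal PyTorch sketch of one training step (illustrative only; the unet and text_encoder callables and the tensor shapes are assumptions, not SD's actual training code):

```python
import torch
import torch.nn.functional as F

def sd_training_step(unet, text_encoder, latents, tokens, alphas_cumprod):
    """One denoising step: predict the noise added to z_0 at a random timestep t."""
    b = latents.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,))            # random timestep per sample
    noise = torch.randn_like(latents)                          # epsilon ~ N(0, I)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)                 # cumulative product alpha_bar_t
    z_t = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise  # forward diffusion equation
    text_emb = text_encoder(tokens)                            # tau_theta(y)
    pred = unet(z_t, t, text_emb)                              # epsilon_theta(z_t, t, tau_theta(y))
    return F.mse_loss(pred, noise)                             # MSE objective
```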

Low-rank adaptation (LoRA)

An approach that accelerates the fine-tuning of large models, first proposed for language model adaptation

model์˜ parameters๋ฅผ retrainingํ•˜๋Š” ๊ฒƒ ๋Œ€์‹  pairs of rank-decomposition matrices ๋”ํ•˜์—ฌ optimizes only these newly introduced weightsํ•จ

๊ธฐ์กด์˜ weights๋Š” frozen์‹œํ‚ค๊ณ  ํ•™์Šต๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ œํ•œํ•จ์œผ๋กœ์จ catastrophic forgetting ๋ฐœ์ƒ ํ™•๋ฅ ์„ ๋‚ฎ์ถค

LoRA๋Š” ์˜ค์ง attention layers์—๋งŒ ์ ์šฉ๋จ


AnimateDiff

The core of the method is learning transferable motion priors from video data, which can be applied to personalized T2Is without specific tuning

inference time์— our motion module(ํ‘ธ๋ฅธ์ƒ‰)๊ณผ optional MotionLoRA(์ดˆ๋ก์ƒ‰)๋Š” directly personalized T2I์— insert๋จ. ์ด๋ฅผ ํ†ตํ•ด animation generator(์ˆœ์ฐจ์ ์œผ๋กœ ๋…ธ์ด์ฆˆ๋ฅผ ์—†์•ฐ์œผ๋กœ์จ ์• ๋‹ˆ๋ฉ”์ด์…˜ ์ƒ์„ฑํ•˜๋Š” ์ƒ์„ฑ์ž)๋ฅผ ๊ตฌ์„ฑํ•จ

AnimateDiff๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ์„ธ ๊ฐ€์ง€ ์š”์†Œ์ธ domain adapter, motion module, MotionLoRA๋ฅผ ํ•™์Šต์‹œํ‚ด์œผ๋กœ์จ ์œ„์˜ ๊ตฌ์กฐ๋„๋ฅผ achieveํ•  ์ˆ˜ ์žˆ์—ˆ์Œ

Alleviate Negative Effects from Training Data with Domain Adapter

๋น„๋””์˜ค ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์…‹์€ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ์…‹์— ๋น„ํ•ด ์‹œ๊ฐ์  ํ’ˆ์งˆ ๋‚ฎ์•„ ๋ชจ์…˜ ๋ธ”๋Ÿฌ, compression artifacts. watermarks ๋“ฑ์˜ ๋ฌธ์ œ ๋ฐœ์ƒ ๊ฐ€๋Šฅํ•จ. ์ด๋Ÿฌํ•œ ํ’ˆ์งˆ ํ€„๋ฆฌํ‹ฐ ๋‚ฎ์Œ์€ ์• ๋‹ˆ๋ฉ”์ด์…˜ ์ƒ์„ฑ ํŒŒ์ดํ”„๋ผ์ธ์— ๋ถ€์ •์  ์˜ํ–ฅ ๋ฏธ์น  ์ˆ˜ ์žˆ์Œ

ํ€„๋ฆฌํ‹ฐ์˜ ์ฐจ์ด๋ฅผ ํ•™์Šตํ•˜์ง€ ์•Š๊ณ  ๊ธฐ์กด T2I์˜ knowledge๋ฅผ ๋ณด์กดํ•˜๊ธฐ ์œ„ํ•˜์—ฌ fit the domain information to a separate network๋ฅผ ์‹ค์‹œํ•จ. ์ถ”๋ก  ์‹œ ๋„๋ฉ”์ธ ์–ด๋Œ‘ํ„ฐ ์ œ๊ฑฐํ•จ์œผ๋กœ์จ domain gap์œผ๋กœ ์ธํ•œ ๋ถ€์ •์  ์˜ํ–ฅ์„ ์ค„์ผ ์ˆ˜ ์žˆ์—ˆ์Œ

domain adapter layer๋Š” LoRA๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ตฌํ•œ๋˜๊ณ  ๊ธฐ๋ณธ T2I์˜ self-/cross-attention layer์— ์‚ฝ์ž…ํ•จ
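Following the paper's LoRA formulation, a query projection in an attention layer with the adapter attached becomes

$$Q = W^Q z + \alpha \cdot A B^\top z$$

where $\alpha$ is the adapter's scaler; setting $\alpha = 0$ at inference removes the adapter's influence entirely.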

Learn Motion Priors with Motion Module

  1. 2์ฐจ์› ํ™•์‚ฐ ๋ชจ๋ธ์„ 3์ฐจ์› ๋น„๋””์˜ค ๋ฐ์ดํ„ฐ์™€ ์ฒ˜๋ฆฌํ•˜๋„๋ก ํ™•์žฅ
  2. ํšจ์œจ์ ์ธ ์ •๋ณด ๊ตํ™˜์„ ์œ„ํ•œ ํ•˜์œ„ ๋ชจ๋“ˆ ์„ค๊ณ„

๋„คํŠธ์›Œํฌ ํ™•์žฅ์€ ์ด๋ฏธ์ง€ ๋ ˆ์ด์–ด๋ฅผ ๋น„๋””์˜ค ํ”„๋ ˆ์ž„์— ๋…๋ฆฝ์ ์œผ๋กœ ์ ์šฉํ•˜์—ฌ ๊ธฐ์กด์˜ ๊ณ ํ’ˆ์งˆ ์ฝ˜ํ…์ธ  ์œ ์ง€
๋ชจ๋“ˆ ์„ค๊ณ„๋Š” ์ตœ๊ทผ ๋น„๋””์˜ค ์ƒ์„ฑ ์ž‘์—…์—์„œ ํƒ๊ตฌ๋œ ์—ฌ๋Ÿฌ ๋””์ž์ธ ๊ธฐ๋ฐ˜์œผ๋กœ transformer architecture ์‚ฌ์šฉํ•˜๊ณ , ์‹œ๊ฐ„ ์ถ•์— ๋งž๊ฒŒ ์•ฝ๊ฐ„์˜ ์ˆ˜์ •์„ ํ†ตํ•ด time transformer๋กœ ์ฐธ์กฐ

์ด๋ฅผ ํ†ตํ•ด ์‹œ๊ฐ์  ๋‚ด์šฉ์˜ ๋ณ€ํ™” ํ•™์Šตํ•˜์—ฌ ์• ๋‹ˆ๋ฉ”์ด์…˜ ํด๋ฆฝ์˜ ์šด๋™ ์—ญํ•™ ๊ตฌ์„ฑํ•˜๋„๋ก T2I model ํ™•์žฅ์‹œํ‚ด

Adapt to New Motion Patterns with MotionLoRA

The pre-trained motion module captures general motion priors, but a problem arises when it must adapt effectively to new motion patterns, such as camera zooming, panning, or rolling

ํ•ด๊ฒฐ ์œ„ํ•˜์—ฌ ์ ์€ ์ˆ˜์˜ ์ฐธ์กฐ ๋น„๋””์˜ค์™€ ํ›ˆ๋ จ ๋ฐ˜๋ณต ํ†ตํ•ด ๋ชจ์…˜ ๋ชจ๋“ˆ์„ ํŠน์ • ํšจ๊ณผ์— ๋Œ€ํ•ด ๋ฏธ์„ธ ์กฐ์ •ํ•  ์ˆ˜ ์žˆ๋Š” ์š”์œจ์ ์ธ ๋ฏธ์„ธ ์กฐ์ • ์ ‘๊ทผ ๋ฐฉ๋ฒ•์ธ MotionLoRA๋ฅผ ์‚ฌ์šฉํ•จ

MotionLoRA๋Š” LoRA ๋ ˆ์ด์–ด๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ƒˆ๋กœ์šด ์šด๋™ ํŒจํ„ด์˜ ์ฐธ์กฐ ๋น„๋””์˜ค์—์„œ ํ›ˆ๋ จ๋˜๊ณ , ์ ์€ ์ž์›์œผ๋กœ๋„ ์ข‹์€ ๊ฒฐ๊ณผ ์–ป์„ ์ˆ˜ ์žˆ์Œ. ์ด๋Ÿฌํ•œ ๋‚ฎ์€ ์ˆœ์œ„ ํŠน์„ฑ์„ ํ™œ์šฉํ•˜์—ฌ ๊ฐœ๋ณ„์ ์œผ๋กœ ํ›ˆ๋ จ๋œ ๋ชจ๋ธ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ์ถ”๋ก  ์‹œ์— ๋‹ค์–‘ํ•œ ๋ชจ์…˜ ํšจ๊ณผ ๋‹ฌ์„ฑ ๊ฐ€๋Šฅํ•จ

์ด๋ฅผ ํ†ตํ•ด ์‚ฌ์šฉ์ž๋Š” ๋น„์šฉ ๋ถ€๋‹ด ์—†์ด ๋ชจ์…˜ ๋ชจ๋“ˆ์„ ์›ํ•˜๋Š” ํšจ๊ณผ์— ๋งž๊ฒŒ ์กฐ์ • ๊ฐ€๋Šฅํ•จ

AnimateDiff in Practice

  1. Training
    The domain adapter is trained with the original objective of the base T2I.
    The motion module and MotionLoRA, as parts of the animation generator, use a similar objective with minor modifications to accommodate the higher-dimensional video data.

  2. Inference
    At inference time, the personalized T2I model is first inflated, then the motion module for general animation generation is injected.
    Optionally, MotionLoRA is injected to generate animations with personalized motion.
    The domain adapter, rather than simply being discarded at inference, can also be kept injected in the personalized T2I model, with its contribution adjusted through the scaler alpha; a sketch of the full inference path follows this list.


Experiments

Qualitative Results

Quantitative Comparison


user study์—์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ ๋ชจ๋ธ์ด ๋†’์€ ๊ฐ’์„ ๋ณด์—ฌ์คŒ์„ ์•Œ ์ˆ˜ ์žˆ์Œ
CLIP metric์—์„œ ๋˜ํ•œ ๋†’์€ ๊ฐ’์„ ๊ฐ€์ง์„ ์•Œ ์ˆ˜ ์žˆ์Œ


Conclusion

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” quality๋ฅผ ํฌ์ƒํ•˜์ง€ ์•Š๊ณ  pre-trained domain knowledge๋ฅผ ์žƒ์ง€ ์•Š๊ณ ๋„ ํ•œ ๋ฒˆ์— ๊ฐœ์ธํ™”๋œ T2I ๋ชจ๋ธ์„ ์• ๋‹ˆ๋ฉ”์ด์…˜ ์ƒ์„ฑ์šฉ์œผ๋กœ ์ง์ ‘ ๋ณ€ํ™˜ํ•˜๋Š” Animatediff๋ฅผ ์ œ์•ˆํ•จ

์ด๋ฅผ ์œ„ํ•ด ์˜๋ฏธ ์žˆ๋Š” ์šด๋™ ์šฐ์„  ์ˆœ์œ„๋ฅผ ํ•™์Šตํ•˜๊ณ  ์‹œ๊ฐ์  ํ’ˆ์งˆ ์ €ํ•˜๋ฅผ ์™„ํ•˜ํ•˜๋ฉฐ MotionLoRA๋ผ๋Š” ๊ฒฝ๋Ÿ‰ ๋ฏธ์„ธ ์กฐ์ • ๊ธฐ์ˆ ์„ ์‚ฌ์šฉํ•˜์—ฌ ์šด๋™ ๊ฐœ์ธํ™”๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ์„ธ ๊ฐ€์ง€ ๊ตฌ์„ฑ ์š”์†Œ ๋ชจ๋“ˆ์„ ์„ค๊ณ„ํ•จ

AnimateDiff๋Š” ๊ธฐ์กด์˜ ๋‚ด์šฉ ์ œ์–ด ์ ‘๊ทผ ๋ฐฉ์‹๊ณผ์˜ ํ˜ธํ™˜์„ฑ์„ ๋ณด์—ฌ ์ถ”๊ฐ€์ ์ธ ํ›ˆ๋ จ ๋น„์šฉ ์—†์ด ์ œ์–ด ๊ฐ€๋Šฅํ•œ ์ƒ์„ฑ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•จ

AnimateDiff๋Š” ๊ฐœ์ธํ™”๋œ ์• ๋‹ˆ๋ฉ”์ด์…˜์„ ์œ„ํ•œ ํšจ๊ณผ์ ์ธ ๊ธฐ์ค€์„ ์ œ๊ณตํ•˜๋ฉฐ ๋‹ค์–‘ํ•œ ์‘์šฉ๋ถ„์•ผ์— ๋Œ€ํ•œ ์ž ์žฌ๋ ฅ์„ ์ง€๋‹ˆ๊ณ  ์žˆ์Œ

0๊ฐœ์˜ ๋Œ“๊ธ€