[2025/W14] 🤗 Weekly AI Research

Sky · April 5, 2025

Weekly AI Research Digest

This post introduces notable AI papers published in week 14 of 2025.

TL;DR

In AI agents and systems, Foundation Agents proposes a brain-inspired design for intelligent agents, and AnimeGamer proposes an MLLM-based infinite anime simulation. In NLP, AdaptiVocab improves LLM efficiency through domain-specific vocabulary adaptation, and Open-Reasoner-Zero develops an open-source RL approach for strengthening reasoning. In computer vision, MergeVQ unifies token merging and quantization, and TextCrafter tackles complex visual text rendering. In multimodal research, MoCha presents movie-grade talking character synthesis, Any2Caption presents video generation driven by the interpretation of diverse conditions, RISEBench presents an evaluation of reasoning-informed visual editing, and Visual-Spatial Reasoning presents an R1-Zero-style training technique. In computer graphics and animation, DreamActor-M1 proposes hybrid-guided human animation, and TokenHSI proposes human-scene interaction synthesis based on task tokenization.

AI Agents and Systems

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Paper, Project

This paper gives a comprehensive survey of how advances in large language models (LLMs) have changed artificial intelligence and of the development of intelligent agents. In particular, it presents an approach to designing intelligent agents around a modular, brain-inspired architecture. The survey is organized into four core areas. First, it maps the modular foundations of intelligent agents onto human brain functions and describes core components including memory, world modeling, reward processing, and emotion systems. Second, it explores how agents can autonomously improve their capabilities through self-enhancement and adaptive evolution mechanisms. Third, it discusses the collective intelligence that emerges in collaborative and evolutionary multi-agent systems. Finally, it stresses the importance of building safe and beneficial AI systems and presents security and ethical alignment strategies for trustworthy real-world deployment.

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

Paper, Project

AnimeGamer proposes an infinite anime life simulation driven by next game state prediction. Recent advances in image and video synthesis have opened new possibilities for generative games, and one application drawing particular attention is turning characters from anime films into interactive game characters. This lets players immerse themselves in a dynamic anime world as their favorite characters through language instructions. The work builds AnimeGamer on a multimodal large language model (MLLM) that generates each game state, including dynamic animation shots that depict character movements and updates to character states. It introduces action-aware multimodal representations that a video diffusion model can decode into high-quality video clips. By taking the representations of previous animation shots as context and predicting the subsequent ones, it can generate games with contextual consistency and satisfying dynamics.

Natural Language Processing

AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation

Paper, Project

This work introduces AdaptiVocab, an approach for improving the efficiency of large language models (LLMs) in focused domains. LLMs show impressive versatility as general-purpose models, but they carry high computational overhead, especially in autoregressive decoding, where every step requires a forward pass. AdaptiVocab is an end-to-end vocabulary adaptation approach that can be applied across tokenizers and architectures: it replaces existing tokens with domain-specific n-gram-based tokens, reducing the number of tokens needed for both input processing and output generation. It initializes the new n-token embeddings with an exponentially weighted combination of existing embeddings and adds a lightweight fine-tuning phase that runs efficiently on a single GPU. Evaluated on two 7B LLMs across three niche domains, AdaptiVocab cut token usage by more than 25% without compromising performance.
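To make the embedding initialization concrete, here is a minimal sketch of an exponentially weighted combination of constituent-token embeddings. The function name, the `decay` base, and the choice of which constituent token receives the most weight are assumptions for illustration; the summary above does not give the exact weighting used in the paper.

```python
def init_ngram_embedding(token_embeddings, decay=2.0, emphasize="last"):
    """Combine constituent-token embeddings into one n-gram embedding.

    token_embeddings: list of equal-length float lists, one per constituent token.
    decay: hypothetical base of the exponential weights (an assumption).
    emphasize: "last" puts the largest weight on the final token,
               "first" puts it on the initial token.
    """
    n = len(token_embeddings)
    if n == 0:
        raise ValueError("need at least one constituent token")
    # Raw weights grow exponentially: decay**0, decay**1, ..., decay**(n-1).
    raw = [decay ** i for i in range(n)]
    if emphasize == "first":
        raw.reverse()
    total = sum(raw)
    weights = [w / total for w in raw]  # normalize so the weights sum to 1
    dim = len(token_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, token_embeddings))
        for d in range(dim)
    ]


# Toy example: a 2-token n-gram with 2-dimensional embeddings.
new_embedding = init_ngram_embedding([[1.0, 0.0], [0.0, 1.0]])
print(new_embedding)
```

Starting the new token near a blend of embeddings the model already knows is what lets the subsequent fine-tuning phase stay lightweight, as opposed to learning the new embeddings from a random initialization.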

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Paper, Project

Open-Reasoner-Zero is the first open-source implementation of large-scale, reasoning-oriented RL training on a base model, with a focus on scalability, simplicity, and accessibility. Through extensive experiments, the authors show that a minimalist approach, vanilla PPO with GAE (lambda=1, gamma=1) and simple rule-based rewards, is sufficient to scale up both response length and benchmark performance without any KL regularization. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, the implementation achieves superior performance on the AIME2024, MATH500, and GPQA Diamond benchmarks while requiring only one tenth of the training steps of the DeepSeek-R1-Zero pipeline. In the spirit of open source, the authors release the source code, parameter settings, training data, and model weights in various sizes.
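With lambda=1 and gamma=1, GAE reduces to the undiscounted return-to-go minus the critic's value estimate, which is part of what makes this recipe so simple. The sketch below checks that identity numerically. The reward rule, the value numbers, and the function names are hypothetical and are not taken from the Open-Reasoner-Zero code.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation for one finished episode.

    rewards[t] is the reward received after step t; values[t] is the critic's
    estimate at step t. The value after the final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def rule_based_reward(answer, reference):
    """Hypothetical rule-based reward: 1 for an exact match, otherwise 0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


# Sparse reward that arrives only on the final token of a 4-token response.
rewards = [0.0, 0.0, 0.0, rule_based_reward(" 42", "42")]
values = [0.3, 0.5, 0.6, 0.9]  # made-up critic outputs

adv = gae_advantages(rewards, values)
# With gamma = lam = 1 the TD errors telescope to (return-to-go - value).
expected = [sum(rewards[t:]) - values[t] for t in range(len(rewards))]
print(adv)
print(expected)
```

The two printed lists agree up to floating-point rounding, so in this setting every token's advantage is simply "final reward minus what the critic predicted at that token".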

Computer Vision

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Paper, Project

MergeVQ is a unified framework for visual generation and representation that takes a disentangled approach to token merging and quantization. Masked image modeling (MIM) with vector quantization (VQ) has been very successful in both self-supervised pre-training and image generation, but existing methods struggle to balance generation quality, representation learning, and efficiency. MergeVQ aims to close this gap by incorporating token merging into VQ-based generative models. During pre-training, a token merge module placed after the encoder's self-attention blocks decouples top-k semantics from the latent space for subsequent Look-up Free Quantization (LFQ) and global alignment, and cross-attention in the decoder recovers the fine-grained details needed for reconstruction. For second-stage generation, the paper introduces MergeAR, which performs KV cache compression for efficient raster-order prediction. MergeVQ achieves competitive performance in both visual representation learning and image generation while maintaining token efficiency and inference speed.
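For readers unfamiliar with Look-up Free Quantization, the general idea is that each latent dimension is quantized independently to -1 or +1 by its sign, so the token index can be read off as a binary number instead of being found through a nearest-neighbor search in a learned codebook. The sketch below illustrates only that general idea; the function names are hypothetical, and it does not reflect MergeVQ's actual implementation or its token merging step.

```python
def lfq_quantize(latent):
    """Quantize each latent dimension to -1.0 or +1.0 by sign.

    Returns (code, index), where index is the integer whose i-th bit is 1
    exactly when latent[i] is positive.
    """
    code = [1.0 if x > 0 else -1.0 for x in latent]
    index = sum(1 << i for i, x in enumerate(latent) if x > 0)
    return code, index


def lfq_decode(index, dim):
    """Rebuild the +/-1 code from a token index; no codebook lookup needed."""
    return [1.0 if (index >> i) & 1 else -1.0 for i in range(dim)]


latent = [0.3, -1.2, 0.8]
code, index = lfq_quantize(latent)
print(code, index)
print(lfq_decode(index, len(latent)))
```

A `dim`-dimensional latent yields an implicit vocabulary of `2 ** dim` tokens, which is how this style of quantizer reaches large vocabularies without storing codebook embeddings.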

TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

Paper, Project

This paper introduces TextCrafter, a new method for the Complex Visual Text Generation (CVTG) task. Existing image generation models tend to render distorted or blurred visual text, or to omit some of the text entirely. TextCrafter adopts a progressive strategy that decomposes complex visual text into distinct components while ensuring robust alignment between the textual content and its visual carrier. It also adds a token focus enhancement mechanism that amplifies the prominence of visual text during generation, which addresses the main CVTG failure modes: text confusion, omission, and blurriness. The authors also present a new benchmark dataset, CVTG-2K, for rigorously evaluating generative models on this task, and extensive experiments show that the method surpasses state-of-the-art approaches.

Multimodal

MoCha: Towards Movie-Grade Talking Character Synthesis

Paper, Project

MoCha presents a method for movie-grade talking character synthesis. Despite recent progress in video generation, character-driven storytelling has been overlooked; this work introduces Talking Characters, a more realistic task of generating talking character animation directly from speech and text. Unlike conventional talking-head generation, Talking Characters aims to produce the full portrait of one or more characters, beyond the facial region. To synchronize video and speech precisely, the authors propose a speech-video window attention mechanism, and to address the scarcity of large-scale speech-labeled video datasets, they introduce a joint training strategy. They also design structured prompt templates with character tags, enabling multi-character conversation with turn-based dialogue for the first time. This allows AI-generated characters to engage in context-aware conversations with cinematic coherence.
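One way to picture a speech-video window attention mechanism is as a cross-attention mask in which each video token may only attend to the audio tokens that fall near its own point in time. The sketch below builds such a mask under strong assumptions: audio tokens are spread evenly across video tokens, and the window extends `radius` video tokens to each side. Neither assumption, nor the function name, comes from the MoCha paper.

```python
def window_attention_mask(num_video_tokens, num_audio_tokens, radius=1):
    """Return mask[v][a] == True when video token v may attend to audio token a."""
    if num_video_tokens <= 0 or num_audio_tokens <= 0:
        raise ValueError("token counts must be positive")
    # Hypothetical alignment: audio tokens are split evenly across video tokens.
    per_video = num_audio_tokens / num_video_tokens
    mask = []
    for v in range(num_video_tokens):
        start = max(int((v - radius) * per_video), 0)
        stop = min(int((v + 1 + radius) * per_video), num_audio_tokens)
        mask.append([start <= a < stop for a in range(num_audio_tokens)])
    return mask


# 3 video tokens, 6 audio tokens, window of one neighboring video token per side.
for row in window_attention_mask(3, 6, radius=1):
    print("".join("x" if allowed else "." for allowed in row))
```

Restricting attention to a local window ties each frame's mouth and body motion to the speech around the same moment, instead of letting it attend to unrelated parts of the audio track.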

Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

Paper, Project

This paper introduces Any2Caption, a new framework that addresses the bottleneck of interpreting user intent in controllable video generation. The key idea is to decouple the interpretation of diverse conditions from the video synthesis step. Using modern multimodal large language models (MLLMs), it interprets diverse inputs, including text, images, videos, and specialized cues such as region, motion, and camera poses, into dense, structured captions that give the video generator better guidance. The paper also introduces Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations show that the system significantly improves controllability and video quality across various aspects of existing video generation models.

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Paper, Project

This paper introduces RISEBench, the first benchmark for evaluating reasoning-informed visual editing (RISE). Large multimodality models (LMMs) have made substantial progress in visual understanding and generation, but they still struggle with general visual editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. RISEBench focuses on four key reasoning types: temporal, causal, spatial, and logical reasoning. The authors curate high-quality test cases for each category and propose an evaluation framework that assesses instruction reasoning, appearance consistency, and visual plausibility using both human judges and an LMM-as-a-judge approach. In the experiments, GPT-4o-Native significantly outperforms other open-source and proprietary models, yet even this state-of-the-art system struggles with logical reasoning tasks. RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research.

Improved Visual-Spatial Reasoning via R1-Zero-Like Training

Paper, Project

This work takes an in-depth look at improving the visual-spatial reasoning of multimodal large language models (MLLMs) through R1-Zero-like training. As a cornerstone for AI agents that operate in the physical world, video-based visual-spatial intelligence (VSI) is emerging as one of the most critical reasoning capabilities of MLLMs. Technically, the authors first find that the visual-spatial reasoning capacities of small- to medium-sized Qwen2-VL models cannot be activated via chain-of-thought (CoT) prompts. They then follow DeepSeek-R1-Zero and apply GRPO training on a carefully curated VSI-100k dataset to improve visual-spatial reasoning. Along the way, they identify the need to keep the KL penalty in GRPO. With only 120 GPU hours, the vsGRPO-2B model, fine-tuned from Qwen2-VL-2B, outperforms the base model by 12.1% and surpasses GPT-4o.
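The following is a rough sketch of the two ingredients mentioned here, group-relative advantages and a KL penalty against a reference model, in the form commonly described for GRPO. The clip range, the `beta` coefficient, and all function names are illustrative assumptions, not the settings used to train vsGRPO.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled responses within the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def kl_penalty(logp_policy, logp_ref):
    """Per-token KL estimate: exp(d) - d - 1 with d = logp_ref - logp_policy.

    This estimate is non-negative and equals 0 when the two log-probs match.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


def grpo_token_objective(logp_policy, logp_old, logp_ref, advantage,
                         clip=0.2, beta=0.04):
    """Clipped surrogate minus a KL penalty for a single token (to be maximized)."""
    ratio = math.exp(logp_policy - logp_old)
    clipped = max(min(ratio, 1.0 + clip), 1.0 - clip)
    surrogate = min(ratio * advantage, clipped * advantage)
    return surrogate - beta * kl_penalty(logp_policy, logp_ref)


# Four sampled answers to one question; two were judged correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
print(grpo_token_objective(-1.0, -1.0, -1.2, advantages[0]))
```

Setting `beta` to 0 removes the pull toward the reference model, which is the ablation the paper argues against for this task.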

Computer Graphics and Animation

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Paper, Project

DreamActor-M1 presents an approach to holistic, expressive, and robust human image animation with hybrid guidance. Recent image-based human animation methods achieve realistic body and facial motion synthesis, but critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which lowers their expressiveness and robustness. The work uses hybrid control signals that integrate implicit facial representations, 3D head spheres, and 3D body skeletons to achieve robust control of facial expressions and body movements while producing expressive, identity-preserving animations. To handle various body poses and image scales ranging from portraits to full-body views, it adopts a progressive training strategy using data with varying resolutions and scales. Experiments show that the method outperforms state-of-the-art work, delivering expressive results for portrait, upper-body, and full-body generation with robust long-term consistency.

TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization

Paper, Project

TokenHSI proposes a unified method for synthesizing physical human-scene interactions (HSI) through task tokenization. Synthesizing diverse and physically plausible HSIs is important for both computer animation and embodied AI. Current methods mainly focus on developing separate controllers, each specialized for a specific interaction task, which significantly limits their ability to tackle challenging HSI tasks that require integrating multiple skills, such as sitting down while carrying an object. To address this, TokenHSI presents a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key idea is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, which facilitates multi-task training. Experiments show that the approach can significantly improve versatility, adaptability, and extensibility across various HSI tasks.

I'm Sky, and I'm interested in XR and AI.
