[2025/W23] 🤗 Weekly AI Research

Sky · June 6, 2025

Weekly AI Research Digest


AI that breaks through the limits of reasoning with reinforcement learning (RL) and moves toward the real world by integrating vision and language
Progress accelerates as the field goes beyond fixed datasets and builds dynamic learning environments that help models grow

TL;DR

In LLM reasoning enhancement and optimization, this week's papers have a model analyze the cause of its own failure and learn by retrying (Reflect, Retry, Reward), create reasoning abilities that were not there before through prolonged reinforcement learning (ProRL), maximize efficiency by training only on the small set of tokens that are decisive for the reasoning process (Beyond the 80/20 Rule), and gain both performance and efficiency by thinking deeply only when needed (AlphaOne). Together they push LLM reasoning a step further, mostly through reinforcement learning.

In multimodal AI that integrates vision and language, the papers point out that recent video models are "time-blind" and fail to perceive changes that unfold over time (Time Blindness), propose a framework that handles image understanding, generation, and editing with a single model (UniWorld), lower the barrier so that anyone can research high-performance robot control with a small, efficient model (SmolVLA), and release a very high-performing vision-language model that beats existing models using large-scale data and reinforcement learning (MiMo-VL).

In building the dataset ecosystem, a new learning environment (REASONING GYM) moves away from fixed datasets and generates a near-unlimited supply of reasoning problems matched to the model's level, enabling systematic training and evaluation.

LLM Reasoning Enhancement and Optimization

Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

Paper

This paper proposes a two-stage framework: when a large language model fails a given task, it generates a "reflection" text that analyzes the cause of the failure, retries the task using that reflection, and is rewarded through reinforcement learning if the retry succeeds. With this "reflect, retry, reward" cycle, the model learns on its own from nothing more than binary success/failure information, with no elaborate external feedback. The paper reports substantial gains on complex tasks such as math problem solving and function calling, including outperforming models more than 10 times larger.
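The cycle described above can be sketched as a single episode function. This is a minimal illustration, not the paper's implementation: `generate`, `verify`, the prompt wording, and the choice to attach the reward to the reflection text are all assumptions made for the sketch.

```python
from typing import Callable, Tuple

# Hypothetical signatures; none of these names come from the paper.
Generate = Callable[[str], str]      # prompt -> model output
Verify = Callable[[str, str], bool]  # (task, answer) -> success or failure


def reflect_retry_episode(task: str, generate: Generate, verify: Verify) -> Tuple[str, float]:
    """Run one episode and return (reflection_text, reward)."""
    first = generate(task)
    if verify(task, first):
        # Solved on the first try: no reflection, no training signal here.
        return "", 0.0
    # Stage 1: ask the model to analyze its own failure.
    reflection = generate(f"{task}\nPrevious answer: {first}\nExplain what went wrong.")
    # Stage 2: retry with the reflection in context.
    second = generate(f"{task}\nReflection: {reflection}\nTry again.")
    # Binary reward: 1 only if the retry succeeds.
    reward = 1.0 if verify(task, second) else 0.0
    return reflection, reward


def toy_generate(prompt: str) -> str:
    """Stand-in for a model: wrong at first, right after reflecting."""
    if "Reflection:" in prompt:
        return "4"
    if "Explain what went wrong" in prompt:
        return "I miscounted; recompute the sum carefully."
    return "5"


def toy_verify(task: str, answer: str) -> bool:
    return answer.strip() == "4"
```

In a real setup, the returned reward would feed a policy-gradient update; the toy functions only exist to show the control flow.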

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Paper, Project

This paper asks whether reinforcement learning can actually create new reasoning abilities rather than merely amplify abilities already latent in the model, and presents a methodology called ProRL (Prolonged Reinforcement Learning). After training a model for a long time on diverse tasks with careful control techniques, the authors observe that the reasoning boundary itself expands: the trained model finds new solutions to problems the base model could not solve even after many attempts. This suggests that reinforcement learning, given enough time and compute, may take a language model past its original reasoning limits.
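"Could not solve even after many attempts" is commonly quantified with pass@k, the probability that at least one of k sampled answers is correct. The sketch below implements the standard unbiased pass@k estimator; whether the paper uses exactly this estimator is an assumption here, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct),
    given c correct answers among n drawn samples."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain a correct sample.
        return 1.0
    # 1 minus the fraction of size-k subsets that contain no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this measure, a task where the base model stays at 0 for large k while the trained model scores above 0 is the kind of evidence the summary describes.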

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Paper, Project

This study analyzes how reinforcement learning improves the reasoning ability of language models from the perspective of token entropy, and finds that a small number of decisive "high-entropy tokens" determine the direction of the reasoning process. In experiments, concentrating training on only the roughly 20% of tokens that act as these critical forks, instead of on all tokens, achieved higher performance than training on everything. The result indicates that the effectiveness of reinforcement learning comes largely from optimizing this key minority of tokens.
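The selection step can be illustrated with plain Python. This is a minimal sketch, assuming entropy is computed per position from the model's next-token distribution and that the top 20% of positions are kept; the function names and the simple masked mean are illustrative, not the authors' code.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_mask(entropies: List[float], keep_fraction: float = 0.2) -> List[bool]:
    """Mark the top `keep_fraction` of positions by entropy."""
    if not entropies:
        return []
    k = max(1, round(len(entropies) * keep_fraction))
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    chosen = set(order[:k])
    return [i in chosen for i in range(len(entropies))]


def masked_mean_loss(per_token_loss: List[float], mask: List[bool]) -> float:
    """Average the loss over the selected positions only."""
    kept = [loss for loss, keep in zip(per_token_loss, mask) if keep]
    return sum(kept) / len(kept)
```

Positions where the model is nearly certain (entropy close to 0) drop out of the loss, so the gradient comes only from the positions where the model is choosing between several plausible continuations.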

AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

Paper, Project

This study proposes AlphaOne, a general framework that flexibly modulates "slow thinking" and "fast thinking" at test time, when the model performs reasoning. Using a concept called the "alpha moment", the framework dynamically inserts reasoning tokens in the phase where deep thinking is needed so that the model deliberates carefully, then switches to producing the answer quickly in the later phase. Across a range of reasoning benchmarks, it achieves both better performance and better efficiency than existing approaches.

Multimodal AI Integrating Vision and Language

Time Blindness: Why Video-Language Models Can't See What Humans Can?

Paper, Project

This study points out a fundamental limitation of recent video-language models called "Time Blindness": they are strong on spatial information within a frame but cannot recognize information that is encoded only in the flow of time. Using SpookyBench, a benchmark the authors built themselves, they show that temporal patterns humans recognize with over 98% accuracy are detected by state-of-the-art models with 0% accuracy. This exposes a serious weakness: current models cannot separate temporal processing from spatial features.

UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Paper, Project

To unify image understanding, generation, and editing in a single model, this study proposes UniWorld, a new framework that uses "semantic encoder" features from a strong vision-language model instead of the conventional VAE approach. With this approach, the model surpasses a competing model on image editing while using 100 times less data, and it retains strong image understanding and generation capabilities. The authors take this as evidence that semantic features are a highly effective basis for building a unified model that spans diverse visual tasks.

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Paper, Project

This paper presents SmolVLA, a small and efficient vision-language-action model, to address the problem that existing robotics models are too large and expensive to deploy in real environments. Designed to be trainable on a single GPU and to run on an ordinary PC, the model achieves performance competitive with models more than 10 times its size. This opens a path to researching and deploying high-performance robot control with less cost and fewer resources.

MiMo-VL Technical Report

Paper, Project

This report details the technical results of MiMo-VL, a state-of-the-art open-source vision-language model developed by Xiaomi. The model is pre-trained in four stages on 2.4 trillion tokens and then trained with Mixed On-policy Reinforcement Learning (MORL), which combines diverse reward signals. The report states that it beats competing models, outperforms models with billions more parameters, and sets new best results in specialized areas such as GUI control, demonstrating strong general-purpose visual and reasoning capabilities.

Building the Dataset Ecosystem

REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Paper, Project

To overcome the limitations of fixed datasets, this paper introduces Reasoning Gym (RG), a new library of reasoning environments. Through procedural generation, RG can produce a virtually unlimited amount of training data with adjustable difficulty across domains such as algebra, logic, and games. Because it can keep supplying tasks matched to a language model's level, it provides an environment for training and evaluating reasoning ability more systematically and effectively.
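The idea of procedural generation with adjustable difficulty and a verifiable reward can be shown with a toy generator. This is a standalone sketch under stated assumptions, not the Reasoning Gym API: `Problem`, `make_linear_equation`, and `score_answer` are hypothetical names made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    question: str
    answer: str


def make_linear_equation(seed: int, difficulty: int) -> Problem:
    """Generate 'a*x + b = c' with an integer solution.
    The number range grows with `difficulty` (assumed >= 1)."""
    rng = random.Random(seed)  # same seed and difficulty -> same problem
    hi = 10 ** difficulty
    a = rng.randint(2, hi)
    x = rng.randint(-hi, hi)
    b = rng.randint(-hi, hi)
    c = a * x + b
    sign = "+" if b >= 0 else "-"
    return Problem(f"Solve for x: {a}*x {sign} {abs(b)} = {c}", str(x))


def score_answer(problem: Problem, response: str) -> float:
    """Verifiable reward: 1.0 for an exact match, else 0.0."""
    return 1.0 if response.strip() == problem.answer else 0.0
```

Because each problem is a pure function of its seed and difficulty, a training loop can draw fresh problems indefinitely and raise `difficulty` once the model's accuracy at the current level is high.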

I'm Sky, and I'm deeply interested in XR and AI.
