Mastering diverse control tasks through world models (Nature 2025)

May 30, 2025

๐Ÿฆ์™œ ์ด๊ฑธ ํƒํ–ˆ๋Š”๊ฐ€~
์ผ๋‹จ ์ „ํ†ต์ ์ธ RL์—์„œ๋Š” ๊ตฌ์กฐ๊ฐ€ ๋‹จ์ˆœํ–ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ Dreamer๋Š” ๋‹จ์ˆœํžˆ ์ •์ฑ… ๋„คํŠธ์›Œํฌ๋งŒ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ "์„ธ์ƒ์„ ์ƒ์ƒํ•˜๋Š” ๋‡Œ"๋ฅผ ๋งŒ๋“ค๊ณ  ์ด๊ฑธ ๊ธฐ๋ฐ˜์œผ๋กœ ํ–‰๋™์„ ๊ฒฐ์ •ํ–ˆ๋‹ค.... ๋Š” ์ ์ด ์‹ ๊ธฐํ•ด์„œ!ใ…Žใ…Ž

  • It also made me think that RL could be made more practical and general-purpose by using RNNs and other architectures.
  • Developing a general algorithm that solves tasks across a wide range of applications is a very important and challenging problem in AI.
  • Current RL algorithms work well on the kind of task they were built for, but applying them to a new domain takes considerable human expertise and experimentation.
  • DreamerV3, the algorithm presented in the paper, outperforms specialized methods on over 150 diverse tasks.
  • = it makes RL easy to apply to new problems
  • It overcomes these difficulties with a set of robustness techniques based on normalization, balancing, and transformations.
  • Dreamer learns a model of the environment and improves its behavior by imagining future scenarios.
  • Applied out of the box, Dreamer is, to the authors' knowledge, the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula.
  • This has been posed as a significant challenge in AI, because it requires exploring farsighted strategies from pixels and sparse rewards in an open world.

Learning Algorithm

  • The algorithm consists of three neural networks: the world model, the critic, and the actor.
  • While the agent interacts with the environment, the three components are trained concurrently from replayed experience.

Rather than learning directly from interaction with the environment, DreamerV3 is a model-based RL method: it learns a "world model" and improves its policy by imagining future scenarios!

World model Learning

The paper implements the world model as a recurrent state-space model (RSSM), as shown in the paper's figure. First, an encoder maps each sensory input $x_t$ to a stochastic representation $z_t$ for every time step $t$ of the training sequence. Then a sequence model with recurrent state $h_t$ predicts the sequence of these representations given the past actions $a_{t-1}$. The concatenation of $h_t$ and $z_t$ forms the model state, from which the model predicts the reward $r_t$ and the episode continuation flag $c_t \in \{0,1\}$, and reconstructs the inputs to ensure informative representations.

Let's start with what a world model actually is. Here, a world model is a model through which the RL agent learns and understands the environment itself. It is similar to how a person learns the rules and dynamics of the world by observing and experiencing their surroundings. The world model learns a compressed form of the sensory inputs, and it uses an RSSM (Recurrent State-Space Model) to do so.

Here $x_1, x_2, x_3$ are the images the agent observed in the real environment, such as game screens or other visual inputs. These observed images are encoded into latent-space vectors $z_1, z_2, z_3$, which are compressed representations of the important features of each image.

  • First, the encoder converts high-dimensional sensory inputs into a compressed latent representation.
  • Next, based on the previous action and latent state, the sequence model maintains a recurrent state that is used to predict the next latent state.
  • Dynamics predictor: predicts the probability distribution of the next latent state from the recurrent state alone.
  • Reward predictor: predicts the expected reward from the latent state.
  • Continue predictor: predicts whether the episode continues or ends at the current state.
  • Decoder: reconstructs the original sensory input from the latent state. This forces the latent representation to carry the key information in the input.

[!note] Why does reconstructing from the latent state force it to hold the key information?
Information can be lost when data is compressed. This is where the reconstruction loss comes in. The decoder receives only the compressed latent state as input and tries to produce a reconstruction $\hat{x}_t$ that is as close as possible to the original input $x_t$. The reconstruction loss measures the difference between the decoder's output and the original input, and the model is trained to minimize that difference.

==Training objective==

In the end, the world model's training objective is to minimize the following three losses.

$$\mathcal{L}(\phi) \doteq E_{q_\phi} \left[ \sum_{t=1}^T \left( \beta_{\text{pred}} \mathcal{L}_{\text{pred}}(\phi) + \beta_{\text{dyn}} \mathcal{L}_{\text{dyn}}(\phi) + \beta_{\text{rep}} \mathcal{L}_{\text{rep}}(\phi) \right) \right].$$
  • The prediction loss, the dynamics loss, and the representation loss. -> Going any deeper here would mean seriously studying RL... haha
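As a rough sketch, the objective above is just a weighted sum of three per-step losses. The function below assumes the per-step loss values have already been computed for one sampled sequence (so the expectation over $q_\phi$ is replaced by a single sample), and the default weights are placeholder values chosen for illustration, not numbers taken from this post.

```python
def world_model_loss(pred_losses, dyn_losses, rep_losses,
                     beta_pred=1.0, beta_dyn=1.0, beta_rep=0.1):
    """Sum over time steps of the weighted prediction, dynamics and representation losses.

    The beta defaults are illustrative placeholders.
    """
    total = 0.0
    for l_pred, l_dyn, l_rep in zip(pred_losses, dyn_losses, rep_losses):
        total += beta_pred * l_pred + beta_dyn * l_dyn + beta_rep * l_rep
    return total


# Hypothetical per-step loss values for a sequence of length T = 2.
total_loss = world_model_loss(
    pred_losses=[2.0, 1.0],
    dyn_losses=[1.0, 1.0],
    rep_losses=[1.0, 3.0],
)
print(total_loss)
```

Giving the representation loss a smaller weight than the dynamics loss is one way to express the "balancing" idea: the prior is pulled toward the encoder's representations more strongly than the representations are pulled toward the prior.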

Actor Critic Learning

Actor-critic is an RL methodology made up of a critic, which evaluates future value, and an actor, which chooses actions based on the critic's feedback.

First, the game image at the bottom left of the figure is the observation $x_1$ received from the real environment. The encoder converts it into a latent representation. The resulting latent vector $z_1$ is a stochastic latent state: a summary of "what was there?" obtained from the observation.

$h_t$ at the top is the deterministic hidden state. It is produced by feeding the previous step's hidden state, sampled latent, and action into an RNN.

$$h_{t+1} = f(h_t, z_t, a_t)$$

It acts like a memory that accumulates information over time, so it captures the temporal context.
Combining the two gives $s_t = \{h_t, z_t\}$, a Markov state representation, and the actor and critic operate on it.
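The recurrence and the combined state can be sketched in a few lines. This is a toy version under loud assumptions: `f` is a single tanh layer with random weights instead of the paper's recurrent network, the latent `z` is random noise standing in for the encoder's sample, and all sizes and names (`rnn_step`, `model_state`, `H`, `Z`, `A`) are invented for the example.

```python
import math
import random


def rnn_step(h, z, a, weights):
    """Toy version of h_{t+1} = f(h_t, z_t, a_t): one tanh layer over the concatenated inputs."""
    inputs = h + z + a  # list concatenation
    return [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in weights]


def model_state(h, z):
    """s_t = {h_t, z_t}: the deterministic and stochastic parts side by side."""
    return h + z


rng = random.Random(0)
H, Z, A = 4, 3, 2  # hypothetical sizes of h, z and the action vector
weights = [[rng.uniform(-0.5, 0.5) for _ in range(H + Z + A)] for _ in range(H)]

h = [0.0] * H
states = []
for t in range(3):
    z = [rng.gauss(0.0, 1.0) for _ in range(Z)]    # stand-in for the encoder's sample z_t
    a = [rng.uniform(-1.0, 1.0) for _ in range(A)]  # stand-in for the action a_t
    states.append(model_state(h, z))                # s_t, the input to actor and critic
    h = rnn_step(h, z, a, weights)                  # carry memory forward to step t+1

print(len(states), len(states[0]))
```

The point of the split is visible in the loop: `z` is resampled at every step, while `h` is the only thing carried across steps.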

Critic learning

The critic learns the distribution of returns predicted from the states imagined by the world model.

Goal: learn $v_\psi(R_t \mid s_t)$!

= Put simply, it predicts the "value" (the sum of future rewards) of the current state, which helps the actor learn.
This is where imagination-based learning, or trajectory imagination, comes in. Instead of running the real environment every time (which costs a lot of money and time!), learning happens on virtual scenarios imagined by the world model.

$$\text{Actor:} \quad a_t \sim \pi_\theta(a_t \mid s_t) \qquad \text{Critic:} \quad v_\psi(R_t \mid s_t)$$
  1. Trajectory generation
    First, the world model and the actor are used to generate a trajectory in imagination. A trajectory is simply the full sequence of states, actions, and rewards: s1 -> a1 -> r1 -> s2 -> a2 -> ..., like that!
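Trajectory generation is essentially a loop in which the actor picks an action and the world model answers with a reward and the next state. The sketch below uses a one-dimensional toy "world" in place of a learned model; `imagine`, `toy_actor`, `toy_dynamics`, `toy_reward`, and the horizon value are all assumptions made for the example.

```python
import random


def imagine(start_state, actor, dynamics, reward_fn, horizon):
    """Roll out s1 -> a1 -> r1 -> s2 -> ... without touching the real environment."""
    trajectory = []
    s = start_state
    for _ in range(horizon):
        a = actor(s)             # the actor chooses an action in the imagined state
        r = reward_fn(s, a)      # the model predicts the reward
        trajectory.append((s, a, r))
        s = dynamics(s, a)       # the model predicts the next state
    return trajectory


rng = random.Random(0)


def toy_actor(s):
    return rng.choice([-1, 1])   # random policy: step left or right


def toy_dynamics(s, a):
    return s + a                 # stand-in for the learned sequence model


def toy_reward(s, a):
    return 1.0 if s + a == 0 else 0.0  # reward for landing on position 0


traj = imagine(0, toy_actor, toy_dynamics, toy_reward, horizon=15)
print(traj[:3])
```

Because every step comes from the model, many such rollouts can be generated from a single batch of real data.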

  2. Computing the $\lambda$-return
    In RL it makes no sense to look only at the "immediate reward". Future rewards have to be mixed in at some ratio as well. The result is called the $\lambda$-return.
    The critic is trained on these bootstrapped $\lambda$-returns.
    (Honestly, by this point I was wondering whether this paper is more about RL than about architecture, but it was already too late.)

$$R^\lambda_t = r_t + \gamma c_t \left[ (1 - \lambda) v_t + \lambda R^\lambda_{t+1} \right]$$
  • $v_t$: the value the critic predicts for the current state (its expected return)
  • $\gamma = 0.997$: the discount factor! -> it controls how much future rewards are decayed
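The recursion is easiest to compute backwards from the end of the imagined trajectory. The sketch below follows the indexing of the formula exactly as written in this post and assumes the last return is bootstrapped from the critic, $R^\lambda_T = v_T$. That boundary condition and the default `lam=0.95` are assumptions, since the post states neither.

```python
def lambda_returns(rewards, continues, values, gamma=0.997, lam=0.95):
    """Bootstrapped lambda-returns, computed from the last step backwards.

    rewards[t] and continues[t] cover t = 0..T-1; values has one extra entry
    so that values[T] can serve as the bootstrap (assumed R_T = v_T).
    """
    T = len(rewards)
    if len(continues) != T or len(values) != T + 1:
        raise ValueError("expected T rewards, T continue flags and T+1 values")
    returns = [0.0] * (T + 1)
    returns[T] = values[T]
    for t in reversed(range(T)):
        mix = (1 - lam) * values[t] + lam * returns[t + 1]
        returns[t] = rewards[t] + gamma * continues[t] * mix
    return returns


# With gamma = lam = 1 the return is the remaining rewards plus the bootstrap value.
full = lambda_returns([1, 1, 1], [1, 1, 1], [0, 0, 0, 5], gamma=1.0, lam=1.0)
# A continue flag of 0 cuts off everything after that step.
cut = lambda_returns([1, 1, 1], [1, 0, 1], [0, 0, 0, 5], gamma=1.0, lam=1.0)
print(full, cut)
```

Setting `lam=0` makes each return rely only on the critic's own estimate, while `lam=1` relies only on the imagined rewards; values in between trade the two off.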
  3. Predicting a probability distribution
    Because returns can take many different forms, the critic predicts a probability distribution rather than a single number.

  4. Next, real experience from the replay buffer is mixed in to supplement the imagined trajectories.

[!note] Replay buffer
A memory that stores past experience (trajectories). Each entry is stored like this:

$$(x_t, a_t, r_t, x_{t+1}, c_t)$$

RL has to run the environment to get data, so data is expensive. With a replay buffer, stored past experience can be reused for training many times!

🧩 How is it used in Dreamer?
Dreamer trains on two kinds of data.
1. Experience collected in the real environment, stored in the replay buffer (this is what the world model learns from)
2. Trajectories imagined by the world model, i.e. imagined trajectories (this is what the actor and critic mainly learn from)
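A replay buffer can be sketched with a bounded queue. This is a simplified illustration: it stores single transitions in the tuple format shown above and samples them uniformly with replacement, whereas a real agent would typically sample whole sequences. The class name, capacity, and seed are invented for the example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (x_t, a_t, r_t, x_next, c_t) transitions."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)  # the oldest entries fall out first
        self.rng = random.Random(seed)

    def add(self, x_t, a_t, r_t, x_next, c_t):
        self.storage.append((x_t, a_t, r_t, x_next, c_t))

    def sample(self, batch_size):
        """Uniform sampling with replacement, so old experience can be reused."""
        return [self.rng.choice(self.storage) for _ in range(batch_size)]

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Toy transition: the observation is the step index, reward 0, episode continues.
    buffer.add(step, 0, 0.0, step + 1, 1)

batch = buffer.sample(4)
print(len(buffer), batch)
```

After five insertions into a buffer of capacity three, only the three most recent transitions remain, and each can be drawn any number of times.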

  5. An exponential moving average (EMA) of the critic's own earlier parameters is used as a reference, so the critic is regularized toward its own slower-moving predictions and its estimates stay stable.
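The EMA copy of the parameters can be sketched as follows. Parameters are represented as a flat list of floats, and the decay value is an assumption for illustration; the post does not give one.

```python
def ema_update(slow_params, params, decay=0.98):
    """Move the slow copy a small step toward the current parameters."""
    return [decay * s + (1 - decay) * p for s, p in zip(slow_params, params)]


# Hypothetical critic parameters that jump around during training.
slow = [0.0, 0.0]
for current in ([1.0, -1.0], [3.0, -3.0], [1.0, -1.0]):
    slow = ema_update(slow, current, decay=0.9)

one_step = ema_update([0.0], [1.0], decay=0.9)
print(slow, one_step)
```

The slow copy changes far less than the raw parameters from step to step, which is why predictions made with it are a steadier reference for the critic.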

Actor learning

At every time step the actor receives $s_t$ and samples an action $a_t$.

$$a_t \sim \pi_\theta(a_t \mid s_t)$$

In the figure, the joystick movement is the action $a$.
What the actor has to watch out for is repeating the same actions too much, which is why it explores. The right amount of exploration, however, depends on both the scale and the frequency of rewards in the environment. Ideally the agent explores more when rewards are sparse and exploits more when rewards are dense or nearby, and the amount of exploration should not change just because the absolute scale of the rewards changes. That is why the paper normalizes the scale of the returns.

$$\mathcal{L}(\theta) \doteq - \sum_{t=1}^{T} \frac{\text{sg}(R_t^\lambda - v_\psi(s_t))}{\max(1, S)} \log \pi_\theta(a_t \mid s_t) + \eta H[\pi_\theta(a_t \mid s_t)]$$

Here $\text{sg}(\cdot)$ is the stop-gradient operator, $S$ is the return scale used for normalization, and $\eta$ weights the entropy bonus $H$.

Honestly, this is very hard...
Just think of it as the actor's loss function: because reward scales differ from one environment to the next, it normalizes the returns themselves so that learning stays balanced!
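Here is a small numeric sketch of that loss. Several things are assumptions rather than statements from this post: the scale `S` is taken to be the range between the 5th and 95th percentile of the returns (my recollection of the paper, without its smoothing over time), the entropy term is read as a bonus inside the negated sum, and `eta` is a placeholder value. Stop-gradient has no meaning in plain Python, so the advantage is simply treated as a constant.

```python
import math


def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def return_scale(returns):
    """Assumed definition of S: spread between the 5th and 95th return percentiles."""
    return percentile(returns, 95) - percentile(returns, 5)


def actor_loss(returns, values, log_probs, entropies, scale, eta=3e-4):
    """Policy-gradient loss with normalized advantages and an entropy bonus."""
    denom = max(1.0, scale)  # never scale small returns up, only large ones down
    loss = 0.0
    for ret, val, logp, ent in zip(returns, values, log_probs, entropies):
        advantage = (ret - val) / denom  # constant with respect to the actor (sg)
        loss -= advantage * logp + eta * ent
    return loss


big_scale = return_scale(list(range(101)))   # returns spread over 0..100
small_scale = return_scale([0.0, 0.1])       # tiny returns
loss_unscaled = actor_loss([2.0], [1.0], [-0.5], [0.0], scale=0.5, eta=0.0)
loss_scaled = actor_loss([2.0], [1.0], [-0.5], [0.0], scale=2.0, eta=0.0)
print(big_scale, small_scale, loss_unscaled, loss_scaled)
```

The `max(1, S)` in the denominator is the key design choice: large returns are scaled down, but small returns are left alone, so noise in a nearly reward-free environment is not amplified.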

Results

Benchmarks

The paper runs an extensive study across 8 domains that cover continuous and discrete actions, visual and low-dimensional inputs, dense and sparse rewards, different reward scales, 2D and 3D worlds, and procedural generation!

  • Atari, ProcGen, DMLab, Minecraft, Atari100k, Proprio Control, Visual Control, BSuite

Dreamer performed far better than PPO across the various domains.

Minecraft

Collecting diamonds in Minecraft has long been a hard problem in AI. (A new Minecraft benchmark for MAS came out recently as well, so it looks like Minecraft will keep being used...)
Dreamer is the first algorithm to collect diamonds in Minecraft from scratch without using human data.

I didn't know this, but apparently MineRL provides a dataset of human expert trajectories.

Ablation

a. Robustness techniques

  • The stabilization and normalization techniques introduced in Dreamer are removed one at a time! All of them matter, but the KL objective, return normalization, and the symexp twohot loss are especially important.
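Since the symexp twohot loss is named here without explanation, a small sketch may help. The idea, as I understand it, is to represent a scalar target as weights on the two nearest bins of a fixed grid, with the grid spaced evenly in symlog space so that it covers both tiny and huge values. The bin range and count below are arbitrary choices for the example, and only the encoding is shown, not the loss itself.

```python
import bisect
import math


def symlog(x):
    """sign(x) * ln(|x| + 1): compresses large magnitudes."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


def twohot(x, bins):
    """Encode x as weights on its two neighbouring bins; the weights sum to 1."""
    x = min(max(x, bins[0]), bins[-1])                       # clamp into the grid
    k = min(bisect.bisect_right(bins, x) - 1, len(bins) - 2)  # index of the lower bin
    lower, upper = bins[k], bins[k + 1]
    w_upper = (x - lower) / (upper - lower)
    encoding = [0.0] * len(bins)
    encoding[k] = 1.0 - w_upper
    encoding[k + 1] = w_upper
    return encoding


# Illustrative grid: 41 bins, evenly spaced in symlog space from -20 to 20.
bins = [symexp(v) for v in range(-20, 21)]
target = 1234.5
encoding = twohot(target, bins)
decoded = sum(p * b for p, b in zip(encoding, bins))  # expected value under the encoding
print(decoded)
```

Because the expected bin value under the encoding recovers the target, a network can be trained with a classification-style loss on the bins and still output a scalar, regardless of how large the returns in a given environment are.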

b. Learning signals (effect of the learning signals)

  • The two gradient signals used to train the world model are removed one at a time.
  • 1) Removing the gradients from reward and value prediction -> slight drop in performance
  • 2) Removing the gradients from observation reconstruction -> sharp drop

[!question] 😵‍💫 Why?
Observation reconstruction means the model regenerates the observation $x_t$ it received as input from the latent state representation $(h_t, z_t)$. In other words, it forces the model to learn rich, general representations, and it is very important for building the model's ability to imagine and predict the environment.
You might then ask why removing reward/value prediction matters less.
What matters here is what Dreamer is designed for in the first place. Dreamer is not "a model that predicts rewards well" but "a model that imagines and understands the world well". The actor and critic learn separately, in imagination, on top of the representations the world model has learned, so those representations do not have to be shaped by reward prediction itself!

c. Model size scaling

  • Performance improves as the model gets larger.

d. Effect of the replay ratio

  • A higher replay ratio, i.e. more gradient updates on replayed data per environment step, leads to better performance.

Conclusion

  • A general RL algorithm that performs well across diverse domains with fixed hyperparameters
  • Dreamer excels on more than 150 tasks and trains robustly across different data and compute budgets, a step toward RL that is usable in practice
  • Used out of the box with no extra tuning, Dreamer is the first algorithm to collect diamonds in Minecraft from scratch without human data
  • Dreamer is a high-performing algorithm built on a learned world model