🎓 RL4 - PPO (Proximal Policy Optimization)

MinSeok_CSE · July 21, 2025

Reinforcement Learning


🎓 Overview of the PPO (Proximal Policy Optimization) Algorithm

In 2017, OpenAI introduced a new approach to the problems of existing policy-based reinforcement learning in the paper "Proximal Policy Optimization Algorithms."

The PPO algorithm proposed there is regarded as simple to implement while delivering both stability and performance, and it has since become the de facto default in many reinforcement learning settings.

In reinforcement learning, policy-based methods directly learn the 'policy', the probability distribution that determines actions. Classic policy gradient methods, however, suffer from poor training stability, and the policy sometimes changes so drastically that performance deteriorates.

Stable algorithms such as TRPO (Trust Region Policy Optimization) were introduced to address this, but their computational complexity and implementation difficulty limited practical adoption.

PPO keeps the advantages of TRPO while introducing a simple clipping scheme that enables stable training. By preventing the policy from changing too abruptly, it keeps the policy from spiking or collapsing during training.

In short, PPO was created to solve both the instability and the implementation complexity of earlier policy-based reinforcement learning. Before studying PPO, it is therefore worth understanding why classic policy gradient methods were unstable and how PPO resolves that problem.

🎓 What Is the TRPO Algorithm?

In reinforcement learning, the policy-based approach learns the policy itself, i.e. the rule that decides which action to take in the environment. Classic policy gradient methods, however, frequently suffer from excessively large policy updates, so performance can actually deteriorate during training. TRPO (Trust Region Policy Optimization), proposed by Schulman et al. in 2015, was designed to fix this instability.

The core idea of TRPO is: "Update the policy, but not by too much."

TRPO limits the magnitude of each policy update, improving training stability and preventing performance collapse. To do so it uses the KL divergence (Kullback-Leibler divergence), a measure of the difference between two policies, and performs each update while constraining the 'distance' between the old and new policies to stay below a fixed threshold.

This confines policy improvement to a trust region that never strays far from the current policy. In practice the approach converges stably, achieves strong performance, and has proven effective in a variety of environments.

TRPO does have drawbacks, however: because it relies on second-order optimization, it is mathematically involved, difficult to implement, and computationally expensive.

PPO (Proximal Policy Optimization) was introduced to overcome these limitations. It implements TRPO's key idea of restraining policy change in a much simpler way, yet its performance is usually comparable and often better.

🎓 The PPO (Proximal Policy Optimization) Algorithm in Detail

โœ… PPO์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด: ์ •์ฑ… ๋ณ€ํ™” ์–ต์ œ

PPO computes how differently the current and previous policies weight a given action, and if that difference becomes too large, it is restrained through clipping.

The quantity used for this is the policy ratio (importance sampling ratio):

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

  • $\pi_\theta(a_t \mid s_t)$: probability that the current policy selects action $a_t$ in state $s_t$
  • $\pi_{\theta_{\text{old}}}(a_t \mid s_t)$: the same probability under the previous policy
  • $r_t(\theta) > 1$: the action has become more likely
  • $r_t(\theta) < 1$: the action has become less likely

In other words, a ratio greater than 1 means the policy has raised the probability of that action, and a ratio smaller than 1 means it has lowered it.
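In practice this ratio is usually computed from log-probabilities, as $r_t(\theta) = \exp(\log \pi_\theta - \log \pi_{\theta_{\text{old}}})$, for numerical stability. Below is a minimal Python sketch of that computation; the log-probability values are hypothetical and not taken from this post.

```python
import torch

# Hypothetical log-probabilities of the actions actually taken at three timesteps.
# logp_old comes from the policy that collected the data (pi_theta_old);
# logp_new is recomputed for the same actions with the current policy (pi_theta).
logp_old = torch.tensor([-1.20, -0.50, -2.30])
logp_new = torch.tensor([-1.00, -0.70, -2.30])

# r_t(theta) = pi_theta / pi_theta_old = exp(logp_new - logp_old)
ratio = torch.exp(logp_new - logp_old)
print(ratio)  # > 1 where an action became more likely, < 1 where it became less likely
```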

๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์„ ํ†ตํ•œ ์ƒ˜ํ”Œ ํšจ์œจ์„ฑ ํ–ฅ์ƒ : PPO๋Š” ํ™˜๊ฒฝ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ ๋ฒˆ ์ˆ˜์ง‘ํ•œ ๋’ค, ์ด ๋ฐ์ดํ„ฐ๋ฅผ ์—ฌ๋Ÿฌ ์—ํฌํฌ(epoch)์— ๊ฑธ์ณ ๋ฐ˜๋ณต ํ•™์Šตํ•œ๋‹ค. ๊ธฐ์กด ๋ฐฉ์‹์—์„œ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์žฌ์‚ฌ์šฉํ•˜๋ฉด ํ•™์Šต์ด ๋ถˆ์•ˆ์ •ํ•ด์กŒ์ง€๋งŒ, PPO๋Š” ํด๋ฆฌํ•‘ ๊ธฐ๋ฒ• ๋•๋ถ„์— ์ •์ฑ…์ด ์•ˆ์ •์ ์œผ๋กœ ์œ ์ง€๋˜๋ฏ€๋กœ ๋†’์€ ์ƒ˜ํ”Œ ํšจ์œจ์„ฑ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค.

✅ Clipped Surrogate Objective

To keep this ratio from growing or shrinking too much, PPO uses the following clipped surrogate loss:

$$L_t^{\text{CLIP}}(\theta) = \min \left( r_t(\theta) \cdot \hat{A}_t,\; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \cdot \hat{A}_t \right)$$

  • $\hat{A}_t$: advantage estimate
  • $\epsilon$: allowed range of policy change (e.g., 0.1 to 0.3)

This loss keeps only the smaller of the two terms:

  • When the advantage is positive, it limits how strongly the action can be rewarded:
    • $r_t(\theta) > 1 + \epsilon$ → reward would be excessive → clipped
  • When the advantage is negative, it keeps the penalty from becoming excessive:
    • $r_t(\theta) < 1 - \epsilon$ → penalty would be excessive → clipped

In this way the objective steers the policy toward stable, incremental improvement.
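Assuming the ratio and advantage estimates are already available as tensors, a minimal sketch of this clipped objective (written as a loss to minimize) might look like the following; the toy numbers are purely illustrative.

```python
import torch

def clipped_surrogate_loss(ratio, advantages, eps=0.2):
    """L^CLIP: keep the smaller of the unclipped and clipped terms, negated for minimization."""
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy check: a ratio of 1.8 with a positive advantage contributes only 1 + eps = 1.2
ratio = torch.tensor([1.8, 0.6, 1.0])
adv   = torch.tensor([1.0, -1.0, 0.5])
print(clipped_surrogate_loss(ratio, adv))
```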

✅ Structure of the PPO Loss Function

PPO์˜ ์ „์ฒด ์†์‹ค ํ•จ์ˆ˜๋Š” ๋‹ค์Œ ์„ธ ๊ฐ€์ง€ ํ•ญ๋ชฉ์œผ๋กœ ๊ตฌ์„ฑ๋œ๋‹ค.:

๐Ÿ“Œ L(ฮธ,ฯ•)=โˆ’LCLIP(ฮธ)+c1โ‹…LVF(ฯ•)โˆ’c2โ‹…LENT(ฮธ)L(\theta, \phi) = -L_{\text{CLIP}}(\theta) + c_1 \cdot L_{\text{VF}}(\phi) - c_2 \cdot L_{\text{ENT}}(\theta)

  • ฮธ\theta: ์ •์ฑ… ํŒŒ๋ผ๋ฏธํ„ฐ (Actor)
  • ฯ•\phi: ๊ฐ€์น˜ ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ (Critic)
  • c1,c2c_1,c_2: ๊ฐ๊ฐ ๊ฐ’ ํ•จ์ˆ˜ ์†์‹ค๊ณผ ์—”ํŠธ๋กœํ”ผ ํ•ญ์˜ ๊ฐ€์ค‘์น˜

1๏ธโƒฃ ์ •์ฑ… ์†์‹ค LCLIP(ฮธ)L^{\text{CLIP}}(\theta)
์ •์ฑ…์ด ๋„ˆ๋ฌด ๊ธ‰๊ฒฉํ•˜๊ฒŒ ๋ฐ”๋€Œ์ง€ ์•Š๋„๋ก ์–ต์ œํ•˜๋Š” ํ•ญ๋ชฉ์œผ๋กœ, PPO์˜ ํ•ต์‹ฌ

๐Ÿ“Œ LCLIP(ฮธ)=Et[minโก(rt(ฮธ)โ‹…A^t,โ€…โ€Šclip(rt(ฮธ),1โˆ’ฯต,1+ฯต)โ‹…A^t)]L_{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \cdot \hat{A}_t,\; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \cdot \hat{A}_t \right) \right]

2๏ธโƒฃ ๊ฐ’ ํ•จ์ˆ˜ ์†์‹ค LVF(ฯ•)L^{\text{VF}}(\phi)
๊ฐ€์น˜ ํ•จ์ˆ˜๊ฐ€ ์˜ˆ์ธกํ•œ ๊ฐ’๊ณผ ์‹ค์ œ ๋ณด์ƒ์˜ ์ฐจ์ด๋ฅผ ์ค„์ด๋Š” ํ•ญ๋ชฉ์œผ๋กœ, ํ‰๊ท ์ œ๊ณฑ์˜ค์ฐจ(MSE)๋ฅผ ์‚ฌ์šฉ

๐Ÿ“Œ LVF(ฯ•)=12โ‹…Et[(Vฯ•(st)โˆ’Rt)2]L_{\text{VF}}(\phi) = \frac{1}{2} \cdot \mathbb{E}_t \left[ \left( V_\phi(s_t) - R_t \right)^2 \right]

  • Vฯ•(st)V_\phi(s_t): ์ƒํƒœ sts_t์— ๋Œ€ํ•œ ์˜ˆ์ธก ๊ฐ€์น˜
  • RtR_t: ์‹ค์ œ ๋ˆ„์  ๋ณด์ƒ
    โ†’ MSE ๊ธฐ๋ฐ˜ ํšŒ๊ท€ ์†์‹ค

3๏ธโƒฃ ์—”ํŠธ๋กœํ”ผ ๋ณด๋„ˆ์Šค LENT(ฮธ)L^{\text{ENT}}(\theta)
์ •์ฑ…์˜ ๋ฌด์ž‘์œ„์„ฑ์„ ์œ ์ง€ํ•ด์„œ ๋‹ค์–‘ํ•œ ํ–‰๋™์„ ์‹œ๋„ํ•˜๊ฒŒ๋” ๋„์™€์ฃผ๋Š” ํ•ญ๋ชฉ

๐Ÿ“Œ LENT(ฮธ)=Et[H(ฯ€ฮธ(โ‹…โˆฃst))]L_{\text{ENT}}(\theta) = \mathbb{E}_t \left[ H(\pi_\theta(\cdot \mid s_t)) \right]

  • HH: ์ •์ฑ…์˜ ์—”ํŠธ๋กœํ”ผ (๋ฌด์ž‘์œ„์„ฑ)
    โ†’ ํƒํ—˜์„ ์œ ์ง€ํ•˜๊ณ  ๊ณผ๋„ํ•œ ๊ฒฐ์ •๋ก ์  ์ •์ฑ…์„ ๋ฐฉ์ง€

โœ… ์ ์‘ํ˜• KL ํŽ˜๋„ํ‹ฐ (Adaptive KL Penalty)

Besides the clipping approach used most often in practice, the PPO paper also proposes and compares another way to restrain policy change: the adaptive KL penalty.

Instead of directly limiting the size of the policy update through clipping, this method adds a KL divergence term to the objective as a penalty.

The objective is:

📌 $$L_{\text{KLPEN}}(\theta) = \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t - \beta \, \mathrm{KL}\left[ \pi_{\theta_{\text{old}}} \,\|\, \pi_\theta \right] \right]$$

The key point is that the coefficient $\beta$, which controls the strength of the penalty, is not fixed but adjusted dynamically.

์—…๋ฐ์ดํŠธ ๊ทœ์น™: ๋งค ์ •์ฑ… ์—…๋ฐ์ดํŠธ ํ›„, ์ด์ „ ์ •์ฑ…๊ณผ ํ˜„์žฌ ์ •์ฑ… ์‚ฌ์ด์˜ ์‹ค์ œ KL ๋ฐœ์‚ฐ ๊ฐ’(dd)๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.
1๏ธโƒฃ ์‹ค์ œ KL ๋ฐœ์‚ฐ ๊ฐ’ d=KL[ฯ€ฮธoldโ€‰โˆฅโ€‰ฯ€ฮธ]d = \mathrm{KL}\bigl[\pi_{\theta_{\mathrm{old}}}\,\|\,\pi_\theta\bigr]๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

2๏ธโƒฃ d<dtarg1.5d < \tfrac{d_{\mathrm{targ}}}{1.5}์ด๋ฉด, ์ •์ฑ… ๋ณ€ํ™”๊ฐ€ ๋” ํ•„์š”ํ•˜๋‹ค๊ณ  ํŒ๋‹จํ•˜์—ฌ ํŽ˜๋„ํ‹ฐ๋ฅผ ์ค„์ธ๋‹ค(ฮฒโ†ฮฒ2\beta \leftarrow \tfrac{\beta}{2}).

3๏ธโƒฃ d>dtargร—1.5d > d_{\mathrm{targ}} \times 1.5์ด๋ฉด, ์ •์ฑ…์ด ๋„ˆ๋ฌด ๋งŽ์ด ๋ณ€ํ–ˆ๋‹ค๊ณ  ํŒ๋‹จํ•˜์—ฌ ํŽ˜๋„ํ‹ฐ๋ฅผ ๋Š˜๋ฆฐ๋‹ค(ฮฒโ†ฮฒร—2\beta \leftarrow \beta \times 2).

์ด ๋ฐฉ์‹์€ ์ •์ฑ… ์—…๋ฐ์ดํŠธ ํฌ๊ธฐ๋ฅผ ๋ชฉํ‘œ ๋ฒ”์œ„ ์•ˆ์œผ๋กœ ์œ ๋„ํ•˜๋Š” ํ•ฉ๋ฆฌ์ ์ธ ๋Œ€์•ˆ์ด๋‹ค. ํ•˜์ง€๋งŒ ๋…ผ๋ฌธ์˜ ์‹คํ—˜ ๊ฒฐ๊ณผ์— ๋”ฐ๋ฅด๋ฉด, ์ด ์ ์‘ํ˜• KL ํŽ˜๋„ํ‹ฐ ๋ฐฉ์‹์€ ํด๋ฆฌํ•‘์„ ์‚ฌ์šฉํ•œ ์ฃผ๋œ PPO ๋ฐฉ์‹๋ณด๋‹ค ์ „๋ฐ˜์ ์œผ๋กœ ์„ฑ๋Šฅ์ด ๋‹ค์†Œ ๋–จ์–ด์กŒ๋‹ค. ์ด ๋•Œ๋ฌธ์— ์˜ค๋Š˜๋‚  PPO๋ผ๊ณ  ํ•˜๋ฉด ์ผ๋ฐ˜์ ์œผ๋กœ ํด๋ฆฌํ•‘ ๋ฐฉ์‹์„ ์˜๋ฏธํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค.

✅ How Is the Advantage Computed? (GAE)

PPO์—์„œ๋Š” ๋ณดํ†ต GAE(Generalized Advantage Estimation)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์–ด๋“œ๋ฐดํ‹ฐ์ง€๋ฅผ ๊ณ„์‚ฐํ•˜๋ฉฐ, ๋‹จ๊ธฐ ๋ณด์ƒ๋งŒ์ด ์•„๋‹ˆ๋ผ, ๋ฏธ๋ž˜ ๋ณด์ƒ๊นŒ์ง€ ๊ณ ๋ คํ•œ ์ข€ ๋” ์ •๊ตํ•œ ํ‰๊ฐ€ ๋ฐฉ์‹์ด๋‹ค.

๐Ÿ“Œ A^t=โˆ‘l=0โˆž(ฮณฮป)lโ‹…ฮดt+l\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \cdot \delta_{t+l}

๐Ÿ“Œ ฮดt=rt+ฮณVฯ•(st+1)โˆ’Vฯ•(st)\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)

  • ฮณ\gamma: ํ• ์ธ์œจ (๋ฏธ๋ž˜ ๋ณด์ƒ์˜ ์ค‘์š”๋„)
  • ฮป\lambda: GAE ๊ณ„์ˆ˜ (ํŽธํ–ฅ vs ๋ถ„์‚ฐ ์ ˆ์ถฉ)

GAE๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํŠน์ง•์„ ๊ฐ€์ง„๋‹ค.

  • ฮป=0\lambda=0: ํŽธํ–ฅ์€ ๋†’์ง€๋งŒ, ๋ถ„์‚ฐ์€ ๋‚ฎ๋‹ค (TD ๋ฐฉ์‹์— ๊ฐ€๊นŒ์›€)
  • ฮปโ†’1\lambda\to1: ํŽธํ–ฅ์€ ๋‚ฎ์ง€๋งŒ, ๋ถ„์‚ฐ์€ ๋†’๋‹ค (Monte Carlo ๋ฐฉ์‹์— ๊ฐ€๊นŒ์›€)

์ด ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋ฉด Advantage๋ฅผ ๋” ์‹ ์ค‘ํ•˜๊ฒŒ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์–ด, ์ •์ฑ… ์—…๋ฐ์ดํŠธ์˜ ํ’ˆ์งˆ์ด ๋†’์•„์ง„๋‹ค.

🎓 Wrap-Up

So far we have covered the core concepts and mathematical structure of PPO (Proximal Policy Optimization), the clipping technique used to stabilize the policy, the adaptive KL penalty, and the advantage computation method (GAE).

With this theory in place, it is now time to see how PPO behaves in an actual reinforcement learning environment.

In the next post, we will build an autonomous parking simulation environment with Unity ML-Agents and apply PPO to train an agent that avoids obstacles and parks precisely in the target spot.

0๊ฐœ์˜ ๋Œ“๊ธ€