4. Off-policy Policy Gradient

이은상 · October 13, 2024

Taxonomy of RL algorithms


In this taxonomy,

  • Policy Optimization : on-policy RL
  • Q-Learning : off-policy RL

Off-policy vs. On-policy


Policy Gradient is On-policy

algorithm

  1. Sample $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$ (run it on the robot)
    This step cannot be skipped!
  2. $\nabla_\theta J(\theta) \approx \sum_i \big(\sum_t \nabla_\theta \log \pi_\theta (a_t^i|s_t^i)\big) \big(\sum_t r(s_t^i,a_t^i)\big)$
  3. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
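
A minimal NumPy sketch of these three steps for a tabular softmax policy is below. The `env` object (gym-style `reset`/`step` returning `(next_state, reward, done)`) and all hyperparameters are my own assumptions for illustration, not something specified in the lecture.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, env, n_traj=10, alpha=1e-2, horizon=200):
    """One on-policy update: (1) sample trajectories with the CURRENT policy,
    (2) form the gradient estimate, (3) take a gradient ascent step."""
    grad = np.zeros_like(theta)
    for _ in range(n_traj):
        s = env.reset()
        sum_glogp = np.zeros_like(theta)   # sum_t grad log pi(a_t|s_t)
        sum_r = 0.0                        # sum_t r(s_t, a_t)
        for _ in range(horizon):
            probs = softmax(theta[s])      # theta[s]: logits for discrete state s
            a = np.random.choice(len(probs), p=probs)
            g = -probs
            g[a] += 1.0                    # grad of log-softmax w.r.t. the logits of state s
            sum_glogp[s] += g
            s, r, done = env.step(a)       # hypothetical gym-style environment
            sum_r += r
            if done:
                break
        grad += sum_glogp * sum_r          # (sum_t grad log pi)(sum_t r)
    return theta + alpha * grad / n_traj   # theta <- theta + alpha * grad J(theta)
```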

pros

  • In principle, convergence is guaranteed eventually.

cons

  • Neural networks change only a little at each gradient step.
    This is because NNs are non-linear, so a large parameter change makes optimization break down.
  • For this reason, on-policy learning can become extremely inefficient:
    every small gradient step requires fresh samples from the current policy.

Off-policy learning & importance sampling

We want to sample from a different distribution than the one the current policy generates.

In particular, we want to draw more samples from the regions where the density is high.

importance sampling

$$E_{x\sim p(x)}[f(x)] = \int p(x)f(x)\,dx = \int \frac{q(x)}{q(x)}p(x)f(x)\,dx = \int q(x)\frac{p(x)}{q(x)}f(x)\,dx = E_{x\sim q(x)}\Big[\frac{p(x)}{q(x)} f(x)\Big]$$

Through this importance sampling identity, we can estimate the expectation
while sampling from a different distribution.


Unlike uniform sampling, importance sampling follows the shape of $p(x)$:
regions where $p(x)$ is high contribute more.
Here, $p(x)$ is the original distribution that we want to learn from.
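
A small numerical sketch of the identity (a toy example of my own, not from the lecture): estimate $E_{x\sim p}[x^2]$ for a standard normal $p$ while drawing samples only from a shifted, wider Gaussian $q$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x): standard normal.  Proposal q(x): Normal(mean=2, std=2).
p = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
q = lambda x: np.exp(-0.5 * ((x - 2) / 2) ** 2) / (2 * np.sqrt(2 * np.pi))
f = lambda x: x**2                       # E_{x~p}[x^2] = 1 for a standard normal

x = rng.normal(2.0, 2.0, size=100_000)   # samples come from q, NOT from p
w = p(x) / q(x)                          # importance weights p(x)/q(x)
print((w * f(x)).mean())                 # ~= 1.0, the expectation under p
```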

Applying this to the policy gradient

objective function of policy gradient

$$\theta^* = \underset{\theta}{\mathrm{argmax}}\, J(\theta)$$
$$J(\theta) = E_{\tau\sim p_\theta(\tau)}[r(\tau)]$$

What if we don't have samples from $p_\theta(\tau)$,
and instead have samples from some other distribution $\bar{p}(\tau)$?
$\bar{p}(\tau)$ plays the role of $q(x)$ above: it is the importance sampling distribution.

Using the importance sampling identity, the objective function becomes

$$J(\theta) = E_{\tau\sim\bar{p}(\tau)}\Big[\frac{p_\theta(\tau)}{\bar{p}(\tau)}r(\tau)\Big]$$

$$p_\theta(\tau) = p(s_1)\prod_{t=1}^{T}\pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$

$$\frac{p_\theta(\tau)}{\bar{p}(\tau)} = \frac{p(s_1)\prod_{t=1}^{T}\pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^{T}\bar{\pi}(a_t|s_t)\,p(s_{t+1}|s_t,a_t)} = \frac{\prod_{t=1}^{T}\pi_\theta(a_t|s_t)}{\prod_{t=1}^{T}\bar{\pi}(a_t|s_t)}$$
$\bar{\pi}$ : the policy that we are sampling from
The cancelled terms (initial state and transition probabilities) come from the same environment, so they are identical in the numerator and denominator and cancel out.

Looking at the final expression again (a short code sketch follows this list):

  • Denominator: the policy used for importance sampling (the one we sampled with)
  • Numerator: the policy we want to improve
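
In code, this ratio is just the product of the per-step action-probability ratios; everything else cancels. A minimal sketch (the helper and its inputs are hypothetical, and the sum is done in log space to avoid numerical under/overflow):

```python
import numpy as np

def trajectory_importance_weight(logp_new, logp_old):
    """p_theta'(tau) / p_theta(tau) for one trajectory.

    logp_new[t] = log pi_theta'(a_t | s_t)   (policy we want to improve)
    logp_old[t] = log pi_theta (a_t | s_t)   (policy the data was sampled from)

    The initial-state and transition terms are identical in numerator and
    denominator, so only the per-step policy ratios remain.
    """
    logp_new = np.asarray(logp_new)
    logp_old = np.asarray(logp_old)
    return float(np.exp(np.sum(logp_new - logp_old)))
```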

Deriving policy gradient with importance sampling

$$J(\theta) = E_{\tau\sim\bar{p}(\tau)}\Big[\frac{p_\theta(\tau)}{\bar{p}(\tau)}r(\tau)\Big]$$


From the expression above, we can estimate new parameters $\theta'$ using samples drawn from the old policy $\pi_\theta$:

$$J(\theta') = E_{\tau\sim p_\theta(\tau)}\Big[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}r(\tau)\Big]$$
$p_{\theta'}(\tau)$ : the only part that depends on the new parameters $\theta'$; this is what makes it an off-policy policy gradient.

$$\nabla_{\theta'}J(\theta') = E_{\tau\sim p_\theta(\tau)}\Big[\frac{\nabla_{\theta'}p_{\theta'}(\tau)}{p_\theta(\tau)}r(\tau)\Big] = E_{\tau\sim p_\theta(\tau)}\Big[\frac{p_{\theta'}(\tau)\,\nabla_{\theta'}\log p_{\theta'}(\tau)}{p_\theta(\tau)}r(\tau)\Big]$$
When $\theta = \theta'$, the ratio cancels and we recover the standard on-policy policy gradient.
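
As a quick sanity check (my own restatement), substituting $\theta' = \theta$ makes the ratio equal to 1:

$$\nabla_{\theta'}J(\theta')\Big|_{\theta'=\theta} = E_{\tau\sim p_\theta(\tau)}\Big[\frac{p_\theta(\tau)\,\nabla_\theta\log p_\theta(\tau)}{p_\theta(\tau)}\,r(\tau)\Big] = E_{\tau\sim p_\theta(\tau)}\big[\nabla_\theta\log p_\theta(\tau)\,r(\tau)\big]$$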

The off-policy policy gradient

$$\nabla_{\theta'}J(\theta') = E_{\tau\sim p_\theta(\tau)}\Big[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\,\nabla_{\theta'}\log p_{\theta'}(\tau)\,r(\tau)\Big] \quad \text{when } \theta\neq\theta'$$
$$\quad\quad = E_{\tau\sim p_\theta(\tau)}\Big[\Big(\prod_{t=1}^{T}\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\Big)\Big(\sum_{t=1}^{T}\nabla_{\theta'}\log \pi_{\theta'}(a_t|s_t)\Big)\Big(\sum_{t=1}^{T}r(s_t,a_t)\Big)\Big]$$

Applying causality:
$$\nabla_{\theta'}J(\theta') = E_{\tau\sim p_\theta(\tau)}\Big[\sum_{t=1}^{T}\nabla_{\theta'}\log\pi_{\theta'}(a_t|s_t)\Big(\prod_{t'=1}^{t}\frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_\theta(a_{t'}|s_{t'})}\Big)\Big(\sum_{t'=t}^{T}r(s_{t'}, a_{t'})\Big(\prod_{t''=t}^{t'}\frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_\theta(a_{t''}|s_{t''})}\Big)\Big)\Big]$$

  • $\prod_{t'=1}^{t}\frac{\pi_{\theta'}(a_{t'}|s_{t'})}{\pi_\theta(a_{t'}|s_{t'})}$ : reflects the fact that future actions don't affect the current weight
  • $\prod_{t''=t}^{t'}\frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_\theta(a_{t''}|s_{t''})}$ : if we ignore this term, we get a policy iteration algorithm


However, in the expression below,
$$\nabla_{\theta'}J(\theta') = E_{\tau\sim p_\theta(\tau)}\Big[\sum_{t=1}^{T}\nabla_{\theta'}\log \pi_{\theta'}(a_t|s_t)\Big(\prod_{t=1}^{T}\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\Big)\Big(\sum_{t=1}^{T}r(s_t,a_t)\Big)\Big]$$

because $\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)} < 1$ in general, multiplying many of these ratios makes the weight extremely small.
Since we sample from $\pi_\theta$, the sampled actions tend to be more probable under $\pi_\theta$ than under $\pi_{\theta'}$.
→ If the trajectory is long, hardly any gradient flows to the later actions.
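
A tiny numerical illustration of this effect (the per-step ratios below are made-up numbers, slightly below 1 on average):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-step ratios pi_theta'(a_t|s_t) / pi_theta(a_t|s_t),
# slightly below 1 on average because the actions were sampled from pi_theta.
ratios = rng.uniform(0.8, 1.05, size=200)
cumulative = np.cumprod(ratios)          # weight attached to timestep t

for t in (10, 50, 100, 200):
    print(f"t={t:3d}  cumulative weight = {cumulative[t - 1]:.2e}")
# The weight shrinks roughly exponentially with t, so late actions in a
# long trajectory receive almost no gradient signal.
```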

Let's write the objective a bit differently

  • on-policy policy gradient
    $$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t}$$
  • off-policy policy gradient
    $$\nabla_{\theta'} J(\theta') \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \frac{\pi_{\theta'}(s_{i,t}, a_{i,t})}{\pi_\theta(s_{i,t}, a_{i,t})}\,\nabla_{\theta'}\log\pi_{\theta'}(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t}$$
    $$\quad\quad \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\frac{\pi_{\theta'}(s_{i,t})}{\pi_\theta(s_{i,t})}\,\frac{\pi_{\theta'}(a_{i,t}|s_{i,t})}{\pi_{\theta}(a_{i,t}|s_{i,t})}\,\nabla_{\theta'}\log\pi_{\theta'}(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t}$$
    • Here, if we suppose $\pi_{\theta'}(s_t) = \pi_\theta(s_t)$, we can ignore the state-marginal ratio $\frac{\pi_{\theta'}(s_{i,t})}{\pi_\theta(s_{i,t})}$.
      This sometimes works in practice, when the sampling policy visits roughly the same state distribution as the current policy. A code sketch of the resulting estimator follows this list.
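
Here is a minimal NumPy sketch of the last estimator (the one that drops the state-marginal ratio). The function name, array shapes, and inputs are my own assumptions for illustration:

```python
import numpy as np

def off_policy_pg_estimate(grad_logp_new, logp_new, logp_old, q_hat):
    """Per-step importance-weighted policy gradient estimate.

    grad_logp_new : (N, T, D)  grad_theta' log pi_theta'(a_{i,t} | s_{i,t})
    logp_new      : (N, T)     log pi_theta'(a_{i,t} | s_{i,t})
    logp_old      : (N, T)     log pi_theta (a_{i,t} | s_{i,t})  (sampling policy)
    q_hat         : (N, T)     reward-to-go estimates Q_hat_{i,t}

    The state-marginal ratio pi_theta'(s)/pi_theta(s) is ignored, so each
    term is weighted only by the action ratio pi_theta'(a|s)/pi_theta(a|s).
    """
    w = np.exp(logp_new - logp_old)                    # (N, T) importance weights
    weighted = (w * q_hat)[..., None] * grad_logp_new  # (N, T, D)
    return weighted.sum(axis=1).mean(axis=0)           # sum over t, average over N
```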
