Silver RL (5) Model-Free Control


These are my notes on Lecture 5: Model-Free Control (Youtube) from Professor David Silver's Introduction to Reinforcement Learning (Website).

Introduction

Model-Free Reinforcement Learning

  • Last lecture: model-free prediction
    (Estimate the value function of an unknown MDP)
  • This lecture: model-free control
    (Optimise the value function of an unknown MDP)
  • Uses of Model Free Control
    • MDP model is unknown, but experience can be sampled
      (e.g. Portfolio management, Game of Go)
    • MDP model is known, but is too big to use, except by samples
      (e.g. Helicopter, Ship Steering)

On and Off-Policy Learning

  • On-policy learning
    • "Learn on the job"
    • Learn about policy $\pi$ from experience sampled from $\pi$
  • Off-policy learning
    • "Look over someone's shoulder"
    • Learn about policy $\pi$ from experience sampled from $\mu$
      (e.g. to control a robot, learn from a human's movements (= a better policy)!)

On-Policy Monte-Carlo Control

Generalised Policy Iteration (Lec. 3)

  • Recall!

Image from: here

  • Model is known
    • Policy Evaluation: Estimate $v_\pi$
    • Policy Improvement: Generate $\pi' \geq \pi$

Generalised Policy Iteration With Monte-Carlo Evaluation

  • Now, the model is unknown!
    • Policy Evaluation: MC policy evaluation (Lec. 4), $V = v_\pi$?
      No! Since $V$ is obtained from samples, it is not guaranteed to equal $v_\pi$ (it is an approximation).
    • Policy Improvement: Greedy policy improvement?

Model-Free Policy Iteration Using Action-Value Function

  • Greedy policy improvement over $V(s)$ requires a model of the MDP
    • Recall, $\pi'(s)=\argmax\limits_{a\in\mathcal{A}}\left[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}V(s')\right]$
      Note: knowing $\mathcal{R}^a_s$ & $\mathcal{P}^a_{ss'}$ $\iff$ knowing the MDP.
    • $\mathcal{P}$ could also be approximated from samples (via relative frequencies), but that is too expensive (it must be estimated for every state $\times$ action pair). Moreover, doing so amounts to building a model, which defeats the purpose of being model-free!
    • $V(s')$ is a relatively inaccurate estimate of $v_{\pi}(s')$.
      $\rightarrow$ Since $\mathcal{P}$ and $V$ are both estimates, multiplying them compounds the error. (my own reasoning)
  • Greedy policy improvement over $Q(s,a)$ is model-free!
    • $\pi'(s)=\argmax\limits_{a\in\mathcal{A}}Q(s,a)$
    • Here, $Q(s,a)$ is the estimated action-value function
      $Q(s,a)=\frac{1}{N(s,a)}\sum{G_t}$
      Note: $G_t$ is an empirical return, with $R_{t+1}=\mathcal{R}_{s}^{a}$
    • $Q(s,a)$ is a relatively accurate estimate of $q_{\pi}(s,a)$
      ($\because$ $R_t$ is obtained directly from taking action $A_t$ in state $S_t$)
  • In summary, with the state value function $V(s)$ it is hard to do policy improvement in control.
    So we evaluate and improve using the action value function $Q(s,a)$ instead.

Generalised Policy Iteration with Action-Value Function

Image from: here

  • Policy Evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$
    • Apply the policy evaluation from Lec. 4 to $Q$.
    • Or using the incremental mean, for each state $S_t$ and action $A_t$ in an episode,
      $N(S_t,A_t) \gets N(S_t,A_t)+1$
      $Q(S_t,A_t) \gets Q(S_t,A_t)+\frac{1}{N(S_t,A_t)}(G_t-Q(S_t,A_t))$
  • Policy Improvement: Greedy policy improvement?
    • With a greedy policy, we can get stuck in a local optimum.
      (An $(s,a)$ pair that MC never sampled might be the optimal one!)
    • We need to make sure we continue to see everything!

Exploration: $\epsilon$-Greedy Policy Improvement

  • With prob. $1-\epsilon$, choose the greedy action
    With prob. $\epsilon$, choose an action at random
    (a code sketch follows the proof below)
  • $\pi(a|s)=\begin{cases} \epsilon/m+1-\epsilon & \text{if } a=\argmax\limits_{a'\in\mathcal{A}}Q(s,a')\\ \epsilon/m & \text{otherwise} \end{cases}$

Theorem
For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ w.r.t. $q_\pi$ is an improvement,
i.e., $v_{\pi'}(s) \geq v_{\pi}(s)$

Proof)
$$\begin{aligned} q_\pi(s, \pi'(s)) &= \sum\limits_{a\in\mathcal{A}}\pi'(a|s)\,q_\pi(s,a)\\ &= \epsilon/m\sum\limits_{a\in\mathcal{A}}q_\pi(s,a)+(1-\epsilon)\max\limits_{a\in\mathcal{A}}q_\pi(s,a)\\ &\geq \epsilon/m\sum\limits_{a\in\mathcal{A}}q_\pi(s,a)+(1-\epsilon)\sum\limits_{a\in\mathcal{A}}\frac{\pi(a|s)-\epsilon/m}{1-\epsilon}\,q_\pi(s,a)\\ &= \sum\limits_{a\in\mathcal{A}}\pi(a|s)\,q_\pi(s,a) = v_\pi(s) \end{aligned}$$
$\therefore v_{\pi'}(s) \geq v_\pi(s)$, since $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ and the policy improvement theorem (Lec. 3) then gives the result.
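To make the $\epsilon$-greedy policy above concrete, here is a minimal action-selection sketch over a tabular $Q$; the array shape (n_states $\times$ n_actions) and the function name are my own assumptions, not from the lecture.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng):
    """Pick an action for `state` from a tabular Q of shape (n_states, n_actions).

    With probability epsilon pick uniformly at random (so every action keeps
    probability at least epsilon / m); otherwise pick the greedy action.
    """
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit

# usage: action = epsilon_greedy_action(Q, s, 0.1, np.random.default_rng(0))
```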

Monte-Carlo Policy Iteration

Image from: here

  • Policy Evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$
  • Policy Improvement: $\epsilon$-greedy policy improvement

Monte-Carlo Policy Control

Image from: here

  • To obtain an accurate $Q$, we would have to sample many episodes.
  • But since our goal is to find the optimal policy, it is more efficient to update episode-by-episode.
  • At every episode
    • Policy evaluation: Monte-Carlo policy evaluation, $Q \approx q_\pi$
    • Policy improvement: $\epsilon$-greedy policy improvement

GLIE

Greedy in the Limit with Infinite Exploration (GLIE) property

  • All state-action pairs are explored infinitely many times
    $\lim\limits_{k\to\infty}N_k(s,a)=\infty$
  • The policy converges on a greedy policy
    $\lim\limits_{k\to\infty}\pi_k(a|s)=\mathbf{1}(a=\argmax\limits_{a'\in\mathcal{A}}Q_k(s,a'))$
  • We should train in a way that satisfies the GLIE property!
    • $\epsilon$-greedy is GLIE if $\epsilon$ is reduced towards zero, e.g. with $\epsilon_k=\frac{1}{k}$

GLIE Monte-Carlo Control

  • Sample the $k$th episode using $\pi$: $\{S_1,A_1,R_2,...,S_T\}\sim\pi$
  • For each state $S_t$ and action $A_t$ in the episode,
    $N(S_t,A_t) \gets N(S_t,A_t)+1$
    $Q(S_t,A_t) \gets Q(S_t,A_t)+\frac{1}{N(S_t,A_t)}(G_t-Q(S_t,A_t))$
  • Improve the policy based on the new action-value function
    $\epsilon \gets 1/k$
    $\pi \gets \epsilon\text{-greedy}(Q)$
    (a code sketch follows the theorem below)

    Theorem
    GLIE Monte-Carlo control converges to the optimal action-value function,
    i.e., $Q(s,a) \to q_*(s,a)$
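As referenced above, here is a minimal sketch of GLIE Monte-Carlo control. The `env` interface (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`) and the tabular layout are assumptions for illustration, not part of the lecture.

```python
import numpy as np

def glie_mc_control(env, n_states, n_actions, n_episodes=10_000, gamma=1.0, seed=0):
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))              # visit counts N(s, a)
    rng = np.random.default_rng(seed)

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                            # GLIE schedule: eps_k = 1/k
        # generate one episode with the current eps-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # every-visit MC evaluation with running-mean step size 1/N(s, a)
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / N[state, action]
    return Q
```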

On-Policy Temporal-Difference Learning

MC vs. TD Control

  • Advantages of TD learning over MC
    • Lower variance
    • Can learn online (Incomplete sequences)
  • Natural Idea: use TD instead of MC in our control loop
    • Apply TD to compute $Q(S,A)$
    • Use $\epsilon$-greedy policy improvement

Updating Action-Value Functions with Sarsa

  • In Lec. 4 we estimated $V$ with TD.
    Sarsa estimates $Q$ with TD!
  • Sarsa:
    • $(S,A)$ ... current state & action
    • $R$ ......... reward
    • $S'$ ........ next state
    • $A'$ ........ next action (determined by some policy $\pi$)
  • $Q(S,A) \gets Q(S,A)+\alpha(R+\gamma{Q(S',A')}-Q(S,A))$
    • Here,
      $Q(S,A)$ is the estimate before taking a step
      $R+\gamma{Q(S',A')}$ is the estimate after taking a step
  • Replacing the way $Q$ is computed in MC control with the update above gives Sarsa!

On-Policy Control With Sarsa

  • At every time-step (MC updates every episode)
    • Policy Evaluation: Sarsa, $Q \approx q_\pi$
    • Policy Improvement: $\epsilon$-greedy policy improvement
  • Sarsa Algorithm for On-Policy Control
    Image from: here
    • Note: $Q(s,a)$ is an $s\times a$ table!
      (a code sketch follows the convergence theorem below)

Theorem (Convergence of Sarsa)
Sarsa converges to the optimal action-value function under the following conditions:

  • GLIE sequence of policies $\pi_t(a|s)$
  • Robbins-Monro sequence of step-sizes $\alpha_t$
    $\sum\limits_{t=1}^{\infty}\alpha_t=\infty$
    $\sum\limits_{t=1}^{\infty}\alpha_t^2<\infty$
    => The step sizes must be large enough in total to keep learning, yet shrink fast enough that the updates eventually settle down.
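As referenced above, a sketch of the tabular Sarsa control loop, under the same assumed `env` interface as before (for simplicity, $\epsilon$ and $\alpha$ are kept fixed rather than annealed).

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes=10_000,
          alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    Q = np.zeros((n_states, n_actions))              # the s x a table
    rng = np.random.default_rng(seed)

    def policy(state):                               # eps-greedy w.r.t. the current Q
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))

    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = policy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = policy(next_state)
            # Sarsa target bootstraps on the action actually taken next
            bootstrap = 0.0 if done else Q[next_state, next_action]
            Q[state, action] += alpha * (reward + gamma * bootstrap - Q[state, action])
            state, action = next_state, next_action
    return Q
```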

n-Step Sarsa

  • The TD target we saw above is the 1-step Q-return!
  • Consider n-step returns for $n=1,2,...,\infty$:
    $n=1$ (Sarsa): $q_t^{(1)}=R_{t+1}+\gamma{Q(S_{t+1},A_{t+1})}$
    $n=2$: $q_t^{(2)}=R_{t+1}+\gamma{R_{t+2}}+\gamma^2{Q(S_{t+2},A_{t+2})}$
    $\vdots$
    $n=\infty$ (MC): $q_t^{(\infty)}=R_{t+1}+\gamma{R_{t+2}}+...+\gamma^{T-t-1}R_T$
  • Define the n-step Q-return
    $q_t^{(n)}=R_{t+1}+\gamma{R_{t+2}}+...+\gamma^{n-1}{R_{t+n}}+\gamma^n{Q(S_{t+n},A_{t+n})}$
    This becomes the TD target!
  • n-step Sarsa updates $Q(s,a)$ towards the n-step $Q$-return (see the helper sketch below)
    $Q(S_t,A_t) \gets Q(S_t,A_t)+\alpha(q_t^{(n)}-Q(S_t,A_t))$
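As mentioned above, here is a small helper that computes the n-step Q-return $q_t^{(n)}$ from one recorded episode; the list layout (rewards[k] holding $R_{k+1}$) and the truncation at the episode end are my own conventions for illustration.

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma=1.0):
    """q_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n Q(S_{t+n}, A_{t+n}).

    rewards[k] holds R_{k+1}; states[k] / actions[k] hold S_k / A_k;
    Q is a (n_states, n_actions) table. If t + n reaches the end of the
    episode, the bootstrap term is dropped (the Monte-Carlo tail, n = infinity).
    """
    T = len(rewards)                                  # episode length
    G = sum(gamma ** i * rewards[t + i] for i in range(min(n, T - t)))
    if t + n < T:                                     # bootstrap from Q(S_{t+n}, A_{t+n})
        G += gamma ** n * Q[states[t + n], actions[t + n]]
    return G
```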

Forward View Sarsa($\lambda$)

  • The $q^\lambda_t$ return combines all n-step $Q$-returns $q^{(n)}_t$
  • Using weight $(1-\lambda)\lambda^{n-1}$
    $q^\lambda_t=(1-\lambda)\sum\limits_{n=1}^{\infty}\lambda^{n-1}q_t^{(n)}$
  • Forward-view Sarsa($\lambda$)
    $Q(S_t,A_t) \gets Q(S_t,A_t)+\alpha(q^\lambda_t-Q(S_t,A_t))$

Backward View Sarsa($\lambda$)

  • As in TD($\lambda$), we use eligibility traces (in the online case)!
    But Sarsa($\lambda$) has one eligibility trace for each state-action pair:
    $E_0(s,a)=0$
    $E_t(s,a)=\gamma\lambda{E_{t-1}(s,a)}+\mathbf{1}(S_t=s,A_t=a)$
    At every step, the eligibility trace is updated for all $(s,a)$.
  • For TD-error $\delta_t$ and eligibility trace $E_t(s,a)$,
    $\delta_t=R_{t+1}+\gamma{Q(S_{t+1},A_{t+1})}-Q(S_t,A_t)$
    $Q(s,a) \gets Q(s,a)+\alpha\delta_tE_t(s,a)$
  • Sarsa($\lambda$) Algorithm
    Image from: here
    • Note: $Q(s,a)$ and $E(s,a)$ are each $s\times a$ tables! (see the sketch below)
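A sketch of one episode of backward-view Sarsa($\lambda$) with an accumulating eligibility-trace table E of the same shape as Q; the `env` interface is the same assumption as before, and Q is updated in place.

```python
import numpy as np

def sarsa_lambda_episode(env, Q, alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    E = np.zeros_like(Q)                             # one eligibility trace per (s, a)
    n_actions = Q.shape[1]

    def policy(state):                               # eps-greedy w.r.t. the current Q
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))

    state, done = env.reset(), False
    action = policy(state)
    while not done:
        next_state, reward, done = env.step(action)
        next_action = policy(next_state)
        bootstrap = 0.0 if done else Q[next_state, next_action]
        delta = reward + gamma * bootstrap - Q[state, action]   # TD error
        E[state, action] += 1.0                      # accumulate the visited trace
        Q += alpha * delta * E                       # update every (s, a) weighted by E
        E *= gamma * lam                             # decay all traces
        state, action = next_state, next_action
    return Q
```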

Sarsa($\lambda$) Gridworld Example

Image from: here

  • If $Q$ is initialized to all zeros, one-step Sarsa only updates $Q(S', \uparrow)$, where $S'$ is the state directly below the goal (the reward is received only upon reaching the goal).
  • Sarsa($\lambda$) keeps a weight for every state and action in $E(s,a)$ and updates $Q(s,a)$ for all $s, a$ in proportion to these weights, so even though the reward only arrives at the end, all the $Q$ values along the path get updated appropriately.

Off-Policy Learning

  • Evaluate target policy $\pi(a|s)$ to compute $v_\pi(s)$ or $q_\pi(s,a)$ while following behaviour policy $\mu(a|s)$
    $\{S_1,A_1,R_2,...,S_T\}\sim\mu$

Importance Sampling

  • $\mathbb{E}_{X \sim P}[f(X)]=\sum{P(X)f(X)}=\sum{Q(X)\cfrac{P(X)}{Q(X)}f(X)}=\mathbb{E}_{X \sim Q}\left[\cfrac{P(X)}{Q(X)}f(X)\right]$
  • When we want the expectation of $f(X)$ under $P$ but sampling from $P$ is difficult, we can use samples from $Q$ instead (see the numerical check below).
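A tiny numerical check of the identity above: estimate $\mathbb{E}_{X \sim P}[f(X)]$ using samples drawn from $Q$, weighting each sample by $P(X)/Q(X)$. The particular distributions and $f$ here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([0, 1, 2])
p = np.array([0.7, 0.2, 0.1])          # target distribution P (assume hard to sample)
q = np.array([1 / 3, 1 / 3, 1 / 3])    # behaviour distribution Q we can sample from
f = lambda x: x ** 2

samples = rng.choice(values, size=100_000, p=q)      # X ~ Q
weights = p[samples] / q[samples]                    # importance ratios P(X) / Q(X)
is_estimate = np.mean(weights * f(samples))
exact = np.sum(p * f(values))                        # E_{X~P}[f(X)] = 0.6
print(is_estimate, exact)                            # both approximately 0.6
```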

Importance Sampling for Off-Policy Monte-Carlo

  • $G_t$ is generated from $\mu$
  • We want to evaluate $\pi$ using $\mu$
    To do this, we need to compute $G_t^{\pi/\mu}$
  • By importance sampling,
    $G_t^{\pi/\mu}=\cfrac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})\cdots\pi(A_{T-1}|S_{T-1})}{\mu(A_t|S_t)\mu(A_{t+1}|S_{t+1})\cdots\mu(A_{T-1}|S_{T-1})}G_t$
    • Note: probability of a trajectory
      (compute, under each policy, the probability of the trajectory that was sampled with $\mu$)
      • under $\pi$
        $P(A_t,S_{t+1},A_{t+1},...,S_T \mid S_t, A_{t:T-1} \sim \pi)=\prod\limits_{k=t}^{T-1}{\pi(A_k|S_k)P(S_{k+1}|S_k,A_k)}$
      • under $\mu$
        $P(A_t,S_{t+1},A_{t+1},...,S_T \mid S_t, A_{t:T-1} \sim \mu)=\prod\limits_{k=t}^{T-1}{\mu(A_k|S_k)P(S_{k+1}|S_k,A_k)}$
    • $P(S_{k+1}|S_k,A_k)$ is determined by the given trajectory and the environment and does not depend on the policy, so it cancels in the ratio.
      $\therefore G_t^{\pi/\mu}=\cfrac{\prod\limits_{k=t}^{T-1}{\pi(A_k|S_k)}}{\prod\limits_{k=t}^{T-1}{\mu(A_k|S_k)}}G_t$
  • The corrected return $G_t^{\pi/\mu}$ obtained this way can be treated as a return under $\pi$ (a small helper sketch follows this list).
  • Update value towards the corrected return $G_t^{\pi/\mu}$
    $V(S_t) \gets V(S_t)+\alpha(G_t^{\pi/\mu}-V(S_t))$
  • Note:
    $\mathbb{E}[G_t|S_t=s]=v_\mu(s)$
    $\mathbb{E}[G_t^{\pi/\mu}|S_t=s]=v_\pi(s)$
  • Problems:
    • It cannot be used when a $\mu$ term in the denominator is zero.
    • The variance increases dramatically ($\because$ the variance of the ratio is unbounded).
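As referenced in the list above, a small helper computing the corrected return $G_t^{\pi/\mu}$ for one recorded episode; the tuple layout and the `pi(a, s)` / `mu(a, s)` probability functions are assumed names for illustration.

```python
def corrected_return(episode, pi, mu, t, gamma=1.0):
    """Importance-sampled return G_t^{pi/mu} for evaluating pi from an episode
    generated by mu.

    episode[k] = (S_k, A_k, R_{k+1}); pi(a, s) and mu(a, s) return the
    probability of taking action a in state s under the respective policy.
    """
    G, rho = 0.0, 1.0
    for k in range(t, len(episode)):
        s, a, r_next = episode[k]
        rho *= pi(a, s) / mu(a, s)        # product of pi / mu over the remaining steps
        G += gamma ** (k - t) * r_next    # plain return G_t of the sampled trajectory
    return rho * G                        # corrected return G_t^{pi/mu}
```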

Importance Sampling for Off-Policy TD

  • TD targets are generated from $\mu$
  • We want to evaluate $\pi$ using $\mu$
  • For a 1-step TD target, we only need a single importance sampling correction
    TD target from $\mu$: $R_{t+1}+\gamma{V(S_{t+1})}$
    Corrected TD target: $\cfrac{\pi(A_t|S_t)}{\mu(A_t|S_t)}(R_{t+1}+\gamma{V(S_{t+1})})$
  • Update value towards the corrected TD target
    $V(S_t) \gets V(S_t) + \alpha\left({\cfrac{\pi(A_t|S_t)}{\mu(A_t|S_t)}(R_{t+1}+\gamma{V(S_{t+1})})-V(S_t)}\right)$
  • Much lower variance than MC! (the variance still increases, but far less; see the sketch below)
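A sketch of the single corrected TD(0) update above, written as a function applied to one transition; the transition tuple format and the `pi(a, s)` / `mu(a, s)` probability functions are assumptions for illustration.

```python
def off_policy_td0_update(V, transition, pi, mu, alpha=0.1, gamma=1.0):
    """One importance-sampled TD(0) update for evaluating pi from behaviour mu.

    V is a 1-D array of state values; transition = (s, a, r, s_next, done);
    pi(a, s) and mu(a, s) return action probabilities under each policy.
    """
    s, a, r, s_next, done = transition
    rho = pi(a, s) / mu(a, s)                        # single-step importance ratio
    bootstrap = 0.0 if done else V[s_next]
    td_target = rho * (r + gamma * bootstrap)        # corrected TD target
    V[s] += alpha * (td_target - V[s])
    return V
```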

Off-Policy Evaluation by Q-learning

  • Now consider off-policy learning of action-values $Q(s,a)$
    (Policy evaluation by Q-learning)
  • No importance sampling is required
  • The actual next action $A_{t+1}$ is chosen using $\mu$,
    but we also choose an alternative (imaginary) successor action $A'$ using $\pi$ and imagine its outcome.
    "If I took $A'$, what might it look like?"
  • Update $Q(S_t,A_t)$ towards the value of the alternative action
    (the outcome imagined with $\pi$ is closer to $Q$ under policy $\pi$, so we update in that direction)
    $Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha(R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t))$

Off-Policy Control by Q-learning

  • Both the behaviour policy ($\mu$) and the target policy ($\pi$) are improved!
    (Of course, there are cases where the behaviour policy cannot be updated...)
    • The target policy $\pi$ is greedy w.r.t. $Q(s,a)$
      => $\pi(S_{t+1})=\argmax\limits_{a'}{Q(S_{t+1}, a')}$
      • The Q-learning target then simplifies:
        $R_{t+1}+ \gamma Q(S_{t+1}, A')$
        $=R_{t+1}+ \gamma Q(S_{t+1}, \argmax\limits_{a'}{Q(S_{t+1}, a')})$
        $=R_{t+1}+ \gamma\max\limits_{a'} Q(S_{t+1}, a')$
      • $Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha\left(R_{t+1}+\gamma\max\limits_{a'}{Q(S_{t+1}, a')}-Q(S_t,A_t)\right)$
    • The behaviour policy $\mu$ is e.g. $\epsilon$-greedy w.r.t. $Q(s,a)$,
      to make sure we visit all possible pairs of $(s, a)$

Theorem
Q-learning control converges to the optimal action-value function, i.e., $Q(s,a) \to q_*(s,a)$

  • Q-Learning Algorithm for off-policy control
    Image from: here
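A minimal tabular Q-learning sketch: the behaviour policy is $\epsilon$-greedy w.r.t. $Q$, while the greedy target policy appears only implicitly through the max in the target (same assumed `env` interface as before).

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=10_000,
               alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(seed)

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # behaviour policy mu: eps-greedy w.r.t. Q
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # target policy pi: greedy, hence the max over next actions
            bootstrap = 0.0 if done else np.max(Q[next_state])
            Q[state, action] += alpha * (reward + gamma * bootstrap - Q[state, action])
            state = next_state
    return Q
```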

Relationship Between DP and TD

  • Let $X \xleftarrow{\alpha} Y \equiv X \gets X + \alpha(Y-X)$
    Image from: here
  • Note:
    • In on-policy control with Sarsa, Sarsa is only used for the Q policy-evaluation step; the improvement step takes $\epsilon$-greedy actions with respect to the $Q$ obtained by Sarsa.
    • In contrast, in off-policy control with Q-learning, the actions are not taken from the policy $\pi$ that we evaluate and improve ($\mu$ supplies the actions), so there is no need to maintain an explicit policy.
      => Analogously to value iteration, a greedy policy is implicitly chosen at every step.

If you spot any typos or mistakes, please let me know in the comments!
