Silver RL (4) Model-Free Prediction

Sanghyeok Choi · December 28, 2021

This post summarizes Lecture 4: Model-Free Prediction (Youtube) from Professor David Silver's Introduction to Reinforcement Learning (Website).

Introduction

  • So far, we assumed the environment (MDP) was known.
    That is, for a given $\pi$, the transition matrix $\mathcal{P}^\pi$ and reward $\mathcal{R}^\pi$ were known, and we could use them to compute $\mathrm{v}_\pi$ and $\mathrm{q}_\pi$. (Lec 3: "Planning" by DP)
  • From now on, the environment (MDP) is initially unknown!
    That is, since $\mathcal{P}^\pi$ and $\mathcal{R}^\pi$ are unknown, we need another way to find $\mathrm{v}_\pi$ and $\mathrm{q}_\pi$.
    => Model-Free Reinforcement Learning
  • Note: POMDP vs. initially unknown MDP
    • In a POMDP, the current state cannot be observed exactly. (It is described probabilistically via a belief state.)
    • In an initially unknown MDP, we simply cannot predict the future exactly; the current state can still be observed.
  • Model-free prediction
    • Estimate the value function of an unknown MDP (the policy is given!)
    • Corresponds to Policy Evaluation from the previous lecture (Planning)
  • Model-free control (next lecture)
    • Optimize the value function of an unknown MDP
    • Corresponds to Policy Improvement from the previous lecture

Methods for Model-Free Prediction

  • Monte-Carlo Learning (MC)
    • Go all the way to the end of a trajectory, and estimate the value by looking at sample returns!
    • Inefficient but powerful
  • Temporal-Difference Learning (TD)
    • Go one step ahead and estimate the return from that step
    • More efficient
  • TD($\lambda$)
    • Unifies MC and TD

Monte-Carlo Reinforcement Learning

  • Learn from complete episodes (No bootstrapping)
  • Assume that value = empirical mean return
  • Can be applied to episodic MDPs (all episodes must terminate)

Monte-Carlo Policy Evaluation

  • A policy $\pi$ is given
  • Learn $\mathrm{v}_\pi$ from the agent's episodes of experience under $\pi$
    $S_1, A_1, R_2, ..., S_k \sim \pi$
  • Recall,
    • Return (total discounted reward):
      $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... + \gamma^{T-t-1} R_T$
      (Since we have the episode, this value can actually be computed!)
    • Value function (the expected return):
      $v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$
  • MC uses the empirical mean return instead of the expected return

First-Visit Monte-Carlo Policy Evaluation

  • Suppose we have a set of episodes.
  • To evaluate state s,
    • For episode in episodes:
      • Let t be the first time-step at which state s is visited in the episode
        (s may be visited again after t, but we only consider t!)
      • Increment counter $N(s) \gets N(s) + 1$
      • Increment total return $S(s) \gets S(s) + G_t$
    • Value is estimated by $V(s) = S(s)/N(s)$
    • By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$

Every-Visit Monte-Carlo Policy Evaluation

  • Suppose we have a set of episodes.
  • To evaluate state s,
    • For episode in episodes:
      • Let t be every time-step at which state s is visited in the episode
        (If s is visited several times within one episode, it gets updated several times.)
      • Increment counter $N(s) \gets N(s) + 1$
      • Increment total return $S(s) \gets S(s) + G_t$
    • Value is estimated by $V(s) = S(s)/N(s)$
    • Again, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
      (a code sketch of both variants follows below)
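
The two procedures differ only in which visits of s contribute a return. Below is a minimal sketch of my own (not from the lecture), assuming each episode is given as a list of (state, reward) pairs where the reward is the one received after leaving that state; the first_visit flag is a hypothetical switch between the two variants.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) from complete episodes under a fixed policy.
    episodes: list of trajectories, each a list of (state, reward) pairs."""
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # accumulated returns S(s)

    for episode in episodes:
        # Backwards pass: compute the return G_t for every time-step t.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G

        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # first-visit MC: only the first occurrence of s counts
            seen.add(state)
            N[state] += 1
            S[state] += returns[t]

    return {s: S[s] / N[s] for s in N}  # V(s) = S(s) / N(s)
```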

Example: Blackjack Value Function after Monte-Carlo Learning

  • States:
    • Current sum (12-21) ... if the sum is below 12, it is always better to take another card
    • Dealer's showing card (ace-10)
    • Do I have a "useable" ace? (yes-no)
  • Action:
    • Stick: stop receiving cards (and terminate)
    • Twist: take another card
  • Reward:
    • Reward for stick:
      • +1 if sum of cards > sum of dealer cards
      • 0 if sum of cards = sum of dealer cards
      • -1 if sum of cards < sum of dealer cards
    • Reward for twist:
      • -1 if sum of cards > 21 (and terminate)
      • 0 otherwise (and select action again!)
  • Simple Policy: stick if the sum of cards is $\geq 20$, otherwise twist.
  • The value function under this policy, estimated by MC, is shown in the figure below.

Image from: here

Incremental Mean

  • $\mu_k = \frac{1}{k}\sum\limits_{j=1}^{k} x_j = \frac{1}{k}\Big(x_k + \sum\limits_{j=1}^{k-1} x_j\Big) = \frac{1}{k}\big(x_k + (k-1)\mu_{k-1}\big) = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$

    • $\mu_{k-1}$: previous mean estimate
    • $\frac{1}{k}$: step size
    • $x_k - \mu_{k-1}$: error term (the difference between the actual $x_k$ and the previous estimate $\mu_{k-1}$)
    • $\therefore$ correct the estimate in the direction of the error! (checked numerically in the snippet below)
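
As a quick sanity check (mine, not from the lecture), the following snippet verifies that the incremental form reproduces the batch mean:

```python
import random

xs = [random.gauss(0.0, 1.0) for _ in range(1000)]

mu = 0.0
for k, x in enumerate(xs, start=1):
    mu += (x - mu) / k          # mu_k = mu_{k-1} + (1/k) * (x_k - mu_{k-1})

assert abs(mu - sum(xs) / len(xs)) < 1e-9   # matches the batch mean
```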

Incremental Monte-Carlo Updates

  • Previously, we collected a set of episodes and then updated all at once!
  • Now, update V(s) after every episode, incrementally (see the sketch after this list)
    • For each state $S_t$ with return $G_t$
      • $N(S_t) \gets N(S_t) + 1$
      • $V(S_t) \gets V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t))$
    • In non-stationary problems, it can be useful to track a running mean, i.e., forget old episodes
      (In RL the policy keeps being updated, so a running mean that weights recent values more heavily is a better fit!)
      • $V(S_t) \gets V(S_t) + \alpha(G_t - V(S_t))$
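
A minimal sketch of the incremental every-visit update after one finished episode; the episode format and the collect_episodes data source in the usage comment are my assumptions. Passing alpha=None uses the 1/N(S_t) step size, while a constant alpha gives the running-mean variant.

```python
from collections import defaultdict

def mc_update(V, N, episode, gamma=1.0, alpha=None):
    """Incremental every-visit MC update from one complete episode.
    episode: list of (state, reward) pairs, in time order."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G                     # return observed from this state onward
        N[state] += 1
        step = alpha if alpha is not None else 1.0 / N[state]
        V[state] += step * (G - V[state])          # V(S_t) <- V(S_t) + step * (G_t - V(S_t))

# Hypothetical usage:
# V, N = defaultdict(float), defaultdict(int)
# for episode in collect_episodes(policy):         # assumed data source
#     mc_update(V, N, episode, gamma=0.9, alpha=0.05)
```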

Temporal-Difference Learning

  • Learn directly from incomplete episodes by bootstrapping
    • Bootstrapping: Substituting the remainder of the trajectory with an estimate of what would happen from that point onward.
  • Updates a guess towards a guess

MC and TD

  • Goal: learn $\mathrm{v}_\pi$ online from experience under policy $\pi$
  • Incremental every-visit MC:
    • $V(S_t) \gets V(S_t) + \alpha(G_t - V(S_t))$
      Update $V(S_t)$ toward $G_t$ (actual return)
  • Simplest TD, TD(0):
    • $V(S_t) \gets V(S_t) + \alpha\big[(R_{t+1} + \gamma V(S_{t+1})) - V(S_t)\big]$
      Update $V(S_t)$ toward $R_{t+1} + \gamma V(S_{t+1})$ (estimated return)
    • $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
    • $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
    • The TD target is a one-step more accurate estimate than $V(S_t)$! (see the sketch below)
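
For contrast with the MC update, here is a minimal sketch of a TD(0) update applied to a single observed transition (s, r, s_next); the dictionary-based V and the done flag are my assumptions.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, done=False):
    """One TD(0) update from a single transition; V is a dict of state values."""
    td_target = r + (0.0 if done else gamma * V.get(s_next, 0.0))  # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V.get(s, 0.0)                           # delta_t
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```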

Advantages and Disadvantages of MC vs. TD

  • TD can learn before knowing the final outcome
    MC must wait until the end of the episode before the return is known
  • TD can learn without the final outcome (can learn from incomplete sequences)
    TD works in continuing environments
    MC can only learn from complete sequences
    MC only works for episodic (terminating) environments
  • Bias/Variance Trade-off
    • Bias: MC < TD
      • $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$
        (MC has zero bias => good convergence properties)
      • The TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$
        (Biased toward the previous $V$ => TD is sensitive to the initial value)
    • Variance: MC > TD
      • $G_t$ depends on many random actions, transitions, and rewards
        (The more random variables we sum, the larger the variance)
      • The TD target depends on one random action, transition, and reward
        (Lower variance => more efficient!)

Example: Random Walk, MC vs. TD

Image from: here

  • RMS error averaged over states: $\sqrt{\mathbb{E}\big[(\mathrm{v}_{\text{estimated}} - \mathrm{v}_{\text{true}})^2\big]}$
  • We can see that TD converges faster. (TD is more efficient!)

Batch MC and TD

  • MC and TD converge: $V(s) \to v_\pi(s)$ as #episodes $\to \infty$
  • What about a finite number of episodes?
    • (Non-incremental) MC converges to the solution with minimum mean-squared error
      • Best fit to the observed returns
      • $\sum\limits_{k=1}^{K}\sum\limits_{t=1}^{T_k}\big(G_t^k - V(s_t^k)\big)^2$ ... minimize!
    • TD(0) converges to the solution of the maximum-likelihood Markov model
      • The solution to the MDP $\langle \mathcal{S}, \mathcal{A}, \hat{\mathcal{P}}, \hat{\mathcal{R}}, \gamma \rangle$ that best fits the data
      • $\hat{\mathcal{P}}^a_{ss'} = \frac{1}{N(s,a)}\sum\limits_{k=1}^{K}\sum\limits_{t=1}^{T_k}\mathbf{1}(s_t^k, a_t^k, s_{t+1}^k = s, a, s')$
        Transition probability = relative frequency
      • $\hat{\mathcal{R}}^a_s = \frac{1}{N(s,a)}\sum\limits_{k=1}^{K}\sum\limits_{t=1}^{T_k}\mathbf{1}(s_t^k, a_t^k = s, a)\, r_t^k$
        Reward: the average reward observed for each $(s,a)$ pair over the $K$ episodes
      • In other words, TD implicitly builds the MDP from $\hat{\mathcal{P}}^a_{ss'}$ and $\hat{\mathcal{R}}^a_s$! (see the count-based sketch after this list)
  • MC vs. TD
    • TD exploits the Markov property (more efficient in Markov environments)
    • MC does not exploit the Markov property (more effective in non-Markov environments)
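
A count-based sketch (my own, assuming episodes are given as lists of (s, a, r, s') transitions) of the maximum-likelihood model that batch TD(0) implicitly fits:

```python
from collections import defaultdict

def fit_mle_model(episodes):
    """P_hat[(s, a)][s']: relative transition frequencies; R_hat[(s, a)]: mean reward."""
    n = defaultdict(int)
    trans = defaultdict(lambda: defaultdict(int))
    rew_sum = defaultdict(float)

    for episode in episodes:
        for s, a, r, s_next in episode:
            n[(s, a)] += 1
            trans[(s, a)][s_next] += 1
            rew_sum[(s, a)] += r

    P_hat = {sa: {s2: c / n[sa] for s2, c in nexts.items()}
             for sa, nexts in trans.items()}
    R_hat = {sa: rew_sum[sa] / n[sa] for sa in n}
    return P_hat, R_hat
```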

Unified View

Bootstrapping and Sampling

  • Bootstrapping: update involves an estimate

    • MC does not bootstrap (MC uses actual return)
    • DP bootstraps
    • TD bootstraps
  • Sampling: update samples an expectation

    • MC samples
    • DP does not sample (DP considers all possibilities)
    • TD samples
  • Unified View of Reinforcement Learning

Image from: here

TD($\lambda$)

n-Step Target

  • The TD target we saw earlier: $R_{t+1} + \gamma V(S_{t+1})$
    This is a 1-step target that only looks one state ahead of the current state.
  • Let TD target look n steps into the future!

Image from: here

  • n-step TD takes a trajectory of n steps from the current state as its sample.
    MC (= $\infty$-step TD) takes the trajectory all the way to termination as its sample.

n-Step Return

  • $n = 1 \quad G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$
    $n = 2 \quad G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
    $\quad\vdots$
    $n = \infty \quad G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-t-1} R_T$
  • Define the n-step return
    $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$
    This becomes the n-step TD target! (a sketch of this computation follows below)
  • n-step TD learning
    $V(S_t) \gets V(S_t) + \alpha\big(G_t^{(n)} - V(S_t)\big)$
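
A minimal sketch of computing $G_t^{(n)}$ from a recorded episode; the indexing convention (rewards[k] holds $R_{k+1}$, states[k] holds $S_k$) is my assumption, and the function falls back to the plain Monte-Carlo return when t + n runs past the end of the episode.

```python
def n_step_return(rewards, V, states, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})."""
    T = len(rewards)                                   # episode length (number of transitions)
    steps = min(n, T - t)
    G = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + n < T:                                      # bootstrap only if S_{t+n} is non-terminal
        G += gamma ** n * V.get(states[t + n], 0.0)
    return G
```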

Large Random Walk Example

  • What is the best value for n?

Image from: here

  • It depends on the problem and on the experimental setting.
    (The optimal n is not a robust value!)

Averaging n-Step Returns

  • Could we compute returns for several values of n and average them?
    • e.g. average the 2-step and 4-step returns
      $\frac{1}{2}G^{(2)} + \frac{1}{2}G^{(4)}$
    • Combines information from two different time-steps!

$\lambda$-return

  • At time t, let's combine the returns for all n, each multiplied by an appropriate weight!
  • The $\lambda$-return $G_t^\lambda$ combines all n-step returns $G_t^{(n)}$, using weight $(1-\lambda)\lambda^{n-1}$
  • $G_t^\lambda = (1-\lambda)\sum\limits_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}$
  • Note: $(1-\lambda)\sum\limits_{n=1}^{T-t-1}\lambda^{n-1} = (1-\lambda)\frac{1-\lambda^{T-t-1}}{1-\lambda} = 1-\lambda^{T-t-1} \approx 1$
    => The $\lambda$-return is a weighted average of $G_t^{(1)}, G_t^{(2)}, ..., G_t^{(T-t-1)}$!
    => To make the weights sum to exactly 1, add $\lambda^{T-t-1}G_t$ (the full return) as the final term of the expression for $G_t^\lambda$. (see the sketch below)
  • Note 2: $\lambda = 0$ gives TD(0), and $\lambda = 1$ gives MC
    $\lambda$ is a hyperparameter; a good value has to be found for each problem.
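
Reusing the n_step_return helper sketched earlier (same indexing assumptions), the $\lambda$-return for an episodic task can be computed as below, with the leftover weight $\lambda^{T-t-1}$ placed on the full Monte-Carlo return:

```python
def lambda_return(rewards, V, states, t, lam, gamma=1.0):
    """G_t^lambda as a weighted average of n-step returns (episodic case).
    Assumes n_step_return(...) from the n-step sketch above."""
    T = len(rewards)
    G_lam, weight_total = 0.0, 0.0
    for n in range(1, T - t):                          # n = 1, ..., T-t-1
        w = (1 - lam) * lam ** (n - 1)
        G_lam += w * n_step_return(rewards, V, states, t, n, gamma)
        weight_total += w
    G_full = sum(gamma ** k * rewards[t + k] for k in range(T - t))  # full return G_t
    G_lam += (1.0 - weight_total) * G_full             # remaining weight lambda^(T-t-1)
    return G_lam
```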

Image from: here

Forward View of TD($\lambda$)

Image from: here

  • $V(S_t) \gets V(S_t) + \alpha\big(G_t^\lambda - V(S_t)\big)$
    where $G_t^\lambda = (1-\lambda)\sum\limits_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}$
  • Update the value function towards the $\lambda$-return
  • Disadvantages
    • Like MC, it can only be computed from complete episodes
    • Inefficient, because returns involving far-away states are computed many times!

Backward View of TD($\lambda$)

  • The return $G_t^{(n)}$ from far-future states is reflected only weakly in the current update.
    (weight: $(1-\lambda)\lambda^{n-1}$; the larger n is, the smaller the weight.)
  • Conversely, instead of updating the current state with future returns, we can view it as updating past states with the current return. => Backward View of TD($\lambda$)
    • Here, states visited in the recent past should be updated more strongly, and states visited further in the past more weakly.
    • This is implemented with eligibility traces.
  • An incremental mechanism for approximating the forward-view TD($\lambda$)
  • We can update online, every step, from incomplete sequences
    Recall that the forward view of TD($\lambda$) requires complete episodes!

Eligibility Traces

  • When updating the past from the present, for a given state s:
    • 1) Frequency heuristic: assign more weight if s has been visited many times
    • 2) Recency heuristic: assign more weight if s has been visited recently
  • Eligibility traces combine both heuristics!
  • $\forall s \in \mathcal{S}$,
    • $E_0(s) = 0$
    • $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$

Image from: here

Backward View TD($\lambda$)

  • At every step, update the eligibility trace for every state s.
  • Notation
    $E_t(s)$: eligibility trace of state s at time t
    $\delta_t$: TD error (based on the 1-step target)
  • $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
  • $V(s) \gets V(s) + \alpha\delta_t E_t(s), \quad \forall s \in \mathcal{S}$
  • Thanks to $E_t$, the current TD error (from the 1-step target) updates every state s by an appropriately scaled amount! (a sketch follows after this list)
  • TD($\lambda$) and TD(0)
    With $\lambda = 0$ in the equation for $E_t$ (i.e., no trace of past visits is kept), this is identical to TD(0).
    (That is why it is called TD(0)!)
    => only the state visited at time t is updated
  • TD($\lambda$) and MC
    With $\lambda = 1$ (so $\gamma$ is the only discounting factor) and offline updates, it is equivalent to MC.
    (i.e., TD(1) = MC when TD updates the value function offline)
    => credit is deferred until the end of the episode
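
A minimal sketch of backward-view TD($\lambda$) with accumulating eligibility traces; env and policy are my assumptions with a gym-like interface (env.reset() -> state, env.step(a) -> (next_state, reward, done), policy(state) -> action), not part of the lecture.

```python
from collections import defaultdict

def backward_td_lambda(env, policy, num_episodes, lam=0.8, alpha=0.1, gamma=1.0):
    """Online backward-view TD(lambda) policy evaluation."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)                         # eligibility traces, reset per episode
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error delta_t
            E[s] += 1.0                                # E_t(s) = gamma*lambda*E_{t-1}(s) + 1(S_t = s)
            for state in list(E):                      # update *every* traced state
                V[state] += alpha * delta * E[state]
                E[state] *= gamma * lam                # decay traces for the next step
            s = s_next
    return V
```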

Relationship Between Forward and Backward TD

Theorem
The sum of offline updates is identical for forward-view and backward-view TD($\lambda$).
For a given state $s$,
$\sum\limits_{t=1}^{T}\alpha\delta_t E_t(s) = \sum\limits_{t=1}^{T}\alpha\big(G_t^\lambda - V(S_t)\big)\mathbf{1}(S_t = s)$
The left-hand side is the backward view of TD($\lambda$); the right-hand side is the forward view.
See slides 44-48 for details!

  • (offline) TD(1) is exactly equivalent to every-visit MC
    (online) TD(1) is roughly equivalent to every-visit MC
  • Note: Exact Online TD($\lambda$) achieves perfect equivalence by using a slightly different form of eligibility trace (van Seijen and Sutton, ICML 2014)

Summary of Forward and Backward TD($\lambda$)

Image from: here


If you spot any typos or mistakes, please let me know in the comments!
