These notes summarize Lecture 4: Model-Free Prediction (Youtube) from Professor David Silver's Introduction to Reinforcement Learning course (Website).
Introduction
So far we assumed that the environment (MDP) is known.
That is, for a given $\pi$, the transition matrix $\mathcal{P}^\pi$ and reward $\mathcal{R}^\pi$ were known, and we could use them to compute $\mathrm{v}_\pi$ and $\mathrm{q}_\pi$. (Lec3: "Planning" by DP)
From now on, the environment (MDP) is initially unknown!
That is, since $\mathcal{P}^\pi$ and $\mathcal{R}^\pi$ are unknown, we need a different way to find $\mathrm{v}_\pi$ and $\mathrm{q}_\pi$.
=> Model-Free Reinforcement Learning
Note: POMDP vs. initially unknown MDP
In a POMDP, the current state cannot be observed exactly. (It is represented probabilistically as a belief state.)
In an initially unknown MDP, we simply cannot predict the future exactly; the current state can still be observed.
Model-free prediction
Estimate the value function of an unknown MDP (Policy is given!)
Corresponds to Policy Evaluation from the last lecture (Planning)
Model-free control (next lecture)
Optimize the value function of an unknown MDP
Corresponds to Policy Improvement from the last lecture
Methods for Model-free prediction
Monte-Carlo Learning (MC)
Go all the way to the end of a trajectory, and estimate the value by looking at sample returns!
Inefficient but powerful
Temporal-Difference Learning (TD)
Go one step ahead and estimate the return from that step
More efficient
TD($\lambda$)
Monte-Carlo Reinforcement Learning
Learn from complete episodes (No bootstrapping)
Assume that value = empirical mean return
Can be applied to episodic MDPs (all episodes must terminate)
Monte-Carlo Policy Evaluation
A policy $\pi$ is given
Learn $\mathrm{v}_\pi$ from the agent's episodes of experience under $\pi$
$S_1, A_1, R_2, ..., S_k \sim \pi$
Recall,
Return (total discounted reward):
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... + \gamma^{T-t-1} R_T$
(Since we have the episode, this value can be computed!)
Value function (the expected return):
$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$
MC uses empirical mean return instead of expected return
First-Visit Monte-Carlo Policy Evaluation
Suppose we have several episodes.
To evaluate state s,
For episode in episodes:
Let t be the first time-step at which state s is visited in an episode
(s may be visited again after t, but we only consider t!)
Increment counter $N(s) \gets N(s) + 1$
Increment total return $S(s) \gets S(s) + G_t$
Value is estimated by $V(s) = S(s)/N(s)$
By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
Every-Visit Monte-Carlo Policy Evaluation
Suppose we have several episodes.
To evaluate state s,
For episode in episodes:
Let t be every time-step at which state s is visited in an episode
(If s is visited multiple times in a single episode, it can be updated multiple times.)
Increment counter $N(s) \gets N(s) + 1$
Increment total return $S(s) \gets S(s) + G_t$
Value is estimated by $V(s) = S(s)/N(s)$
Again, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
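As a concrete reference, here is a minimal tabular sketch of both variants in Python (my own illustration, not from the lecture). It assumes each episode is stored as a list of `(state, reward)` pairs, where the reward is $R_{t+1}$, the reward received after leaving that state.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Tabular MC prediction; set first_visit=False for every-visit MC."""
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # accumulated returns S(s)

    for episode in episodes:
        # Compute the return G_t for every time-step by scanning backwards.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]          # reward here is R_{t+1}
            G = reward + gamma * G
            returns[t] = G

        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue                    # first-visit: count only the first visit
            seen.add(state)
            N[state] += 1
            S[state] += returns[t]

    return {s: S[s] / N[s] for s in N}      # V(s) = S(s) / N(s)
```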
Example: Blackjack Value Function after Monte-Carlo Learning
States:
Current sum (12-21) ... when the sum is below 12, it is always better to take another card
Dealer's showing card (ace-10)
Do I have a "useable" ace? (yes-no)
Action:
Stick: stop receiving cards (and terminate)
Twist: take another card
Reward:
Reward for stick:
+1 if sum of cards > sum of dealer cards
0 if sum of cards = sum of dealer cards
-1 if sum of cards < sum of dealer cards
Reward for twist:
-1 if sum of cards > 21 (and terminate)
0 otherwise (and select action again!)
Simple Policy: stick if sum of cards $\geq$ 20, otherwise twist.
The value function obtained by MC under this policy is shown in the figure below.
Image from: here
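As a small illustration of the fixed policy above, here is a hedged sketch of mine that assumes Gymnasium's `Blackjack-v1` environment (not the lecture's own simulator; there, action 0 = stick and 1 = hit/"twist") and collects episodes in the `(state, reward)` format used by the MC sketch above:

```python
import gymnasium as gym  # assumption: the Gymnasium package provides Blackjack-v1

STICK, TWIST = 0, 1      # Gymnasium's Blackjack actions: 0 = stick, 1 = hit ("twist")

def simple_policy(observation):
    """Stick once the player's sum reaches 20, otherwise twist."""
    player_sum, dealer_showing, usable_ace = observation
    return STICK if player_sum >= 20 else TWIST

def sample_episode(env):
    """Roll out one episode under the fixed policy as (state, reward) pairs."""
    episode, done = [], False
    obs, _ = env.reset()
    while not done:
        next_obs, reward, terminated, truncated, _ = env.step(simple_policy(obs))
        episode.append((obs, reward))
        obs, done = next_obs, terminated or truncated
    return episode

env = gym.make("Blackjack-v1")
episodes = [sample_episode(env) for _ in range(10_000)]
V = mc_policy_evaluation(episodes)  # reuse the MC sketch above
```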
Incremental Mean
The mean $\mu_1, \mu_2, ...$ of a sequence $x_1, x_2, ...$ can be computed incrementally: $\mu_k = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)$
Incremental Monte-Carlo Updates
Previously, we updated all at once after collecting a batch of episodes!
Now, update V(s) every episode incrementally
For each state $S_t$ with return $G_t$
$N(S_t) \gets N(S_t) + 1$
$V(S_t) \gets V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$
In non-stationary problems, it can be useful to track a running mean , i.e., forget old episodes
(In RL the policy keeps being updated, so a running mean that weights recent values more heavily is a better fit!)
$V(S_t) \gets V(S_t) + \alpha\left(G_t - V(S_t)\right)$
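A minimal sketch of the incremental update (my own; `alpha=None` falls back to the exact $1/N(s)$ mean, while a constant `alpha` gives the running mean for non-stationary problems):

```python
def incremental_mc_update(V, N, state, G, alpha=None):
    """Update V(state) toward the observed return G from one episode."""
    N[state] = N.get(state, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[state]   # 1/N(s) or constant alpha
    v = V.get(state, 0.0)
    V[state] = v + step * (G - v)                           # V <- V + step * (G - V)
```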
Temporal-Difference Learning
Learn directly from incomplete episodes by bootstrapping
Bootstrapping : Substituting the remainder of the trajectory with an estimate of what would happen from that point onward.
Updates a guess towards a guess
MC and TD
Goal: Learn $\mathrm{v}_\pi$ online from experience under policy $\pi$
Incremental every-visit MC:
$V(S_t) \gets V(S_t) + \alpha\left(G_t - V(S_t)\right)$
Update $V(S_t)$ toward $G_t$ (the actual return)
Simplest TD, TD(0):
$V(S_t) \gets V(S_t) + \alpha\left[\left(R_{t+1} + \gamma V(S_{t+1})\right) - V(S_t)\right]$
Update $V(S_t)$ toward $R_{t+1} + \gamma V(S_{t+1})$ (the estimated return)
$R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
The TD target is a one-step-more-accurate estimate than $V(S_t)$!
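A minimal TD(0) sketch (my own, assuming a Gymnasium-style `env.reset`/`env.step` interface and a `policy(state)` function supplied by the reader):

```python
from collections import defaultdict

def td0_episode(env, policy, V=None, alpha=0.1, gamma=1.0):
    """Run one episode of TD(0) prediction, updating V online at every step."""
    V = V if V is not None else defaultdict(float)
    state, _ = env.reset()
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(policy(state))
        done = terminated or truncated
        # TD target R_{t+1} + gamma * V(S_{t+1}); a terminal next state has value 0.
        td_target = reward + (0.0 if terminated else gamma * V[next_state])
        V[state] += alpha * (td_target - V[state])  # move V(S_t) toward the TD target
        state = next_state
    return V
```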
Advantages and Disadvantages of MC vs. TD
TD can learn before knowing the final outcome
MC must wait until the end of the episode before the return is known
TD can learn without the final outcome (can learn from incomplete sequences)
TD works in continuing environments
MC can only learn from complete sequences
MC only works for episodic (terminating) environments
Bias/Variance Trade-off
Bias: MC < TD
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$
(MC has zero bias => good convergence properties)
TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$
(Biased toward the previous $V$ => TD is sensitive to the initial values)
Variance: MC > TD
$G_t$ depends on many random actions, transitions, and rewards
(The more random variables you sum, the larger the variance)
TD target depends on one random action, transition, and reward
(Lower variance => more efficient!)
Example: Random Walk, MC vs. TD
Image from: here
RMS error averaged over states: $\sqrt{\mathbb{E}\left[\left(\mathrm{v}_{estimated} - \mathrm{v}_{true}\right)^2\right]}$
You can see that TD converges faster. (TD is more efficient!)
Batch MC and TD
MC and TD converge: $V(s) \to v_\pi(s)$ as #episodes $\to \infty$
What about a finite number of episodes?
(non-incremental) MC converges to solution with minimum MSE
Best fit to the observed returns
$\sum\limits_{k=1}^{K}\sum\limits_{t=1}^{T_k}\left(G_t^k - V(s_t^k)\right)^2$ ... minimize!
TD(0) converges to solution of maximum likelihood Markov model
Solution to the MDP $\langle \mathcal{S}, \mathcal{A}, \hat{\mathcal{P}}, \hat{\mathcal{R}}, \gamma \rangle$ that best fits the data
$\hat{\mathcal{P}}^a_{ss'} = \frac{1}{N(s,a)}\sum\limits_{k=1}^{K}\sum\limits_{t=1}^{T_k}\mathbf{1}(s_t^k, a_t^k, s_{t+1}^k = s, a, s')$
Transition Prob = Relative frequency
$\hat{\mathcal{R}}^a_{ss'} = \frac{1}{N(s,a)}\sum\limits_{k=1}^{K}\sum\limits_{t=1}^{T_k}\mathbf{1}(s_t^k, a_t^k = s, a)\, r_t^k$
Reward: the average reward observed for the pair $(s,a)$ across the K episodes
In other words, TD implicitly builds the MDP with $\hat{\mathcal{P}}^a_{ss'}$ and $\hat{\mathcal{R}}^a_{ss'}$!
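A minimal sketch of the implicit model that batch TD fits (my own, assuming episodes are available as lists of `(s, a, r, s_next)` transitions):

```python
from collections import defaultdict

def estimate_mdp_model(episodes):
    """Maximum-likelihood model: P_hat[(s, a)][s'] = relative frequency,
    R_hat[(s, a)] = mean observed reward."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    n_sa = defaultdict(int)

    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[(s, a)][s_next] += 1
            reward_sum[(s, a)] += r
            n_sa[(s, a)] += 1

    P_hat = {sa: {s2: c / n_sa[sa] for s2, c in succ.items()}
             for sa, succ in counts.items()}
    R_hat = {sa: reward_sum[sa] / n_sa[sa] for sa in n_sa}
    return P_hat, R_hat
```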
MC vs. TD
TD exploits the Markov property (more efficient in Markov environments)
MC does not exploit the Markov property (more effective in non-Markov environments)
Unified View
Bootstrapping and Sampling
Bootstrapping: update involves an estimate
MC does not bootstrap (MC uses actual return)
DP bootstraps
TD bootstraps
Sampling: update samples an expectation
MC samples
DP does not sample (DP considers all possibilities)
TD samples
Unified View of Reinforcement Learning
Image from: here
TD($\lambda$)
n-Step Target
The TD target we saw earlier: $R_{t+1} + \gamma V(S_{t+1})$
This is a 1-step target that only looks one state ahead of the current state.
Let TD target look n steps into the future!
Image from: here
n-step TD treats a rollout of n steps from the current state as the sample.
MC (= $\infty$-step TD) treats the rollout all the way to termination as the sample.
n-Step Return
$n=1 \quad\ G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$
$n=2 \quad\ G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
$\quad\vdots \qquad\qquad \vdots$
$n=\infty \quad G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-t-1} R_T$
Define the n-step return
$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$
This becomes the n-step TD target!
n-step TD learning
$V(S_t) \gets V(S_t) + \alpha\left(G_t^{(n)} - V(S_t)\right)$
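A minimal n-step sketch (my own), with episodes stored as `(state, reward)` pairs as in the MC sketch above:

```python
def n_step_return(episode, t, n, V, gamma=1.0):
    """G_t^(n) = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n}).
    If the episode ends within n steps, this reduces to the full (MC) return."""
    T = len(episode)
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        _, reward = episode[k]             # episode[k] holds (S_k, R_{k+1})
        G += discount * reward
        discount *= gamma
    if t + n < T:                          # bootstrap from V(S_{t+n}) if not terminal
        state_tn, _ = episode[t + n]
        G += discount * V.get(state_tn, 0.0)
    return G

def n_step_td_update(V, episode, t, n, alpha=0.1, gamma=1.0):
    """V(S_t) <- V(S_t) + alpha * (G_t^(n) - V(S_t))."""
    state_t, _ = episode[t]
    v = V.get(state_t, 0.0)
    V[state_t] = v + alpha * (n_step_return(episode, t, n, V, gamma) - v)
```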
Large Random Walk Example
What is the best value for n?
Image from: here
It depends on the problem and on the setting.
(The optimal n is not a robust value!)
Averaging n-Step Returns
Couldn't we compute returns for several values of n and average them?
e.g. average the 2-step and 4-step returns
$\frac{1}{2}G^{(2)} + \frac{1}{2}G^{(4)}$
Combines information from two different time-steps!
$\lambda$-return
At time t, let's combine the returns for all n, each multiplied by an appropriate weight!
The $\lambda$-return $G_t^\lambda$ combines all n-step returns $G_t^{(n)}$, using weight $(1-\lambda)\lambda^{n-1}$
$G_t^\lambda = (1-\lambda)\sum\limits_{n=1}^{\infty}\lambda^{n-1} G_t^{(n)}$
Note: $(1-\lambda)\sum\limits_{n=1}^{T-t-1}\lambda^{n-1} = (1-\lambda)\frac{1-\lambda^{T-t-1}}{1-\lambda} = 1-\lambda^{T-t-1} \approx 1$
=> The $\lambda$-return is a weighted average of $G_t^{(1)}, G_t^{(2)}, ..., G_t^{(T-t-1)}$!
=> To make the weights sum to exactly 1, add the term $\lambda^{T-t-1} G_t$ (the full return, which receives all the remaining weight) at the end of the formula for $G_t^\lambda$.
Note 2: $\lambda = 0$ gives TD(0), and $\lambda = 1$ gives MC
$\lambda$ is a hyperparameter; the optimal value has to be found.
Image from: here
Forward View of TD($\lambda$)
Image from: here
$V(S_t) \gets V(S_t) + \alpha\left(G_t^{\lambda} - V(S_t)\right)$
where $G_t^{\lambda} = (1-\lambda)\sum\limits_{n=1}^{\infty}\lambda^{n-1} G_t^{(n)}$
Update the value function towards the $\lambda$-return
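A minimal forward-view sketch (my own), reusing the `n_step_return` helper above and truncating the infinite sum at the end of the episode, with the leftover weight $\lambda^{T-t-1}$ placed on the full return as noted earlier:

```python
def lambda_return(episode, t, V, lam=0.9, gamma=1.0):
    """G_t^lambda = (1-lam) * sum_{n=1}^{T-t-1} lam^(n-1) G_t^(n) + lam^(T-t-1) G_t."""
    T = len(episode)
    G_lam, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):                   # n = 1, ..., T-t-1
        G_lam += weight * n_step_return(episode, t, n, V, gamma)
        weight *= lam
    # The remaining weight lam^(T-t-1) goes on the full (MC) return.
    G_lam += (lam ** (T - t - 1)) * n_step_return(episode, t, T - t, V, gamma)
    return G_lam

def forward_td_lambda_update(V, episode, t, alpha=0.1, lam=0.9, gamma=1.0):
    """V(S_t) <- V(S_t) + alpha * (G_t^lambda - V(S_t))."""
    state_t, _ = episode[t]
    v = V.get(state_t, 0.0)
    V[state_t] = v + alpha * (lambda_return(episode, t, V, lam, gamma) - v)
```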
Disadvantages
Like MC, can only be computed from complete episodes
Inefficient, because returns reaching far into the future are computed over and over!
Backward View of TD($\lambda$)
Returns $G_t^{(n)}$ that look far into the future contribute little to the current update.
(Weight: $(1-\lambda)\lambda^{n-1}$; as n grows, the weight shrinks.)
Conversely, instead of updating the current state with future returns, we can think of it as updating past states with the current return. => Backward View of TD($\lambda$)
Here, states from the recent past should be updated more strongly, and states from the distant past more weakly.
This is implemented with eligibility traces.
Incremental mechanism for approximating the forward-view TD($\lambda$)
We can update online, every step, from incomplete sequences
Recall that the forward view of TD($\lambda$) requires complete episodes!
Eligibility Traces
When updating the past with the present, for a given state s:
1) Frequency heuristic: give more weight if s was visited frequently in the past
2) Recency heuristic: give more weight if s was visited recently
Eligibility traces combine both heuristics!
$\forall s \in \mathcal{S},$
$E_0(s) = 0$
$E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$
Image from: here
Backward View TD($\lambda$)
At every step, update the eligibility trace of every state s.
Notation
$E_t(s)$: eligibility trace of state s at time t
$\delta_t$: TD error (1-step return)
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
$V(s) \gets V(s) + \alpha\delta_t E_t(s), \quad \forall s \in \mathcal{S}$
Thanks to $E_t$, the current TD error (from the 1-step return) can update every state s by an appropriately scaled amount!
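A minimal backward-view sketch (my own, in the same Gymnasium-style loop assumed for the TD(0) sketch), using accumulating eligibility traces:

```python
from collections import defaultdict

def backward_td_lambda_episode(env, policy, V=None, alpha=0.1, gamma=1.0, lam=0.9):
    """One episode of backward-view TD(lambda) with accumulating eligibility traces."""
    V = V if V is not None else defaultdict(float)
    E = defaultdict(float)                      # eligibility traces, E_0(s) = 0
    state, _ = env.reset()
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(policy(state))
        done = terminated or truncated
        # 1-step TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
        delta = reward + (0.0 if terminated else gamma * V[next_state]) - V[state]
        for s in list(E):                       # decay every trace by gamma * lambda
            E[s] *= gamma * lam
        E[state] += 1.0                         # bump the trace of the visited state
        for s, trace in E.items():              # update all states in proportion to E
            V[s] += alpha * delta * trace
        state = next_state
    return V
```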
TD($\lambda$) and TD(0)
If $\lambda = 0$ in the formula for $E_t$ (i.e., the trace keeps no memory), this is the same as TD(0).
(That is why it is called TD(0)!)
=> Only the state visited at time t is updated
TD($\lambda$) and MC
If $\lambda = 1$ (so $\gamma$ is the only discounting factor) and updates are applied offline, it is the same as MC.
(i.e., TD(1) = MC when TD updates the value function offline)
=> Credit is deferred until the end of the episode
Relationship Between Forward and Backward TD
Theorem
The sum of offline updates is identical for forward-view and backward-view TD($\lambda$)
For a given state $s$,
$\sum\limits_{t=1}^T \alpha\delta_t E_t(s) = \sum\limits_{t=1}^T \alpha\left(G_t^\lambda - V(S_t)\right)\mathbf{1}(S_t = s)$
The left-hand side is the backward-view TD($\lambda$) update, and the right-hand side is the forward-view TD($\lambda$) update.
See slides 44-48 for details!
(offline) TD(1) is exactly equivalent to every-visit MC
(online) TD(1) is roughly equivalent to every-visit MC
*Note: Exact Online TD($\lambda$) achieves perfect equivalence by using a slightly different form of eligibility trace (Sutton and van Seijen, ICML 2014)*
Summary of Forward and Backward TD($\lambda$)
Image from: here
If you spot any typos or mistakes, please let me know in the comments!