These notes summarize Lecture 4: Model-Free Prediction (Youtube) from Professor David Silver's Introduction to Reinforcement Learning course (Website).
Introduction
So far we assumed that the environment (MDP) is known.
That is, for a given $\pi$, the transition matrix $\mathcal{P}^\pi$ and reward $\mathcal{R}^\pi$ were known, and we could use them to compute $\mathrm{v}_\pi$ and $\mathrm{q}_\pi$. (Lec3: "Planning" by DP)
From now on, the environment (MDP) is initially unknown!
That is, since $\mathcal{P}^\pi$ and $\mathcal{R}^\pi$ are unknown, we need a different way to find $\mathrm{v}_\pi$ and $\mathrm{q}_\pi$.
=> Model-Free Reinforcement Learning
Note: POMDP vs. initially unknown MDP
In a POMDP, the current state cannot be observed exactly. (It is represented probabilistically as a belief state.)
In an initially unknown MDP, we simply cannot predict the future exactly; the current state can still be observed.
Model-free prediction
Estimate the value function of an unknown MDP (Policy is given!)
Corresponds to Policy Evaluation from the last lecture (Planning)
Model-free control (next lecture)
Optimize the value function of an unknown MDP
Corresponds to Policy Improvement from the last lecture
Methods for Model-free prediction
Monte-Carlo Learning (MC)
Go all the way to the end of a trajectory, and estimate the value by looking at sample returns!
Inefficient but powerful
Temporal-Difference Learning (TD)
Go one step ahead and estimate the return from that step
More efficient
TD($\lambda$)
Monte-Carlo Reinforcement Learning
Learn from complete episodes (No bootstrapping)
Assume that value = empirical mean return
Can be applied to episodic MDPs (all episodes must terminate)
Monte-Carlo Policy Evaluation
A policy $\pi$ is given
Learn $\mathrm{v}_\pi$ from the agent's episodes of experience under $\pi$
$S_1, A_1, R_2, ..., S_k \sim \pi$
Recall,
Return (total discounted reward):
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... + \gamma^{T-t-1} R_T$
(Since we have the episode, this value can be computed!)
Value function (the expected return):
$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$
MC uses empirical mean return instead of expected return
First-Visit Monte-Carlo Policy Evaluation
Suppose we have several episodes.
To evaluate state s,
For episode in episodes:
Let t be the first time-step at which state s is visited in an episode
(s may be visited again after t, but we only consider t!)
Increment counter $N(s) \gets N(s) + 1$
Increment total return $S(s) \gets S(s) + G_t$
Value is estimated by $V(s) = S(s)/N(s)$
By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
Every-Visit Monte-Carlo Policy Evaluation
Suppose we have several episodes.
To evaluate state s,
For episode in episodes:
Let t be every time-step at which state s is visited in an episode
(If s is visited multiple times in a single episode, it can be updated multiple times.)
Increment counter $N(s) \gets N(s) + 1$
Increment total return $S(s) \gets S(s) + G_t$
Value is estimated by $V(s) = S(s)/N(s)$
Again, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
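As a concrete reference, here is a minimal tabular sketch of both variants in Python (my own illustration, not from the lecture). It assumes each episode is stored as a list of `(state, reward)` pairs, where the reward is $R_{t+1}$, the reward received after leaving that state.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Tabular MC prediction; set first_visit=False for every-visit MC."""
    N = defaultdict(int)    # visit counts N(s)
    S = defaultdict(float)  # accumulated returns S(s)

    for episode in episodes:
        # Compute the return G_t for every time-step by scanning backwards.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]          # reward here is R_{t+1}
            G = reward + gamma * G
            returns[t] = G

        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue                    # first-visit: count only the first visit
            seen.add(state)
            N[state] += 1
            S[state] += returns[t]

    return {s: S[s] / N[s] for s in N}      # V(s) = S(s) / N(s)
```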
Example: Blackjack Value Function after Monte-Carlo Learning
States:
Current sum (12-21) ... when the sum is below 12, it is always better to take another card
Dealer's showing card (ace-10)
Do I have a "useable" ace? (yes-no)
Action:
Stick: stop receiving cards (and terminate)
Twist: take another card
Reward:
Reward for stick:
+1 if sum of cards > sum of dealer cards
0 if sum of cards = sum of dealer cards
-1 if sum of cards < sum of dealer cards
Reward for twist:
-1 if sum of cards > 21 (and terminate)
0 otherwise (and select action again!)
Simple Policy: stick if sum of cards $\geq$ 20, otherwise twist.
The value function obtained by MC under this policy is shown in the figure below.
Image from: here
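As a small illustration of the fixed policy above, here is a hedged sketch of mine that assumes Gymnasium's `Blackjack-v1` environment (not the lecture's own simulator; there, action 0 = stick and 1 = hit/"twist") and collects episodes in the `(state, reward)` format used by the MC sketch above:

```python
import gymnasium as gym  # assumption: the Gymnasium package provides Blackjack-v1

STICK, TWIST = 0, 1      # Gymnasium's Blackjack actions: 0 = stick, 1 = hit ("twist")

def simple_policy(observation):
    """Stick once the player's sum reaches 20, otherwise twist."""
    player_sum, dealer_showing, usable_ace = observation
    return STICK if player_sum >= 20 else TWIST

def sample_episode(env):
    """Roll out one episode under the fixed policy as (state, reward) pairs."""
    episode, done = [], False
    obs, _ = env.reset()
    while not done:
        next_obs, reward, terminated, truncated, _ = env.step(simple_policy(obs))
        episode.append((obs, reward))
        obs, done = next_obs, terminated or truncated
    return episode

env = gym.make("Blackjack-v1")
episodes = [sample_episode(env) for _ in range(10_000)]
V = mc_policy_evaluation(episodes)  # reuse the MC sketch above
```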
Incremental Mean
The mean $\mu_1, \mu_2, ...$ of a sequence $x_1, x_2, ...$ can be computed incrementally: $\mu_k = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)$
Incremental Monte-Carlo Updates
Previously, we updated all at once after collecting a batch of episodes!
Now, update V(s) every episode incrementally
For each state $S_t$ with return $G_t$
$N(S_t) \gets N(S_t) + 1$
$V(S_t) \gets V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$
In non-stationary problems, it can be useful to track a running mean , i.e., forget old episodes
(In RL the policy keeps being updated, so a running mean that weights recent values more heavily is a better fit!)
$V(S_t) \gets V(S_t) + \alpha\left(G_t - V(S_t)\right)$
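A minimal sketch of the incremental update (my own; `alpha=None` falls back to the exact $1/N(s)$ mean, while a constant `alpha` gives the running mean for non-stationary problems):

```python
def incremental_mc_update(V, N, state, G, alpha=None):
    """Update V(state) toward the observed return G from one episode."""
    N[state] = N.get(state, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[state]   # 1/N(s) or constant alpha
    v = V.get(state, 0.0)
    V[state] = v + step * (G - v)                           # V <- V + step * (G - V)
```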
Temporal-Difference Learning
Learn directly from incomplete episodes by bootstrapping
Bootstrapping : Substituting the remainder of the trajectory with an estimate of what would happen from that point onward.
Updates a guess towards a guess
MC and TD
Goal: Learn $\mathrm{v}_\pi$ online from experience under policy $\pi$
Incremental every-visit MC:
$V(S_t) \gets V(S_t) + \alpha\left(G_t - V(S_t)\right)$
Update $V(S_t)$ toward $G_t$ (the actual return)
Simplest TD, TD(0):
$V(S_t) \gets V(S_t) + \alpha\left[\left(R_{t+1} + \gamma V(S_{t+1})\right) - V(S_t)\right]$
Update $V(S_t)$ toward $R_{t+1} + \gamma V(S_{t+1})$ (the estimated return)
$R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
The TD target is a one-step-more-accurate estimate than $V(S_t)$!
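A minimal TD(0) sketch (my own, assuming a Gymnasium-style `env.reset`/`env.step` interface and a `policy(state)` function supplied by the reader):

```python
from collections import defaultdict

def td0_episode(env, policy, V=None, alpha=0.1, gamma=1.0):
    """Run one episode of TD(0) prediction, updating V online at every step."""
    V = V if V is not None else defaultdict(float)
    state, _ = env.reset()
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(policy(state))
        done = terminated or truncated
        # TD target R_{t+1} + gamma * V(S_{t+1}); a terminal next state has value 0.
        td_target = reward + (0.0 if terminated else gamma * V[next_state])
        V[state] += alpha * (td_target - V[state])  # move V(S_t) toward the TD target
        state = next_state
    return V
```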
Advantages and Disadvantages of MC vs. TD
TD can learn before knowing the final outcome
MC must wait until the end of the episode before the return is known
TD can learn without the final outcome (can learn from incomplete sequences)
TD works in continuing environments
MC can only learn from complete sequences
MC only works for episodic (terminating) environments
Bias/Variance Trade-off
Bias: MC < TD
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$
(MC has zero bias => good convergence properties)
TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$
(Biased toward the previous $V$ => TD is sensitive to the initial values)
Variance: MC > TD
$G_t$ depends on many random actions, transitions, and rewards
(The more random variables you sum, the larger the variance)
TD target depends on one random action, transition, and reward
(Lower variance => more efficient!)
Example: Random Walk, MC vs. TD
Image from: here
RMS error averaged over states: $\sqrt{\mathbb{E}\left[\left(\mathrm{v}_{estimated} - \mathrm{v}_{true}\right)^2\right]}$
You can see that TD converges faster. (TD is more efficient!)
Batch MC and TD
MC and TD converge: $V(s) \to v_\pi(s)$ as #episodes $\to \infty$
What about a finite number of episodes?
(non-incremental) MC converges to solution with minimum MSE
Best fit to the observed returns
$\sum\limits_{k=1}^{K}\sum\limits_{t=1}^{T_k}\left(G_t^k - V(s_t^k)\right)^2$ ... minimize!
TD(0) converges to solution of maximum likelihood Markov model
Solution to the MDP $\langle \mathcal{S}, \mathcal{A}, \hat{\mathcal{P}}, \hat{\mathcal{R}}, \gamma \rangle$ that best fits the data
$\hat{\mathcal{P}}^a_{ss'} = \frac{1}{N(s,a)}\sum\limits_{k=1}^{K}\sum\limits_{t=1}^{T_k}\mathbf{1}(s_t^k, a_t^k, s_{t+1}^k = s, a, s')$
Transition Prob = Relative frequency
$\hat{\mathcal{R}}^a_{ss'} = \frac{1}{N(s,a)}\sum\limits_{k=1}^{K}\sum\limits_{t=1}^{T_k}\mathbf{1}(s_t^k, a_t^k = s, a)\, r_t^k$
Reward: the average reward observed for the pair $(s,a)$ across the K episodes
In other words, TD implicitly builds the MDP with $\hat{\mathcal{P}}^a_{ss'}$ and $\hat{\mathcal{R}}^a_{ss'}$!
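A minimal sketch of the implicit model that batch TD fits (my own, assuming episodes are available as lists of `(s, a, r, s_next)` transitions):

```python
from collections import defaultdict

def estimate_mdp_model(episodes):
    """Maximum-likelihood model: P_hat[(s, a)][s'] = relative frequency,
    R_hat[(s, a)] = mean observed reward."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    n_sa = defaultdict(int)

    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[(s, a)][s_next] += 1
            reward_sum[(s, a)] += r
            n_sa[(s, a)] += 1

    P_hat = {sa: {s2: c / n_sa[sa] for s2, c in succ.items()}
             for sa, succ in counts.items()}
    R_hat = {sa: reward_sum[sa] / n_sa[sa] for sa in n_sa}
    return P_hat, R_hat
```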
MC vs. TD
TD exploits the Markov property (more efficient in Markov environments)
MC does not exploit the Markov property (more effective in non-Markov environments)
Unified View
Bootstrapping and Sampling
Bootstrapping: update involves an estimate
MC does not bootstrap (MC uses actual return)
DP bootstraps
TD bootstraps
Sampling: update samples an expectation
MC samples
DP does not sample (DP considers all possibilities)
TD samples
Unified View of Reinforcement Learning
Image from: here
TD($\lambda$)
n-Step Target
The TD target we saw earlier: $R_{t+1} + \gamma V(S_{t+1})$
This is a 1-step target that only looks one state ahead of the current state.
Let TD target look n steps into the future!
Image from: here
n-step TD treats a rollout of n steps from the current state as the sample.
MC (= $\infty$-step TD) treats the rollout all the way to termination as the sample.
n-Step Return
$n=1 \quad\ G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$
$n=2 \quad\ G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
$\quad\vdots \qquad\qquad \vdots$
$n=\infty \quad G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-t-1} R_T$
Define the n-step return
$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$
This becomes the n-step TD target!
n-step TD learning
$V(S_t) \gets V(S_t) + \alpha\left(G_t^{(n)} - V(S_t)\right)$
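A minimal n-step sketch (my own), with episodes stored as `(state, reward)` pairs as in the MC sketch above:

```python
def n_step_return(episode, t, n, V, gamma=1.0):
    """G_t^(n) = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n}).
    If the episode ends within n steps, this reduces to the full (MC) return."""
    T = len(episode)
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        _, reward = episode[k]             # episode[k] holds (S_k, R_{k+1})
        G += discount * reward
        discount *= gamma
    if t + n < T:                          # bootstrap from V(S_{t+n}) if not terminal
        state_tn, _ = episode[t + n]
        G += discount * V.get(state_tn, 0.0)
    return G

def n_step_td_update(V, episode, t, n, alpha=0.1, gamma=1.0):
    """V(S_t) <- V(S_t) + alpha * (G_t^(n) - V(S_t))."""
    state_t, _ = episode[t]
    v = V.get(state_t, 0.0)
    V[state_t] = v + alpha * (n_step_return(episode, t, n, V, gamma) - v)
```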
Large Random Walk Example
What is the best value for n?
Image from: here
It depends on the problem and on the setting.
(The optimal n is not a robust value!)
Averaging n-Step Returns
Couldn't we compute returns for several values of n and average them?
e.g. average the 2-step and 4-step returns
$\frac{1}{2}G^{(2)} + \frac{1}{2}G^{(4)}$
Combines information from two different time-steps!
$\lambda$-return
At time t, let's combine the returns for all n, each multiplied by an appropriate weight!
The $\lambda$-return $G_t^\lambda$ combines all n-step returns $G_t^{(n)}$, using weight $(1-\lambda)\lambda^{n-1}$
$G_t^\lambda = (1-\lambda)\sum\limits_{n=1}^{\infty}\lambda^{n-1} G_t^{(n)}$
Note: $(1-\lambda)\sum\limits_{n=1}^{T-t-1}\lambda^{n-1} = (1-\lambda)\frac{1-\lambda^{T-t-1}}{1-\lambda} = 1-\lambda^{T-t-1} \approx 1$
=> The $\lambda$-return is a weighted average of $G_t^{(1)}, G_t^{(2)}, ..., G_t^{(T-t-1)}$!
=> To make the weights sum to exactly 1, add the term $\lambda^{T-t-1} G_t$ (the full return, which receives all the remaining weight) at the end of the formula for $G_t^\lambda$.
Note 2: $\lambda = 0$ gives TD(0), and $\lambda = 1$ gives MC
$\lambda$ is a hyperparameter; the optimal value has to be found.
Image from: here
Forward View of TD($\lambda$)
Image from: here
$V(S_t) \gets V(S_t) + \alpha\left(G_t^{\lambda} - V(S_t)\right)$
where $G_t^{\lambda} = (1-\lambda)\sum\limits_{n=1}^{\infty}\lambda^{n-1} G_t^{(n)}$
Update the value function towards the $\lambda$-return
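A minimal forward-view sketch (my own), reusing the `n_step_return` helper above and truncating the infinite sum at the end of the episode, with the leftover weight $\lambda^{T-t-1}$ placed on the full return as noted earlier:

```python
def lambda_return(episode, t, V, lam=0.9, gamma=1.0):
    """G_t^lambda = (1-lam) * sum_{n=1}^{T-t-1} lam^(n-1) G_t^(n) + lam^(T-t-1) G_t."""
    T = len(episode)
    G_lam, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):                   # n = 1, ..., T-t-1
        G_lam += weight * n_step_return(episode, t, n, V, gamma)
        weight *= lam
    # The remaining weight lam^(T-t-1) goes on the full (MC) return.
    G_lam += (lam ** (T - t - 1)) * n_step_return(episode, t, T - t, V, gamma)
    return G_lam

def forward_td_lambda_update(V, episode, t, alpha=0.1, lam=0.9, gamma=1.0):
    """V(S_t) <- V(S_t) + alpha * (G_t^lambda - V(S_t))."""
    state_t, _ = episode[t]
    v = V.get(state_t, 0.0)
    V[state_t] = v + alpha * (lambda_return(episode, t, V, lam, gamma) - v)
```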
Disadvantages
Like MC, can only be computed from complete episodes
Inefficient, because returns reaching far into the future are computed over and over!
Backward View of TD($\lambda$)
Returns $G_t^{(n)}$ that look far into the future contribute little to the current update.
(Weight: $(1-\lambda)\lambda^{n-1}$; as n grows, the weight shrinks.)
Conversely, instead of updating the current state with future returns, we can think of it as updating past states with the current return. => Backward View of TD($\lambda$)
Here, states from the recent past should be updated more strongly, and states from the distant past more weakly.
This is implemented with eligibility traces.
Incremental mechanism for approximating the forward-view TD($\lambda$)
We can update online, every step, from incomplete sequences
Recall that the forward view of TD($\lambda$) requires complete episodes!
Eligibility Traces
When updating the past with the present, for a given state s:
1) Frequency heuristic: give more weight if s was visited frequently in the past
2) Recency heuristic: give more weight if s was visited recently
Eligibility traces combine both heuristics!
$\forall s \in \mathcal{S},$
$E_0(s) = 0$
$E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$
Image from: here
Backward View TD($\lambda$)
At every step, update the eligibility trace of every state s.
Notation
$E_t(s)$: eligibility trace of state s at time t
$\delta_t$: TD error (1-step return)
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
$V(s) \gets V(s) + \alpha\delta_t E_t(s), \quad \forall s \in \mathcal{S}$
Thanks to $E_t$, the current TD error (from the 1-step return) can update every state s by an appropriately scaled amount!
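A minimal backward-view sketch (my own, in the same Gymnasium-style loop assumed for the TD(0) sketch), using accumulating eligibility traces:

```python
from collections import defaultdict

def backward_td_lambda_episode(env, policy, V=None, alpha=0.1, gamma=1.0, lam=0.9):
    """One episode of backward-view TD(lambda) with accumulating eligibility traces."""
    V = V if V is not None else defaultdict(float)
    E = defaultdict(float)                      # eligibility traces, E_0(s) = 0
    state, _ = env.reset()
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(policy(state))
        done = terminated or truncated
        # 1-step TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
        delta = reward + (0.0 if terminated else gamma * V[next_state]) - V[state]
        for s in list(E):                       # decay every trace by gamma * lambda
            E[s] *= gamma * lam
        E[state] += 1.0                         # bump the trace of the visited state
        for s, trace in E.items():              # update all states in proportion to E
            V[s] += alpha * delta * trace
        state = next_state
    return V
```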
TD($\lambda$) and TD(0)
If $\lambda = 0$ in the formula for $E_t$ (i.e., the trace keeps no memory), this is the same as TD(0).
(That is why it is called TD(0)!)
=> Only the state visited at time t is updated
TD($\lambda$) and MC
If $\lambda = 1$ (so $\gamma$ is the only discounting factor) and updates are applied offline, it is the same as MC.
(i.e., TD(1) = MC when TD updates the value function offline)
=> Credit is deferred until the end of the episode
Relationship Between Forward and Backward TD
Theorem
The sum of offline updates is identical for forward-view and backward-view TD($\lambda$)
For a given state $s$,
$\sum\limits_{t=1}^T \alpha\delta_t E_t(s) = \sum\limits_{t=1}^T \alpha\left(G_t^\lambda - V(S_t)\right)\mathbf{1}(S_t = s)$
The left-hand side is the backward-view TD($\lambda$) update, and the right-hand side is the forward-view TD($\lambda$) update.
See slides 44-48 for details!
(offline) TD(1) is exactly equivalent to every-visit MC
(online) TD(1) is roughly equivalent to every-visit MC
*Note: Exact Online TD($\lambda$) achieves perfect equivalence by using a slightly different form of eligibility trace (Sutton and van Seijen, ICML 2014)*
Summary of Forward and Backward TD($\lambda$)
Image from: here
If you spot any typos or mistakes, please let me know in the comments!