These are my notes on Professor David Silver's Introduction to Reinforcement Learning (Website),
Lecture 5: Model-Free Control (YouTube).
Introduction
Model-Free Reinforcement Learning
Last lecture covered model-free prediction
(Estimate the value function of an unknown MDP)
This lecture covers model-free control
(Optimize the value function of an unknown MDP)
Uses of Model-Free Control
MDP model is unknown, but experience can be sampled
(e.g. Portfolio management, Game of Go)
MDP model is known, but is too big to use, except by samples
(e.g. Helicopter, Ship Steering)
On and Off-Policy Learning
On-policy learning
"Learn on the job"
Learn about policy $\pi$ from experience sampled from $\pi$
Off-policy learning
"Look over someone's shoulder"
Learn about policy $\pi$ from experience sampled from $\mu$
(e.g. learning from a human's movements (a better policy) in order to control a robot!)
On-Policy Monte-Carlo Control
Generalised Policy Iteration (Lec. 3)
Image from: here
Model is known
Policy Evaluation: Estimate $v_\pi$
Policy Improvement: Generate $\pi' \geq \pi$
Generalised Policy Iteration With Monte-Carlo Evaluation
Now the model is unknown!
Policy Evaluation: MC policy evaluation (Lec. 4), $V = v_\pi$?
No! Since $V$ is obtained from samples, there is no guarantee that it equals $v_\pi$ (it is an approximation).
Policy Improvement: Greedy policy improvement?
Model-Free Policy Iteration Using Action-Value Function
Greedy policy improvement over $V(s)$ requires a model of the MDP
Recall, $\pi'(s)=\argmax\limits_{a\in\mathcal{A}}\left[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}V(s')\right]$
Note: knowing $\mathcal{R}^a_s$ & $\mathcal{P}^a_{ss'}$ $\iff$ knowing the MDP.
$\mathcal{P}$ could also be approximated from samples (as relative frequencies), but that is too expensive (it must be estimated for every state $\times$ action pair). Moreover, doing so amounts to building a model, which goes against the whole point of being model-free!
$V(s')$ is a relatively inaccurate estimate (of $v_\pi(s')$).
→ Both $\mathcal{P}$ and $V$ are estimates, so multiplying them makes the error even larger. (my own thought)
Greedy policy improvement over $Q(s,a)$ is model-free!
$\pi'(s)=\argmax\limits_{a\in\mathcal{A}}Q(s,a)$
Here, $Q(s,a)$ is the estimated action-value function
$Q(s,a)=\frac{1}{N(s,a)}\sum G_t$
Note: $G_t$ is an empirical return, with $R_{t+1}=\mathcal{R}^a_s$
$Q(s,a)$ is a relatively accurate estimate (of $q_\pi(s,a)$).
($\because$ $R_{t+1}$ comes directly from taking action $A_t$ in state $S_t$)
To summarize: in control, it is hard to do policy improvement with the state-value function $V(s)$.
So instead, we evaluate and improve using the action-value function $Q(s,a)$.
Generalised Policy Iteration with Action-Value Function
Image from: here
Policy Evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$
Apply the policy evaluation from Lec. 4 to $Q$.
Or, using the incremental mean: for each state $S_t$ and action $A_t$ in an episode,
$N(S_t,A_t) \gets N(S_t,A_t)+1$
$Q(S_t,A_t) \gets Q(S_t,A_t)+\frac{1}{N(S_t,A_t)}(G_t-Q(S_t,A_t))$
Policy Improvement: Greedy policy improvement?
With a greedy policy, we can get stuck in a local optimum.
(Because an $(s,a)$ pair that was never sampled by MC might actually be optimal!)
We need to make sure we continue to see everything!
Exploration: $\epsilon$-Greedy Policy Improvement
With prob. $1-\epsilon$, choose the greedy action
With prob. $\epsilon$, choose an action at random
$\pi(a|s)=\begin{cases} \epsilon/m+1-\epsilon & \text{if } a=a^*=\argmax\limits_{a'\in\mathcal{A}}Q(s,a')\\ \epsilon/m & \text{otherwise} \end{cases}$ where $m=|\mathcal{A}|$ is the number of actions.
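Below is a minimal Python sketch of this $\epsilon$-greedy rule (my own illustration, not from the lecture). `Q` is assumed to be a table mapping each state to an array of action values, and `rng` a NumPy random generator.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick an action epsilon-greedily w.r.t. Q[state] (an array of action values)."""
    m = len(Q[state])                      # number of actions
    if rng.random() < epsilon:
        return int(rng.integers(m))        # explore: uniform random action (prob. epsilon)
    return int(np.argmax(Q[state]))        # exploit: greedy action (prob. 1 - epsilon)
```

Choosing a uniformly random action with probability $\epsilon$ and the greedy action otherwise reproduces the distribution above, since the random branch also picks the greedy action with probability $\epsilon/m$.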
Theorem
For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ w.r.t. $q_\pi$ is an improvement,
i.e., $v_{\pi'}(s) \geq v_{\pi}(s)$
Proof)
$q_\pi(s,\pi'(s)) = \sum\limits_{a\in\mathcal{A}}\pi'(a|s)q_\pi(s,a)$
$= \epsilon/m\sum\limits_{a\in\mathcal{A}}q_\pi(s,a)+(1-\epsilon)\max\limits_{a\in\mathcal{A}}q_\pi(s,a)$
$\geq \epsilon/m\sum\limits_{a\in\mathcal{A}}q_\pi(s,a)+(1-\epsilon)\sum\limits_{a\in\mathcal{A}}\cfrac{\pi(a|s)-\epsilon/m}{1-\epsilon}q_\pi(s,a)$
$= \sum\limits_{a\in\mathcal{A}}\pi(a|s)q_\pi(s,a) = v_\pi(s)$
$\therefore v_{\pi'}(s) \geq v_\pi(s)$, since $q_\pi(s,\pi'(s)) \geq v_\pi(s)$ implies this by the policy improvement theorem (Lec. 3).
Monte-Carlo Policy Iteration
Image from: here
Policy Evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$
Policy Improvement: $\epsilon$-greedy policy improvement
Monte-Carlo Policy Control
Image from: here
Obtaining an accurate $Q$ requires sampling many episodes.
But since our goal is to find the optimal solution, updating episode-by-episode is more efficient.
Every episode:
Policy evaluation: Monte-Carlo policy evaluation, $Q \approx q_\pi$
Policy improvement: $\epsilon$-greedy policy improvement
GLIE
Greedy in the Limit with Infinite Exploration (GLIE) property
All state-action pairs are explored infinitely many times
$\lim\limits_{k\to\infin}N_k(s,a)=\infin$
The policy converges on a greedy policy
$\lim\limits_{k\to\infin}\pi_k(a|s)=\bold{1}\left(a=\argmax\limits_{a'\in\mathcal{A}}Q_k(s,a')\right)$
Training should satisfy the GLIE property!
$\epsilon$-greedy is GLIE if $\epsilon$ reduces to zero, e.g. $\epsilon_k=\frac{1}{k}$
GLIE Monte-Carlo Control
Sample the $k$-th episode using $\pi$: $\{S_1,A_1,R_2,...,S_T\}\sim\pi$
For each state $S_t$ and action $A_t$ in the episode,
$N(S_t,A_t) \gets N(S_t,A_t)+1$
$Q(S_t,A_t) \gets Q(S_t,A_t)+\frac{1}{N(S_t,A_t)}(G_t-Q(S_t,A_t))$
Improve policy based on new action-value function
$\epsilon \gets 1/k$
$\pi \gets \epsilon$-greedy$(Q)$
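A minimal sketch of the full GLIE Monte-Carlo control loop under stated assumptions: a hypothetical `env` with `reset()`/`step()` returning `(next_state, reward, done)`, tabular `Q`/`N` dictionaries, and the `epsilon_greedy` helper from the sketch above.

```python
from collections import defaultdict
import numpy as np

def glie_mc_control(env, n_actions, num_episodes, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))   # action-value table Q(s, a)
    N = defaultdict(lambda: np.zeros(n_actions))   # visit counts N(s, a)

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                          # GLIE schedule: epsilon_k = 1/k

        # Sample one episode {S1, A1, R2, ..., ST} following the current eps-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Incremental every-visit MC update: Q <- Q + (1/N)(G_t - Q)
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]
    return Q
```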
Theorem
GLIE Monte-Carlo control converges to the optimal action-value function,
i.e., $Q(s,a) \to q_*(s,a)$
On-Policy Temporal-Difference Learning
MC vs. TD Control
Advantages of TD learning over MC
Lower variance
Can learn online (Incomplete sequences)
Natural Idea: use TD instead of MC in our control loop
Apply TD to compute $Q(S,A)$
Use $\epsilon$-greedy policy improvement
Updating Action-Value Functions with Sarsa
In Lec. 4 we estimated $V$ with TD.
Sarsa estimates $Q$ with TD!
Sarsa:
$(S,A)$ ... current state & action
$R$ ......... reward
$S'$ ........ next state
$A'$ ........ next action (determined by some policy $\pi$)
$Q(S,A) \gets Q(S,A)+\alpha(R+\gamma Q(S',A')-Q(S,A))$
Here,
Q ( S , A ) Q(S,A) Q ( S , A ) is an estimation before taking a step
R + γ Q ( S ′ , A ′ ) R+\gamma{Q(S',A')} R + γ Q ( S ′ , A ′ ) is an estimation after taking a step
MC control에서 Q를 구하는 방법만 위와 같이 바꾸면 Sarsa가 된다!
On-Policy Control With Sarsa
Every time-step (vs. every episode for MC)
Policy Evaluation: Sarsa, $Q \approx q_\pi$
Policy Improvement: $\epsilon$-greedy policy improvement
Sarsa Algorithm for On-Policy Control
Image from: here
Note: $Q(s,a)$ is a state $\times$ action table!
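A sketch of the tabular Sarsa control loop corresponding to the pictured algorithm (my reconstruction, reusing the hypothetical `env` interface and the `epsilon_greedy` helper from the earlier sketches; a fixed $\epsilon$ and $\alpha$ are used here for simplicity).

```python
from collections import defaultdict
import numpy as np

def sarsa(env, n_actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))          # Q(s, a) table

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon, rng)   # choose A from S (eps-greedy)
        done = False
        while not done:
            next_state, reward, done = env.step(action)   # take A, observe R and S'
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)  # choose A' from S'
            td_target = reward + gamma * Q[next_state][next_action] * (not done)
            # Sarsa update: Q(S,A) <- Q(S,A) + alpha * (R + gamma*Q(S',A') - Q(S,A))
            Q[state][action] += alpha * (td_target - Q[state][action])
            state, action = next_state, next_action       # eps-greedy improvement is implicit
    return Q
```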
Theorem (Convergence of Sarsa)
Sarsa converges to the optimal action-value function under the following conditions:
GLIE sequence of policies $\pi_t(a|s)$
Robbins-Monro sequence of step-sizes $\alpha_t$:
$\sum\limits_{t=1}^{\infin}\alpha_t=\infin$
$\sum\limits_{t=1}^{\infin}\alpha_t^2<\infin$
=> The step size should be neither too small nor too big!
n-Step Sarsa
The TD target we used above is the 1-step Q-return!
Consider n-step returns for $n=1,2,...,\infin$:
$n=1 \quad\; q_t^{(1)}=R_{t+1}+\gamma Q(S_{t+1},A_{t+1})$ (Sarsa)
$n=2 \quad\; q_t^{(2)}=R_{t+1}+\gamma R_{t+2}+\gamma^2 Q(S_{t+2},A_{t+2})$
$\quad\vdots$
$n=\infin \quad q_t^{(\infin)}=R_{t+1}+\gamma R_{t+2}+...+\gamma^{T-t-1}R_T$ (MC)
Define the n-step Q-return
$q_t^{(n)}=R_{t+1}+\gamma R_{t+2}+...+\gamma^{n-1}R_{t+n}+\gamma^n Q(S_{t+n},A_{t+n})$
This becomes the TD target!
n-step Sarsa updates $Q(s,a)$ towards the n-step $Q$-return
$Q(S_t,A_t) \gets Q(S_t,A_t)+\alpha\left(q_t^{(n)}-Q(S_t,A_t)\right)$
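A small helper sketch for computing the n-step Q-return $q_t^{(n)}$ from the stored rewards of one episode and the current $Q$ table (illustrative only; the list layout `rewards[k] = R_{k+1}`, `states[k] = S_k`, `actions[k] = A_k` is my own assumption).

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma):
    """q_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * Q(S_{t+n}, A_{t+n}).
    Falls back to the plain MC return when t + n runs past the end of the episode."""
    T = len(rewards)                                # number of steps in the episode
    horizon = min(n, T - t)
    G = sum(gamma ** i * rewards[t + i] for i in range(horizon))
    if t + n < T:                                   # bootstrap from Q(S_{t+n}, A_{t+n})
        G += gamma ** n * Q[states[t + n]][actions[t + n]]
    return G
```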
Forward View Sarsa($\lambda$)
The $q^\lambda_t$ return combines all n-step $Q$-returns $q^{(n)}_t$
Using weight $(1-\lambda)\lambda^{n-1}$
$q^\lambda_t=(1-\lambda)\sum\limits_{n=1}^{\infin}\lambda^{n-1}q_t^{(n)}$
Forward-view Sarsa($\lambda$):
$Q(S_t,A_t) \gets Q(S_t,A_t)+\alpha\left(q^\lambda_t-Q(S_t,A_t)\right)$
Backward View Sarsa($\lambda$)
As in TD($\lambda$), we use eligibility traces! (for the online case)
But Sarsa($\lambda$) has one eligibility trace for each state-action pair
$E_0(s,a)=0$
$E_t(s,a)=\gamma\lambda E_{t-1}(s,a)+\bold{1}(S_t=s,A_t=a)$
At every step, the eligibility trace is updated for all $(s,a)$.
For TD error $\delta_t$ and eligibility trace $E_t(s,a)$,
$\delta_t=R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)$
$Q(s,a) \gets Q(s,a)+\alpha\delta_t E_t(s,a)$
Sarsa($\lambda$) Algorithm
Image from: here
Note: $Q(s,a)$ and $E(s,a)$ are each a state $\times$ action table!
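A minimal sketch of backward-view Sarsa($\lambda$) with accumulating eligibility traces, again assuming the hypothetical `env` interface and `epsilon_greedy` helper from the earlier sketches.

```python
from collections import defaultdict
import numpy as np

def sarsa_lambda(env, n_actions, num_episodes, lam=0.9, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))       # Q(s, a) table

    for _ in range(num_episodes):
        E = defaultdict(lambda: np.zeros(n_actions))   # eligibility traces, reset each episode
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            # TD error: delta = R + gamma*Q(S',A') - Q(S,A)
            delta = reward + gamma * Q[next_state][next_action] * (not done) - Q[state][action]
            E[state][action] += 1.0                    # accumulating trace for the visited pair
            # Update every visited (s, a) in proportion to its trace, then decay all traces
            for s in E:
                Q[s] += alpha * delta * E[s]
                E[s] *= gamma * lam
            state, action = next_state, next_action
    return Q
```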
Sarsa($\lambda$) Gridworld Example
Image from: here
If $Q$ is initialized to all zeros, then with one-step Sarsa only $Q(S',\uparrow)$ gets updated, where $S'$ is the state right below the goal and a reward is received only upon reaching the goal.
Sarsa($\lambda$) stores a weight for every state and action in $E(s,a)$ and updates $Q(s,a)$ for all $(s,a)$ in proportion to that weight, so even though the reward occurs only at the end of the episode, all the $Q$ values along the path are updated appropriately.
Off-Policy Learning
Evaluate target policy $\pi(a|s)$ to compute $v_\pi(s)$ or $q_\pi(s,a)$ while following behaviour policy $\mu(a|s)$
$\{S_1,A_1,R_2,...,S_T\}\sim\mu$
Importance Sampling
$\mathbb{E}_{X \sim P}[f(X)]=\sum P(X)f(X)=\sum Q(X)\cfrac{P(X)}{Q(X)}f(X)=\mathbb{E}_{X \sim Q}\left[\cfrac{P(X)}{Q(X)}f(X)\right]$
When we want the expectation of $f(X)$ under $P$ but sampling from $P$ is difficult, we can use samples from $Q$ instead.
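A tiny numerical illustration of this identity (my own example, not from the lecture): estimate $\mathbb{E}_{X\sim P}[f(X)]$ using samples drawn from $Q$, reweighted by $P(X)/Q(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.array([0, 1, 2])
P = np.array([0.7, 0.2, 0.1])          # target distribution P
Q = np.array([1/3, 1/3, 1/3])          # behaviour distribution Q (easy to sample from)
f = lambda x: x ** 2

samples = rng.choice(xs, size=100_000, p=Q)        # X ~ Q
weights = P[samples] / Q[samples]                  # importance weights P(X)/Q(X)
is_estimate = np.mean(weights * f(samples))        # approximates E_{X~P}[f(X)]
exact = np.sum(P * f(xs))                          # = 0.7*0 + 0.2*1 + 0.1*4 = 0.6
print(is_estimate, exact)                          # the two values should be close
```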
Importance Sampling for Off-Policy Monte-Carlo
$G_t$ is generated from $\mu$
We want to evaluate $\pi$ using $\mu$
To do this, we need to calculate $G_t^{\pi/\mu}$
By importance sampling,
$G_t^{\pi/\mu}=\cfrac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})...\pi(A_{T-1}|S_{T-1})}{\mu(A_t|S_t)\mu(A_{t+1}|S_{t+1})...\mu(A_{T-1}|S_{T-1})}G_t$
Note: Prob. of a trajectory
(We compute, under each policy, the probability of the trajectory that was sampled from $\mu$.)
Under $\pi$:
$P(A_t,S_{t+1},A_{t+1},...,S_T \mid S_{t:T},A_{t:T-1} \sim \pi)=\prod\limits_{k=t}^{T-1}\pi(A_k|S_k)P(S_{k+1}|S_k,A_k)$
Under $\mu$:
$P(A_t,S_{t+1},A_{t+1},...,S_T \mid S_{t:T},A_{t:T-1} \sim \mu)=\prod\limits_{k=t}^{T-1}\mu(A_k|S_k)P(S_{k+1}|S_k,A_k)$
$P(S_{k+1}|S_k,A_k)$ is fixed by the given trajectory and the environment and does not depend on the policy, so it cancels out in the ratio.
$\therefore G_t^{\pi/\mu}=\cfrac{\prod\limits_{k=t}^{T-1}\pi(A_k|S_k)}{\prod\limits_{k=t}^{T-1}\mu(A_k|S_k)}G_t$
The resulting $G_t^{\pi/\mu}$ can then be regarded as the return under $\pi$.
Update value towards the corrected return $G_t^{\pi/\mu}$
$V(S_t) \gets V(S_t)+\alpha\left(G_t^{\pi/\mu}-V(S_t)\right)$
Note:
$\mathbb{E}[G_t|S_t=s]=V_\mu(s)$
$\mathbb{E}[G_t^{\pi/\mu}|S_t=s]=V_\pi(s)$
Problems:
It cannot be used if a $\mu$ value in the denominator is zero.
The variance increases dramatically ($\because$ the variance of a ratio is unbounded).
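A sketch of this corrected-return update (ordinary importance sampling; my own illustration). The episode is assumed to be a list of $(S_t, A_t, R_{t+1})$ tuples generated by following $\mu$, and `pi(a, s)` / `mu(a, s)` are hypothetical callables returning the action probabilities $\pi(a|s)$ and $\mu(a|s)$.

```python
def off_policy_mc_update(V, episode, pi, mu, alpha, gamma=1.0):
    """episode: list of (S_t, A_t, R_{t+1}) tuples generated by following mu."""
    G, rho = 0.0, 1.0
    # Walk backwards so the correction ratio covers exactly the steps from t to T-1
    for state, action, reward in reversed(episode):
        G = reward + gamma * G                          # plain return G_t sampled under mu
        rho *= pi(action, state) / mu(action, state)    # prod_k pi(A_k|S_k) / mu(A_k|S_k)
        corrected_return = rho * G                      # G_t^{pi/mu}
        V[state] += alpha * (corrected_return - V[state])
    return V
```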
Importance Sampling for Off-Policy TD
TD targets are generated from μ \mu μ
We want to evaluate π \pi π using μ \mu μ
For 1-step TD target, we only need a single importance sampling correction
TD target from $\mu$: $R_{t+1}+\gamma V(S_{t+1})$
Corrected TD target: $\cfrac{\pi(A_t|S_t)}{\mu(A_t|S_t)}\left(R_{t+1}+\gamma V(S_{t+1})\right)$
Update value towards the corrected TD target
$V(S_t) \gets V(S_t) + \alpha\left(\cfrac{\pi(A_t|S_t)}{\mu(A_t|S_t)}\left(R_{t+1}+\gamma V(S_{t+1})\right)-V(S_t)\right)$
Much lower variance than MC! (though the variance still increases somewhat)
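A one-step sketch of the importance-sampled TD(0) update above, with the same hypothetical `pi`/`mu` callables as in the MC sketch.

```python
def off_policy_td_update(V, state, action, reward, next_state, pi, mu, alpha, gamma=1.0):
    rho = pi(action, state) / mu(action, state)            # single-step correction pi/mu
    corrected_target = rho * (reward + gamma * V[next_state])
    V[state] += alpha * (corrected_target - V[state])      # V <- V + alpha*(target - V)
    return V
```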
Off-Policy Evaluation by Q-learning
Now consider off-policy learning of action-values $Q(s,a)$
(Policy evaluation by Q-learning)
No importance sampling is required
The actual next action $A_{t+1}$ is chosen using $\mu$,
but we pick an alternative (imaginary) action $A'$ using $\pi$ and imagine its outcome.
"If I took A', what might it look like?"
Update $Q(S_t,A_t)$ towards the value of the alternative action
(We update in this direction because the outcome imagined under $\pi$ is closer to $Q$ under policy $\pi$.)
$Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha\left(R_{t+1} + \gamma Q(S_{t+1},A') - Q(S_t,A_t)\right)$
Off-Policy Control by Q-learning
Now we update both the behaviour policy ($\mu$) and the target policy ($\pi$)!
(Of course, there are cases where the behaviour policy cannot be updated...)
The target policy $\pi$ is greedy w.r.t. $Q(s,a)$
=> $\pi(S_{t+1})=\argmax\limits_{a'}Q(S_{t+1},a')$
The Q-learning target then simplifies:
$R_{t+1}+\gamma Q(S_{t+1},A')$
$=R_{t+1}+\gamma Q(S_{t+1},\argmax\limits_{a'}Q(S_{t+1},a'))$
$=R_{t+1}+\gamma\max\limits_{a'}Q(S_{t+1},a')$
$Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha\left(R_{t+1}+\gamma\max\limits_{a'}Q(S_{t+1},a')-Q(S_t,A_t)\right)$
The behaviour policy $\mu$ is e.g. $\epsilon$-greedy w.r.t. $Q(s,a)$
To make sure we visit all possible $(s,a)$ pairs
Theorem
Q-learning control converges to the optimal action-value function, i.e., $Q(s,a) \to q_*(s,a)$
Q-Learning Algorithm for off-policy control
Image from: here
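A minimal sketch of tabular Q-learning control corresponding to the pictured algorithm (my reconstruction, with the same hypothetical `env` and `epsilon_greedy` assumptions as the Sarsa sketch).

```python
from collections import defaultdict
import numpy as np

def q_learning(env, n_actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))       # Q(s, a) table

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy mu: eps-greedy w.r.t. Q (keeps exploring)
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = env.step(action)
            # Target policy pi: greedy w.r.t. Q -> bootstrap from max_a' Q(S', a')
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q
```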
Relationship Between DP and TD
Let $X \xleftarrow{\alpha} Y \equiv X \gets X + \alpha(Y-X)$
Image from: here
Note:
When doing on-policy control with Sarsa, Sarsa is used only for the policy evaluation of $Q$; for improvement, we take $\epsilon$-greedy actions with respect to the $Q$ obtained by Sarsa.
In contrast, when doing off-policy control with Q-learning, the actions we take do not come from the policy $\pi$ that we evaluate and improve ($\mu$ supplies the actions), so there is no need to maintain an explicit policy.
=> Analogously to value iteration, an implicitly greedy policy is chosen at every step.
If you spot any typos or mistakes, please let me know in the comments!