These are my notes on Professor David Silver's Introduction to Reinforcement Learning (Website),
Lecture 5: Model-Free Control (YouTube).
Introduction
Model-Free Reinforcement Learning
Last lecture covered model-free prediction
(Estimate the value function of an unknown MDP)
This lecture covers model-free control
(Optimize the value function of an unknown MDP)
Uses of Model-Free Control
MDP model is unknown, but experience can be sampled
(e.g. Portfolio management, Game of Go)
MDP model is known, but is too big to use, except by samples
(e.g. Helicopter, Ship Steering)
On and Off-Policy Learning
On-policy learning
"Learn on the job"
Learn about policy $\pi$ from experience sampled from $\pi$
Off-policy learning
"Look over someone's shoulder"
Learn about policy $\pi$ from experience sampled from $\mu$
(e.g. learning from a human's movements (a better policy) in order to control a robot!)
On-Policy Monte-Carlo Control
Generalised Policy Iteration (Lec. 3)
Image from: here
Model is known
Policy Evaluation: Estimate $v_\pi$
Policy Improvement: Generate $\pi' \geq \pi$
Generalised Policy Iteration With Monte-Carlo Evaluation
Now the model is unknown!
Policy Evaluation: MC policy evaluation (Lec. 4), $V = v_\pi$?
No! Since $V$ is obtained from samples, there is no guarantee that it equals $v_\pi$ (it is an approximation).
Policy Improvement: Greedy policy improvement?
Model-Free Policy Iteration Using Action-Value Function
Greedy policy improvement over $V(s)$ requires a model of the MDP
Recall, $\pi'(s)=\argmax\limits_{a\in\mathcal{A}}\left[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}V(s')\right]$
Note: knowing $\mathcal{R}^a_s$ & $\mathcal{P}^a_{ss'}$ $\iff$ knowing the MDP.
$\mathcal{P}$ could also be approximated from samples (as relative frequencies), but that is too expensive (it must be estimated for every state $\times$ action pair). Moreover, doing so amounts to building a model, which goes against the whole point of being model-free!
$V(s')$ is a relatively inaccurate estimate (of $v_\pi(s')$).
→ Both $\mathcal{P}$ and $V$ are estimates, so multiplying them makes the error even larger. (my own thought)
Greedy policy improvement over $Q(s,a)$ is model-free!
$\pi'(s)=\argmax\limits_{a\in\mathcal{A}}Q(s,a)$
Here, $Q(s,a)$ is the estimated action-value function
$Q(s,a)=\frac{1}{N(s,a)}\sum G_t$
Note: $G_t$ is an empirical return, with $R_{t+1}=\mathcal{R}^a_s$
$Q(s,a)$ is a relatively accurate estimate (of $q_\pi(s,a)$).
($\because$ $R_{t+1}$ comes directly from taking action $A_t$ in state $S_t$)
To summarize: in control, it is hard to do policy improvement with the state-value function $V(s)$.
So instead, we evaluate and improve using the action-value function $Q(s,a)$.
Generalised Policy Iteration with Action-Value Function
Image from: here
Policy Evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$
Apply the policy evaluation from Lec. 4 to $Q$.
Or, using the incremental mean: for each state $S_t$ and action $A_t$ in an episode,
$N(S_t,A_t) \gets N(S_t,A_t)+1$
$Q(S_t,A_t) \gets Q(S_t,A_t)+\frac{1}{N(S_t,A_t)}(G_t-Q(S_t,A_t))$
Policy Improvement: Greedy policy improvement?
With a greedy policy, we can get stuck in a local optimum.
(Because an $(s,a)$ pair that was never sampled by MC might actually be optimal!)
We need to make sure we continue to see everything!
Exploration: $\epsilon$-Greedy Policy Improvement
With prob. $1-\epsilon$, choose the greedy action
With prob. $\epsilon$, choose an action at random
$\pi(a|s)=\begin{cases} \epsilon/m+1-\epsilon & \text{if } a=a^*=\argmax\limits_{a'\in\mathcal{A}}Q(s,a')\\ \epsilon/m & \text{otherwise} \end{cases}$ where $m=|\mathcal{A}|$ is the number of actions.
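Below is a minimal Python sketch of this $\epsilon$-greedy rule (my own illustration, not from the lecture). `Q` is assumed to be a table mapping each state to an array of action values, and `rng` a NumPy random generator.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick an action epsilon-greedily w.r.t. Q[state] (an array of action values)."""
    m = len(Q[state])                      # number of actions
    if rng.random() < epsilon:
        return int(rng.integers(m))        # explore: uniform random action (prob. epsilon)
    return int(np.argmax(Q[state]))        # exploit: greedy action (prob. 1 - epsilon)
```

Choosing a uniformly random action with probability $\epsilon$ and the greedy action otherwise reproduces the distribution above, since the random branch also picks the greedy action with probability $\epsilon/m$.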
Theorem
For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ w.r.t. $q_\pi$ is an improvement,
i.e., $v_{\pi'}(s) \geq v_{\pi}(s)$
Proof)
$q_\pi(s,\pi'(s)) = \sum\limits_{a\in\mathcal{A}}\pi'(a|s)q_\pi(s,a)$
$= \epsilon/m\sum\limits_{a\in\mathcal{A}}q_\pi(s,a)+(1-\epsilon)\max\limits_{a\in\mathcal{A}}q_\pi(s,a)$
$\geq \epsilon/m\sum\limits_{a\in\mathcal{A}}q_\pi(s,a)+(1-\epsilon)\sum\limits_{a\in\mathcal{A}}\cfrac{\pi(a|s)-\epsilon/m}{1-\epsilon}q_\pi(s,a)$
$= \sum\limits_{a\in\mathcal{A}}\pi(a|s)q_\pi(s,a) = v_\pi(s)$
$\therefore v_{\pi'}(s) \geq v_\pi(s)$, since $q_\pi(s,\pi'(s)) \geq v_\pi(s)$ implies this by the policy improvement theorem (Lec. 3).
Monte-Carlo Policy Iteration
Image from: here
Policy Evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$
Policy Improvement: $\epsilon$-greedy policy improvement
Monte-Carlo Policy Control
Image from: here
Obtaining an accurate $Q$ requires sampling many episodes.
But since our goal is to find the optimal solution, updating episode-by-episode is more efficient.
Every episode:
Policy evaluation: Monte-Carlo policy evaluation, $Q \approx q_\pi$
Policy improvement: $\epsilon$-greedy policy improvement
GLIE
Greedy in the Limit with Infinite Exploration (GLIE) property
All state-action pairs are explored infinitely many times
$\lim\limits_{k\to\infin}N_k(s,a)=\infin$
The policy converges on a greedy policy
$\lim\limits_{k\to\infin}\pi_k(a|s)=\bold{1}\left(a=\argmax\limits_{a'\in\mathcal{A}}Q_k(s,a')\right)$
Training should satisfy the GLIE property!
$\epsilon$-greedy is GLIE if $\epsilon$ reduces to zero, e.g. $\epsilon_k=\frac{1}{k}$
GLIE Monte-Carlo Control
Sample the $k$-th episode using $\pi$: $\{S_1,A_1,R_2,...,S_T\}\sim\pi$
For each state $S_t$ and action $A_t$ in the episode,
$N(S_t,A_t) \gets N(S_t,A_t)+1$
$Q(S_t,A_t) \gets Q(S_t,A_t)+\frac{1}{N(S_t,A_t)}(G_t-Q(S_t,A_t))$
Improve policy based on new action-value function
$\epsilon \gets 1/k$
$\pi \gets \epsilon$-greedy$(Q)$
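A minimal sketch of the full GLIE Monte-Carlo control loop under stated assumptions: a hypothetical `env` with `reset()`/`step()` returning `(next_state, reward, done)`, tabular `Q`/`N` dictionaries, and the `epsilon_greedy` helper from the sketch above.

```python
from collections import defaultdict
import numpy as np

def glie_mc_control(env, n_actions, num_episodes, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))   # action-value table Q(s, a)
    N = defaultdict(lambda: np.zeros(n_actions))   # visit counts N(s, a)

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                          # GLIE schedule: epsilon_k = 1/k

        # Sample one episode {S1, A1, R2, ..., ST} following the current eps-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Incremental every-visit MC update: Q <- Q + (1/N)(G_t - Q)
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]
    return Q
```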
Theorem
GLIE Monte-Carlo control converges to the optimal action-value function,
i.e., $Q(s,a) \to q_*(s,a)$
On-Policy Temporal-Difference Learning
MC vs. TD Control
Advantages of TD learning over MC
Lower variance
Can learn online (Incomplete sequences)
Natural Idea: use TD instead of MC in our control loop
Apply TD to compute $Q(S,A)$
Use $\epsilon$-greedy policy improvement
Updating Action-Value Functions with Sarsa
In Lec. 4 we estimated $V$ with TD.
Sarsa estimates $Q$ with TD!
Sarsa:
$(S,A)$ ... current state & action
$R$ ......... reward
$S'$ ........ next state
$A'$ ........ next action (determined by some policy $\pi$)
$Q(S,A) \gets Q(S,A)+\alpha(R+\gamma Q(S',A')-Q(S,A))$
Here,
Q ( S , A ) Q(S,A) Q ( S , A ) is an estimation before taking a step
R + γ Q ( S ′ , A ′ ) R+\gamma{Q(S',A')} R + γ Q ( S ′ , A ′ ) is an estimation after taking a step
MC control에서 Q를 구하는 방법만 위와 같이 바꾸면 Sarsa가 된다!
On-Policy Control With Sarsa
Every time-step (vs. every episode for MC)
Policy Evaluation: Sarsa, $Q \approx q_\pi$
Policy Improvement: $\epsilon$-greedy policy improvement
Sarsa Algorithm for On-Policy Control
Image from: here
Note: $Q(s,a)$ is a state $\times$ action table!
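A sketch of the tabular Sarsa control loop corresponding to the pictured algorithm (my reconstruction, reusing the hypothetical `env` interface and the `epsilon_greedy` helper from the earlier sketches; a fixed $\epsilon$ and $\alpha$ are used here for simplicity).

```python
from collections import defaultdict
import numpy as np

def sarsa(env, n_actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))          # Q(s, a) table

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon, rng)   # choose A from S (eps-greedy)
        done = False
        while not done:
            next_state, reward, done = env.step(action)   # take A, observe R and S'
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)  # choose A' from S'
            td_target = reward + gamma * Q[next_state][next_action] * (not done)
            # Sarsa update: Q(S,A) <- Q(S,A) + alpha * (R + gamma*Q(S',A') - Q(S,A))
            Q[state][action] += alpha * (td_target - Q[state][action])
            state, action = next_state, next_action       # eps-greedy improvement is implicit
    return Q
```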
Theorem (Convergence of Sarsa)
Sarsa converges to the optimal action-value function under the following conditions:
GLIE sequence of policies $\pi_t(a|s)$
Robbins-Monro sequence of step-sizes $\alpha_t$:
$\sum\limits_{t=1}^{\infin}\alpha_t=\infin$
$\sum\limits_{t=1}^{\infin}\alpha_t^2<\infin$
=> The step size should be neither too small nor too big!
n-Step Sarsa
The TD target we used above is the 1-step Q-return!
Consider n-step returns for $n=1,2,...,\infin$:
$n=1 \quad\; q_t^{(1)}=R_{t+1}+\gamma Q(S_{t+1},A_{t+1})$ (Sarsa)
$n=2 \quad\; q_t^{(2)}=R_{t+1}+\gamma R_{t+2}+\gamma^2 Q(S_{t+2},A_{t+2})$
$\quad\vdots$
$n=\infin \quad q_t^{(\infin)}=R_{t+1}+\gamma R_{t+2}+...+\gamma^{T-t-1}R_T$ (MC)
Define the n-step Q-return
$q_t^{(n)}=R_{t+1}+\gamma R_{t+2}+...+\gamma^{n-1}R_{t+n}+\gamma^n Q(S_{t+n},A_{t+n})$
This becomes the TD target!
n-step Sarsa updates $Q(s,a)$ towards the n-step $Q$-return
$Q(S_t,A_t) \gets Q(S_t,A_t)+\alpha\left(q_t^{(n)}-Q(S_t,A_t)\right)$
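A small helper sketch for computing the n-step Q-return $q_t^{(n)}$ from the stored rewards of one episode and the current $Q$ table (illustrative only; the list layout `rewards[k] = R_{k+1}`, `states[k] = S_k`, `actions[k] = A_k` is my own assumption).

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma):
    """q_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * Q(S_{t+n}, A_{t+n}).
    Falls back to the plain MC return when t + n runs past the end of the episode."""
    T = len(rewards)                                # number of steps in the episode
    horizon = min(n, T - t)
    G = sum(gamma ** i * rewards[t + i] for i in range(horizon))
    if t + n < T:                                   # bootstrap from Q(S_{t+n}, A_{t+n})
        G += gamma ** n * Q[states[t + n]][actions[t + n]]
    return G
```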
Forward View Sarsa($\lambda$)
The $q^\lambda_t$ return combines all n-step $Q$-returns $q^{(n)}_t$
Using weight $(1-\lambda)\lambda^{n-1}$
$q^\lambda_t=(1-\lambda)\sum\limits_{n=1}^{\infin}\lambda^{n-1}q_t^{(n)}$
Forward-view Sarsa($\lambda$):
$Q(S_t,A_t) \gets Q(S_t,A_t)+\alpha\left(q^\lambda_t-Q(S_t,A_t)\right)$
Backward View Sarsa($\lambda$)
As in TD($\lambda$), we use eligibility traces! (for the online case)
But Sarsa($\lambda$) has one eligibility trace for each state-action pair
$E_0(s,a)=0$
$E_t(s,a)=\gamma\lambda E_{t-1}(s,a)+\bold{1}(S_t=s,A_t=a)$
At every step, the eligibility trace is updated for all $(s,a)$.
For TD error $\delta_t$ and eligibility trace $E_t(s,a)$,
$\delta_t=R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)$
$Q(s,a) \gets Q(s,a)+\alpha\delta_t E_t(s,a)$
Sarsa($\lambda$) Algorithm
Image from: here
Note: $Q(s,a)$ and $E(s,a)$ are each a state $\times$ action table!
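A minimal sketch of backward-view Sarsa($\lambda$) with accumulating eligibility traces, again assuming the hypothetical `env` interface and `epsilon_greedy` helper from the earlier sketches.

```python
from collections import defaultdict
import numpy as np

def sarsa_lambda(env, n_actions, num_episodes, lam=0.9, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))       # Q(s, a) table

    for _ in range(num_episodes):
        E = defaultdict(lambda: np.zeros(n_actions))   # eligibility traces, reset each episode
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            # TD error: delta = R + gamma*Q(S',A') - Q(S,A)
            delta = reward + gamma * Q[next_state][next_action] * (not done) - Q[state][action]
            E[state][action] += 1.0                    # accumulating trace for the visited pair
            # Update every visited (s, a) in proportion to its trace, then decay all traces
            for s in E:
                Q[s] += alpha * delta * E[s]
                E[s] *= gamma * lam
            state, action = next_state, next_action
    return Q
```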
Sarsa($\lambda$) Gridworld Example
Image from: here
If $Q$ is initialized to all zeros, then with one-step Sarsa only $Q(S',\uparrow)$ gets updated, where $S'$ is the state right below the goal and a reward is received only upon reaching the goal.
Sarsa($\lambda$) stores a weight for every state and action in $E(s,a)$ and updates $Q(s,a)$ for all $(s,a)$ in proportion to that weight, so even though the reward occurs only at the end of the episode, all the $Q$ values along the path are updated appropriately.
Off-Policy Learning
Evaluate target policy $\pi(a|s)$ to compute $v_\pi(s)$ or $q_\pi(s,a)$ while following behaviour policy $\mu(a|s)$
$\{S_1,A_1,R_2,...,S_T\}\sim\mu$
Importance Sampling
$\mathbb{E}_{X \sim P}[f(X)]=\sum P(X)f(X)=\sum Q(X)\cfrac{P(X)}{Q(X)}f(X)=\mathbb{E}_{X \sim Q}\left[\cfrac{P(X)}{Q(X)}f(X)\right]$
When we want the expectation of $f(X)$ under $P$ but sampling from $P$ is difficult, we can use samples from $Q$ instead.
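A tiny numerical illustration of this identity (my own example, not from the lecture): estimate $\mathbb{E}_{X\sim P}[f(X)]$ using samples drawn from $Q$, reweighted by $P(X)/Q(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.array([0, 1, 2])
P = np.array([0.7, 0.2, 0.1])          # target distribution P
Q = np.array([1/3, 1/3, 1/3])          # behaviour distribution Q (easy to sample from)
f = lambda x: x ** 2

samples = rng.choice(xs, size=100_000, p=Q)        # X ~ Q
weights = P[samples] / Q[samples]                  # importance weights P(X)/Q(X)
is_estimate = np.mean(weights * f(samples))        # approximates E_{X~P}[f(X)]
exact = np.sum(P * f(xs))                          # = 0.7*0 + 0.2*1 + 0.1*4 = 0.6
print(is_estimate, exact)                          # the two values should be close
```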
Importance Sampling for Off-Policy Monte-Carlo
$G_t$ is generated from $\mu$
We want to evaluate $\pi$ using $\mu$
To do this, we need to calculate $G_t^{\pi/\mu}$
By importance sampling,
$G_t^{\pi/\mu}=\cfrac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})...\pi(A_{T-1}|S_{T-1})}{\mu(A_t|S_t)\mu(A_{t+1}|S_{t+1})...\mu(A_{T-1}|S_{T-1})}G_t$
Note: Prob. of a trajectory
(We compute, under each policy, the probability of the trajectory that was sampled from $\mu$.)
Under $\pi$:
$P(A_t,S_{t+1},A_{t+1},...,S_T \mid S_{t:T},A_{t:T-1} \sim \pi)=\prod\limits_{k=t}^{T-1}\pi(A_k|S_k)P(S_{k+1}|S_k,A_k)$
Under $\mu$:
$P(A_t,S_{t+1},A_{t+1},...,S_T \mid S_{t:T},A_{t:T-1} \sim \mu)=\prod\limits_{k=t}^{T-1}\mu(A_k|S_k)P(S_{k+1}|S_k,A_k)$
$P(S_{k+1}|S_k,A_k)$ is fixed by the given trajectory and the environment and does not depend on the policy, so it cancels out in the ratio.
$\therefore G_t^{\pi/\mu}=\cfrac{\prod\limits_{k=t}^{T-1}\pi(A_k|S_k)}{\prod\limits_{k=t}^{T-1}\mu(A_k|S_k)}G_t$
The resulting $G_t^{\pi/\mu}$ can then be regarded as the return under $\pi$.
Update value towards the corrected return $G_t^{\pi/\mu}$
$V(S_t) \gets V(S_t)+\alpha\left(G_t^{\pi/\mu}-V(S_t)\right)$
Note:
$\mathbb{E}[G_t|S_t=s]=V_\mu(s)$
$\mathbb{E}[G_t^{\pi/\mu}|S_t=s]=V_\pi(s)$
Problems:
It cannot be used if a $\mu$ value in the denominator is zero.
The variance increases dramatically ($\because$ the variance of a ratio is unbounded).
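A sketch of this corrected-return update (ordinary importance sampling; my own illustration). The episode is assumed to be a list of $(S_t, A_t, R_{t+1})$ tuples generated by following $\mu$, and `pi(a, s)` / `mu(a, s)` are hypothetical callables returning the action probabilities $\pi(a|s)$ and $\mu(a|s)$.

```python
def off_policy_mc_update(V, episode, pi, mu, alpha, gamma=1.0):
    """episode: list of (S_t, A_t, R_{t+1}) tuples generated by following mu."""
    G, rho = 0.0, 1.0
    # Walk backwards so the correction ratio covers exactly the steps from t to T-1
    for state, action, reward in reversed(episode):
        G = reward + gamma * G                          # plain return G_t sampled under mu
        rho *= pi(action, state) / mu(action, state)    # prod_k pi(A_k|S_k) / mu(A_k|S_k)
        corrected_return = rho * G                      # G_t^{pi/mu}
        V[state] += alpha * (corrected_return - V[state])
    return V
```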
Importance Sampling for Off-Policy TD
TD targets are generated from μ \mu μ
We want to evaluate π \pi π using μ \mu μ
For 1-step TD target, we only need a single importance sampling correction
TD target from $\mu$: $R_{t+1}+\gamma V(S_{t+1})$
Corrected TD target: $\cfrac{\pi(A_t|S_t)}{\mu(A_t|S_t)}\left(R_{t+1}+\gamma V(S_{t+1})\right)$
Update value towards the corrected TD target
$V(S_t) \gets V(S_t) + \alpha\left(\cfrac{\pi(A_t|S_t)}{\mu(A_t|S_t)}\left(R_{t+1}+\gamma V(S_{t+1})\right)-V(S_t)\right)$
Much lower variance than MC! (though the variance still increases somewhat)
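A one-step sketch of the importance-sampled TD(0) update above, with the same hypothetical `pi`/`mu` callables as in the MC sketch.

```python
def off_policy_td_update(V, state, action, reward, next_state, pi, mu, alpha, gamma=1.0):
    rho = pi(action, state) / mu(action, state)            # single-step correction pi/mu
    corrected_target = rho * (reward + gamma * V[next_state])
    V[state] += alpha * (corrected_target - V[state])      # V <- V + alpha*(target - V)
    return V
```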
Off-Policy Evaluation by Q-learning
Now consider off-policy learning of action-values $Q(s,a)$
(Policy evaluation by Q-learning)
No importance sampling is required
The actual next action $A_{t+1}$ is chosen using $\mu$,
but we pick an alternative (imaginary) action $A'$ using $\pi$ and imagine its outcome.
"If I took A', what might it look like?"
Update $Q(S_t,A_t)$ towards the value of the alternative action
(We update in this direction because the outcome imagined under $\pi$ is closer to $Q$ under policy $\pi$.)
$Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha\left(R_{t+1} + \gamma Q(S_{t+1},A') - Q(S_t,A_t)\right)$
Off-Policy Control by Q-learning
Now we update both the behaviour policy ($\mu$) and the target policy ($\pi$)!
(Of course, there are cases where the behaviour policy cannot be updated...)
The target policy $\pi$ is greedy w.r.t. $Q(s,a)$
=> $\pi(S_{t+1})=\argmax\limits_{a'}Q(S_{t+1},a')$
The Q-learning target then simplifies:
$R_{t+1}+\gamma Q(S_{t+1},A')$
$=R_{t+1}+\gamma Q(S_{t+1},\argmax\limits_{a'}Q(S_{t+1},a'))$
$=R_{t+1}+\gamma\max\limits_{a'}Q(S_{t+1},a')$
$Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha\left(R_{t+1}+\gamma\max\limits_{a'}Q(S_{t+1},a')-Q(S_t,A_t)\right)$
The behaviour policy $\mu$ is e.g. $\epsilon$-greedy w.r.t. $Q(s,a)$
To make sure we visit all possible $(s,a)$ pairs
Theorem
Q-learning control converges to the optimal action-value function, i.e., $Q(s,a) \to q_*(s,a)$
Q-Learning Algorithm for off-policy control
Image from: here
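A minimal sketch of tabular Q-learning control corresponding to the pictured algorithm (my reconstruction, with the same hypothetical `env` and `epsilon_greedy` assumptions as the Sarsa sketch).

```python
from collections import defaultdict
import numpy as np

def q_learning(env, n_actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))       # Q(s, a) table

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy mu: eps-greedy w.r.t. Q (keeps exploring)
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = env.step(action)
            # Target policy pi: greedy w.r.t. Q -> bootstrap from max_a' Q(S', a')
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q
```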
Relationship Between DP and TD
Let $X \xleftarrow{\alpha} Y \equiv X \gets X + \alpha(Y-X)$
Image from: here
Note:
When doing on-policy control with Sarsa, Sarsa is used only for the policy evaluation of $Q$; for improvement, we take $\epsilon$-greedy actions with respect to the $Q$ obtained by Sarsa.
In contrast, when doing off-policy control with Q-learning, the actions we take do not come from the policy $\pi$ that we evaluate and improve ($\mu$ supplies the actions), so there is no need to maintain an explicit policy.
=> Analogously to value iteration, an implicitly greedy policy is chosen at every step.
If you spot any typos or mistakes, please let me know in the comments!