These are my notes on Lecture 3: Planning by Dynamic Programming (Youtube) from Professor David Silver's Introduction to Reinforcement Learning course (Website).
Introduction
What is Dynamic Programming (DP)?
- Dynamic: sequential or temporal component to the problem
- Programming: optimizing a "program", i.e. a policy
- Dynamic Programming: optimization for sequential problems by breaking them down into subproblems
- RL ⊂ DP
Requirements for Dynamic Programming
- 1) Optimal substructure
- Principle of optimality applies
- Principle of optimality: optimal solution can be decomposed into subproblems
- 2) Overlapping subproblems
- Subproblems recur many times
- Solutions can be cached and reused
- MDPs satisfy both of these conditions.
- Bellman equation gives recursive decomposition
$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$
- Value function stores and reuses solutions
Planning by Dynamic Programming
- Recall, planning is not a 'full' RL problem.
- Planning: The environment is fully known
- Reinforcement Learning: The environment is initially unknown
- DP assumes full knowledge of the MDP and is used for planning in the MDP
- Knowing the MDP ⟺ knowing the environment
- Prediction = (Policy) Evaluation
- Given a fixed policy, evaluate how good that policy is.
- Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and policy $\pi$,
  or: MRP $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$
- Output: value function $v_\pi$
- Control
- Find the optimal policy.
- Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- Output: optimal value function $v_*$ and optimal policy $\pi_*$
- "Solving the MDP" usually refers to solving the control problem.
Policy Evaluation
Iterative Policy Evaluation
- Evaluate a given policy π by iterative application of Bellman expectation backup
- This is the process of computing the state-value function $v_\pi$.
- Recall the Bellman expectation equation:
  $$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s,a) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \Big)$$
- In matrix form, $v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi$
  => Direct solution: $v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$
  (The matrix inverse is too expensive to compute for large state spaces, so we compute $v_\pi$ with DP instead!)
- $v_1 \to v_2 \to v_3 \to \dots \to v_\pi$
  Update an arbitrary vector $v_1$ iteratively; it converges to $v_\pi$!
  Note: $v_k = [v_k(1), v_k(2), \dots, v_k(n)]^T$
- $v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big)$
- Applying the Bellman expectation equation iteratively like this gives $v_{k+1}$ from $v_k$! (See the sketch below.)
- Keep going and it converges to $v_\pi$. (The proof comes later, in the contraction mapping section.)
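A minimal sketch of iterative policy evaluation for a tabular MDP, assuming (the array names and shapes are my own, not from the lecture) transition probabilities `P[a, s, s_next]`, expected rewards `R[s, a]`, and a stochastic policy `pi[s, a]` as NumPy arrays:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, n_sweeps=1000):
    """Evaluate policy pi by repeated Bellman expectation backups.

    P:  (n_actions, n_states, n_states) transition probabilities P[a, s, s']
    R:  (n_states, n_actions) expected immediate rewards R[s, a]
    pi: (n_states, n_actions) action probabilities pi(a | s)
    """
    n_states = R.shape[0]
    v = np.zeros(n_states)                     # arbitrary v_1
    for _ in range(n_sweeps):
        q = R + gamma * (P @ v).T              # q[s, a] = R_s^a + gamma * sum_s' P_ss'^a v(s')
        v = (pi * q).sum(axis=1)               # v_{k+1}(s) = sum_a pi(a|s) q(s, a)
    return v
```

For a small MDP the result can be checked against the direct matrix solution, e.g. `np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)`, where `P_pi` and `R_pi` are the policy-averaged transition matrix and reward vector built from `P`, `R`, and `pi`.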
Example: Iterative Policy Evaluation in Small Gridworld
- Rules
  - The agent follows a uniform random policy (an action that would move off the grid leaves the agent where it is).
  - Reaching the top-left or bottom-right cell terminates the episode.
  - Reward is -1 every step until the terminal state is reached
    (so it is best to reach a terminal state as quickly as possible).
  - $\gamma = 1.0$
Image from: here
- Let $s$ be the state marked with the red circle. Under the uniform random policy each of the four actions has probability 1/4, and the action that would leave the grid keeps the agent in place, so
  $$v_2(s) = \mathcal{R}^\pi_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^\pi_{ss'} v_1(s') = (-1.0) + (1.0) \times \Big[ \tfrac{1}{4} \times 0 + \tfrac{3}{4} \times (-1.0) \Big] = -1.75 \approx -1.7$$
  (a quick numerical check follows below)
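A quick check of this value, assuming the standard 4x4 gridworld from the lecture slides (terminals in the top-left and bottom-right corners) and taking the circled state to be the cell immediately to the right of the top-left terminal; both assumptions are mine, read off the figure:

```python
import numpy as np

# Assumed layout: 4x4 grid, states 0..15 row-major, terminals at 0 and 15,
# reward -1 per step, gamma = 1, uniform random policy over {up, down, left, right}.
N, TERMINALS, GAMMA = 4, {0, 15}, 1.0

def next_state(s, a):
    # Deterministic move; stepping off the grid leaves the agent in place.
    r, c = divmod(s, N)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    return min(max(r + dr, 0), N - 1) * N + min(max(c + dc, 0), N - 1)

def sweep(v):
    new_v = np.zeros_like(v)
    for s in range(N * N):
        if s not in TERMINALS:
            # uniform random policy: each action with probability 1/4, reward -1
            new_v[s] = sum(0.25 * (-1.0 + GAMMA * v[next_state(s, a)]) for a in range(4))
    return new_v

v = np.zeros(N * N)   # all zeros
v = sweep(v)          # v_1 in these notes: -1 for every non-terminal state
v = sweep(v)          # v_2 in these notes
print(v[1])           # the cell next to the terminal: -1.75 (shown as -1.7 on the slide)
```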
Policy Iteration
Policy Iteration
- Given a policy $\pi$:
  - Evaluate the policy $\pi$ => $v_\pi(s)$
  - Improve the policy by acting greedily w.r.t. $v_\pi$ => $\pi' = \mathrm{greedy}(v_\pi)$
- Repeating this process converges to $\pi_*$.
Image from: here
- 1) Policy evaluation: iterative policy evaluation
  2) Policy improvement: greedy policy improvement
  (see the sketch below)
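A minimal policy iteration sketch under the same assumed `P`, `R` array layout as the earlier evaluation sketch (illustrative names, not the lecture's code); it alternates full policy evaluation with greedy improvement until the policy stops changing:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_sweeps=200, max_iters=1000):
    """P: (A, S, S) transitions, R: (S, A) rewards. Returns (v, deterministic policy)."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from a uniform policy
    for _ in range(max_iters):
        # 1) policy evaluation: iterative Bellman expectation backups
        v = np.zeros(n_states)
        for _ in range(eval_sweeps):
            q = R + gamma * (P @ v).T                      # q[s, a]
            v = (pi * q).sum(axis=1)
        # 2) policy improvement: act greedily w.r.t. v_pi
        q = R + gamma * (P @ v).T
        new_pi = np.eye(n_actions)[q.argmax(axis=1)]       # one-hot greedy policy
        if np.array_equal(new_pi, pi):                     # policy stable => pi is optimal
            break
        pi = new_pi
    return v, pi.argmax(axis=1)
```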
Policy Improvement
- First, consider a deterministic policy, $a = \pi(s)$ (in state $s$, take action $a$).
- By acting greedily,
$$\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, q_\pi(s,a) = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \Big] = a^*$$
- Then,
$$q_\pi(s, \pi'(s)) = q_\pi(s, a^*) = \max_{a \in \mathcal{A}} q_\pi(s,a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$$
Note: the left-hand side is the value of acting according to $\pi'$ for one step in $s$ and following $\pi$ thereafter.
- It therefore improves the value function
$$\begin{aligned} v_\pi(s) &\le q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\ &\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] \\ &\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s] \\ &\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots \mid S_t = s] = v_{\pi'}(s) \end{aligned}$$
i.e., the greedily updated policy is at least as good as the previous policy!
- If improvements stop, i.e.,
$$q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s,a) = q_\pi(s, \pi(s)) = v_\pi(s), \quad \forall s \in \mathcal{S}$$
i.e., if improvement no longer changes the action-values (e.g. $\pi'(s) = \pi(s)$),
then the Bellman optimality equation has been satisfied, so $\pi$ is an optimal policy, i.e.,
$$v_\pi(s) = \max_{a \in \mathcal{A}} q_\pi(s,a), \ \forall s \in \mathcal{S} \implies v_\pi(s) = v_*(s), \ \forall s \in \mathcal{S}$$
- Recall the necessary and sufficient condition for an optimal policy (from Lecture 2):
  $$v_\pi(s) = \max_{a \in \mathcal{A}} q_\pi(s,a), \ \forall s \in \mathcal{S} \iff \pi \text{ is an optimal policy}$$
Generalized Policy Iteration
- 1) Policy evaluation: any policy evaluation algorithm
  2) Policy improvement: any policy improvement algorithm
- Policy evaluation does not have to be run all the way to convergence to $v_\pi$!
  => stop after k iterations of iterative policy evaluation
  => note that k = 1 gives value iteration (the policy is updated every time $v$ is updated); see the sketch below
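A generalized policy iteration sketch with truncated (k-sweep) evaluation, again using the assumed `P`, `R` layout from the earlier sketches. The value function is carried across iterations, so with `k=1` each evaluation sweep under the previous greedy policy is exactly a Bellman optimality backup, recovering value iteration:

```python
import numpy as np

def generalized_policy_iteration(P, R, gamma=0.9, k=3, n_iters=500):
    """Truncated policy evaluation (k sweeps) alternating with greedy improvement."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    for _ in range(n_iters):
        for _ in range(k):                                 # truncated policy evaluation
            q = R + gamma * (P @ v).T
            v = (pi * q).sum(axis=1)
        q = R + gamma * (P @ v).T                          # greedy policy improvement
        pi = np.eye(n_actions)[q.argmax(axis=1)]
    return v, pi.argmax(axis=1)
```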
Value Iteration
Principle of Optimality Theorem
- An optimal solution can be decomposed into subproblems!
A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, i.e. $v_\pi(s) = v_*(s)$, if and only if:
for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$, i.e. $v_\pi(s') = v_*(s')$
=> A policy that is optimal from every reachable next state is also optimal from the current state (an optimal first action can then be found).
=> Conversely, for $\pi$ to be optimal from $s$, $\pi$ must also be optimal from $s'$, because $v_\pi(s)$ includes the future rewards.
- c.f. Bellman's Principle of Optimality:
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
Deterministic Value Iteration
- This uses the ⟸ direction of the Principle of Optimality theorem.
- Suppose we know the solutions to the subproblems $v_*(s'), \ \forall s' \in \mathcal{S}'$ ($\mathcal{S}'$ is the set of successor states of $s$).
- Then the solution $v_*(s)$ can be found by a one-step lookahead:
  $$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \Big]$$
- Find optimal policy by iterative application of Bellman optimality backup
- Arbitrarily initialize $v_1$
- $v_{k+1}(s) \leftarrow \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big]$
  Applying this to all states simultaneously and iteratively converges to $v_*$.
  In matrix form, $v_{k+1} = \max_{a \in \mathcal{A}} \big[ \mathcal{R}^a + \gamma \mathcal{P}^a v_k \big]$
- $v_1 \to v_2 \to \dots \to v_*$
- Unlike policy iteration, there is no explicit policy
  Because the update takes a max, every step can be seen as implicitly choosing the greedy action.
  (= policy iteration with a 1-step policy evaluation)
- The greedy policy with respect to $v_*$ is an optimal policy (see the sketch below):
  $$\pi_*(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, q_*(s,a) = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \Big]$$
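A minimal value iteration sketch with the same assumed `P`, `R` layout (illustrative, not the lecture's code), including the greedy policy extraction at the end:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_sweeps=1000, tol=1e-8):
    """P: (A, S, S) transitions, R: (S, A) rewards. Returns (v_star, greedy policy)."""
    n_states = R.shape[0]
    v = np.zeros(n_states)                       # arbitrary v_1
    for _ in range(n_sweeps):
        q = R + gamma * (P @ v).T                # one-step lookahead q[s, a]
        new_v = q.max(axis=1)                    # Bellman optimality backup
        if np.max(np.abs(new_v - v)) < tol:      # stop once the backup barely changes v
            v = new_v
            break
        v = new_v
    pi = (R + gamma * (P @ v).T).argmax(axis=1)  # greedy policy w.r.t. v_star
    return v, pi
```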
Policy Iteration vs Value Iteration
- Policy iteration:
- Policy evaluation
    $$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big)$$
    $v_1 \to v_2 \to \dots \to v_\pi$
  - Policy improvement
    $$\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, q_\pi(s,a) = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \Big] = a^*$$
- Value iteration:
  - $v_{k+1}(s) \leftarrow \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big]$
  - $v_1 \to v_2 \to \dots \to v_*$
- Show that policy iteration with 1-step policy evaluation = value iteration:
  - 1-step policy evaluation
    $$v_k(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_{k-1}(s') \Big)$$
  - Policy improvement
    $$\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, q_\pi(s,a) = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big] = a^*$$
  - Again, 1-step policy evaluation:
    $$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi'(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big) = \mathcal{R}_s^{a^*} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a^*} v_k(s') = \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big]$$
    which is exactly the Bellman optimality backup of value iteration.
Example: Shortest Path
Image from: here
- Let $s_r$ and $s_b$ be the states marked with the red and blue circles, respectively. Then
  - $v_3(s_r) = \max_a \big[ \mathcal{R}_{s_r}^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{s_r s'}^a v_2(s') \big] = -1 + 1.0 \times 0 = -1$
  - $v_3(s_b) = \max_a \big[ \mathcal{R}_{s_b}^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{s_b s'}^a v_2(s') \big] = -1 + 1.0 \times (-1) = -2$
Summary of DP Algorithms
Image from: here
- So far everything has been computed with the state-value function ($v_\pi(s)$ or $v_*(s)$).
  - Complexity: $O(mn^2)$ per iteration, for $m$ actions and $n$ states
- The same logic can also be applied to the action-value function ($q_\pi(s,a)$ or $q_*(s,a)$).
  - Complexity: $O(m^2 n^2)$ per iteration
  - We will use this approach later on.
Extensions to Dynamic Programming
Asynchronous Dynamic Programming
- The algorithms above use synchronous backups (all states are backed up in parallel, once per sweep).
- Asynchronous DP backs up states individually, in any order
  - This can significantly reduce computation.
  - Convergence is still guaranteed as long as every state keeps being selected.
- Three simple ideas for asynchronous DP (they differ in how the states to update are chosen):
- In-place dynamic programming
- Prioritised sweeping
- Real-time dynamic programming
In-Place Dynamic Programming
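Synchronous backups keep two copies of the value function (the new sweep is computed entirely from the old one), whereas in-place value iteration keeps only one copy and immediately reuses the freshest values within the same sweep. A minimal sketch, under the same assumed `P`, `R` layout as the earlier sketches:

```python
import numpy as np

def value_iteration_in_place(P, R, gamma=0.9, n_sweeps=100):
    """In-place backups: a single value array, updated state by state, so later
    states in the same sweep already see the newly backed-up values."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s in range(n_states):
            # one-step lookahead for state s only, using the current (partly updated) v
            v[s] = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_actions))
    return v
```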
Prioritised Sweeping
- Introduced to give in-place DP a sensible update order.
- Use the magnitude of the Bellman error to guide state selection!
  $$|v'(s) - v(s)| = \Big| \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s') \Big] - v(s) \Big|$$
- Back up the state with the largest remaining Bellman error (see the sketch below)
- After each backup, the Bellman errors of the backed-up state's predecessor states have to be refreshed, so knowledge of the reverse dynamics (predecessor states) is needed; this can be implemented efficiently with a priority queue.
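A rough sketch of the priority-queue bookkeeping, again under the assumed `P`, `R` layout; predecessor states are read off the transition tensor (illustrative, not the lecture's pseudocode):

```python
import heapq
import numpy as np

def prioritised_sweeping(P, R, gamma=0.9, n_backups=10_000, theta=1e-6):
    """P: (A, S, S), R: (S, A). Always back up the state with the largest Bellman error."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)

    def bellman_error(s):
        backup = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_actions))
        return abs(backup - v[s]), backup

    # predecessors[t] = states that can reach t under some action (reverse dynamics)
    predecessors = [set(np.nonzero(P[:, :, t].sum(axis=0))[0]) for t in range(n_states)]

    heap = [(-bellman_error(s)[0], s) for s in range(n_states)]   # max-heap via negation
    heapq.heapify(heap)

    for _ in range(n_backups):
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:
            break                                  # every remaining error is tiny
        _, v[s] = bellman_error(s)                 # back up the highest-priority state
        for p in predecessors[s]:                  # its predecessors' errors may have changed
            err, _ = bellman_error(p)
            heapq.heappush(heap, (-err, p))        # stale entries are simply re-popped later
    return v
```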
Real-Time Dynamic Programming
- Update only states that the agent actually visited
- After each time-step $S_t, A_t, R_{t+1}$,
  back up the state $S_t$ (a one-function sketch follows below):
  $$v(S_t) \leftarrow \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_{S_t}^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{S_t s'}^a v(s') \Big]$$
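A one-function sketch of that backup, assuming the same NumPy `P`, `R` layout as before (hypothetical names):

```python
def real_time_dp_backup(v, s_t, P, R, gamma=1.0):
    """Apply one Bellman optimality backup at the state the agent just visited."""
    n_actions = R.shape[1]
    v[s_t] = max(R[s_t, a] + gamma * P[a, s_t] @ v for a in range(n_actions))
    return v
```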
Sample backups
Full-width Backups
- DP uses full-width backups
- $v_{k+1}(s) = \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big]$
- Every successor state and action has to be considered to compute each backup.
- For large problems, DP suffers Bellman's curse of dimensionality
Sample Backups
- Using sample rewards and sample transitions ⟨S,A,R,S′⟩
(instead of reward function R and transition dynamics P)
- Advantages:
- Model-free: no advance knowledge of MDP required
- Breaks the curse of dimensionality through sampling
- We will cover these starting from the next lecture!
Contraction Mapping
Contraction Mapping Theorem
For any metric space V that is complete (i.e. closed) under an operator T(v), where T is a γ-contraction, T converges to a unique fixed point at a linear convergence rate of γ
Bellman Expectation Backup is a Contraction
- Define the Bellman expectation backup operator $T^\pi$,
  $$T^\pi(v) = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v$$
- This operator is a $\gamma$-contraction:
  $$\begin{aligned} \|T^\pi(u) - T^\pi(v)\|_\infty &= \|(\mathcal{R}^\pi + \gamma \mathcal{P}^\pi u) - (\mathcal{R}^\pi + \gamma \mathcal{P}^\pi v)\|_\infty \\ &= \|\gamma \mathcal{P}^\pi (u - v)\|_\infty \\ &\le \big\|\gamma \mathcal{P}^\pi \|u - v\|_\infty \big\|_\infty \\ &\le \gamma \|u - v\|_\infty \end{aligned}$$
- This means the operator $T^\pi$ keeps pulling any two vectors $u, v$ closer together.
- In other words, when $0 < \gamma < 1$, repeatedly applying the Bellman expectation backup makes any value-function vector converge to a single point, $v_\pi$ (by the Contraction Mapping Theorem).
- Iterative policy evaluation converges on $v_\pi$! (A small numerical check follows below.)
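A tiny numerical illustration of the contraction property: apply $T^\pi$ to two arbitrary vectors and watch their $\infty$-norm distance shrink by a factor of at least $\gamma$ each step (the random MRP here is made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)   # row-stochastic P^pi
R_pi = rng.random(n)                                                  # arbitrary R^pi

def T_pi(v):
    return R_pi + gamma * P_pi @ v        # Bellman expectation backup

u, v = rng.random(n), rng.random(n)
for _ in range(5):
    print(np.max(np.abs(u - v)))          # distance shrinks by a factor of at least gamma
    u, v = T_pi(u), T_pi(v)
```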
Bellman Optimality Backup is a Contraction
- Define the Bellman optimality backup operator $T^*$,
  $$T^*(v) = \max_{a \in \mathcal{A}} \big[ \mathcal{R}^a + \gamma \mathcal{P}^a v \big]$$
- This operator is a $\gamma$-contraction:
  $$\begin{aligned} \|T^*(u) - T^*(v)\|_\infty &= \Big\|\max_{a \in \mathcal{A}} \big[\mathcal{R}^a + \gamma \mathcal{P}^a u\big] - \max_{a \in \mathcal{A}} \big[\mathcal{R}^a + \gamma \mathcal{P}^a v\big]\Big\|_\infty \\ &\le \max_{a \in \mathcal{A}} \big\|(\mathcal{R}^a + \gamma \mathcal{P}^a u) - (\mathcal{R}^a + \gamma \mathcal{P}^a v)\big\|_\infty \\ &= \max_{a \in \mathcal{A}} \|\gamma \mathcal{P}^a (u - v)\|_\infty \\ &\le \gamma \|u - v\|_\infty \end{aligned}$$
  (using $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$ elementwise)
- Value iteration converges on $v_*$!
If you spot any typos or mistakes, please let me know in the comments!