Silver RL (8) Integrating Learning and Planning

Sanghyeok Choi · January 18, 2022

This post summarizes Lecture 8: Integrating Learning and Planning (Youtube) from David Silver's Introduction to Reinforcement Learning (Website).

Introduction

The previous lectures covered model-free RL: the policy and value function were learned directly from experience, with no model required.
This lecture learns a model from experience and then uses that model for planning to obtain a policy and value function.

Model-Based and Model-Free RL

  • Recall, a model in RL describes
    1) how states transition to other states (under some action)
    2) how states & actions lead to rewards
  • Model-Free RL
    • No model
    • Learn value function (and/or policy) from experience
  • Model-Based RL
    • Learn a model from experience
    • Plan value function (and/or policy) from model
      Note: Plan ... look-ahead with model

Image from: here

  • "Learned model" can be regarded as a simulated environment

Model-Based Reinforcement Learning

Image from: here

What is a Model?

  • A model $\mathcal{M}$ is a representation of an MDP $\langle \mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R} \rangle$, parametrized by $\eta$
    Agent's view of the MDP
  • We will assume the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known
    (Of course, with more sophisticated methods these can be learned as well.)
    A model $\mathcal{M}=\langle\mathcal{P}_\eta,\mathcal{R}_\eta\rangle$ represents state transitions $\mathcal{P}_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$
    $S_{t+1} \sim \mathcal{P}_\eta(S_{t+1}|S_t,A_t)$
    $R_{t+1} = \mathcal{R}_\eta(R_{t+1}|S_t,A_t)$
  • Typically we assume conditional independence between state transitions and rewards
    $\mathbb{P}[S_{t+1}, R_{t+1}|S_t,A_t]=\mathbb{P}[S_{t+1}|S_t,A_t]\,\mathbb{P}[R_{t+1}|S_t,A_t]$
  • Examples of Models
    • Table Lookup Model (hard to scale up)
    • Linear Expectation Model
    • Linear Gaussian Model
    • Gaussian Process Model
    • Deep Belief Network Model
    • ...
      (Any supervised learning method can be used to build a model)

Advantages of Model-Based RL

  • Advantages:
    • Can efficiently learn model by supervised learning methods
      • When the (state $\times$ action) space is large and the value function changes drastically with the action (e.g. chess), it is hard to learn the value function / policy directly.
      • If the model (= the rules of the game) is comparatively easy to learn ($\because$ supervised learning), learning becomes feasible, since planning is comparatively easier than model-free learning.
    • Can reason about model uncertainty
      • Knowing what the agent does and does not know enables more efficient learning (focus learning on what is unknown!).
    • Sometimes more compact
    • Sometimes provide useful representation of the environment
  • Disadvantages:
    • First learn a model, then construct a value function
      $\implies$ Two sources of approximation error

Model Learning

  • Goal: estimate $\mathcal{M}_\eta$ from experience $\{S_1,A_1,R_2,...,S_T\}$
  • This is a supervised learning problem!
    $S_1, A_1 \to R_2, S_2$
    $S_2, A_2 \to R_3, S_3$
    $\vdots$
    $S_{T-1}, A_{T-1} \to R_T, S_T$
  • Learning $s, a \to r$ is a regression problem (expectation ... MSE)
    Learning $s, a \to s'$ is a density estimation problem (probability ... KL-divergence)
  • Find parameters $\eta$ that minimize the empirical loss function (MSE / KL-divergence).
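
As a concrete illustration, here is a minimal Python sketch of turning one episode into supervised-learning pairs. The `episode_to_pairs` helper and its list-based episode format are my own assumptions, not something from the lecture.

```python
# A minimal sketch of framing model learning as supervised learning:
# slice an episode S_1, A_1, R_2, S_2, ..., S_T into training pairs (s, a) -> (r, s').

def episode_to_pairs(states, actions, rewards):
    """states = [S_1, ..., S_T], actions = [A_1, ..., A_{T-1}], rewards = [R_2, ..., R_T]."""
    pairs = []
    for t in range(len(actions)):
        s, a = states[t], actions[t]
        r, s_next = rewards[t], states[t + 1]
        pairs.append(((s, a), (r, s_next)))  # input (s, a), targets (r, s')
    return pairs

# Fit any supervised learner on these pairs:
#   reward model   R_eta: (s, a) -> E[r]          (regression, MSE loss)
#   dynamics model P_eta: (s, a) -> p(s' | s, a)  (density estimation, KL loss)
```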

Table Lookup Model

  • The model is an explicit MDP, with $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$
    $\hat{\mathcal{P}}^{a}_{s,s'}=\cfrac{1}{N(s,a)}\sum\limits_{t=1}^T\mathbf{1}(S_t,A_t,S_{t+1}=s,a,s')$
    $\hat{\mathcal{R}}^{a}_s=\cfrac{1}{N(s,a)}\sum\limits_{t=1}^T\mathbf{1}(S_t,A_t=s,a)R_t$
    where $N(s,a)$ is the number of visits to each state-action pair.
  • Alternatively (see the sketch after the example below),
    • At each time step t, record the experience tuple
      $\langle S_t, A_t, R_{t+1}, S_{t+1} \rangle$
    • To sample from the model, randomly pick a tuple matching $\langle s, a, \cdot, \cdot \rangle$
  • Example:
    Image from: here
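
Below is a minimal sketch of the "alternatively" variant above: store experience tuples per $(s,a)$ and sample from them, which is equivalent to sampling from $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$. The class and method names are hypothetical.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Minimal table-lookup model: record tuples and sample one matching (s, a)."""

    def __init__(self):
        self.experience = defaultdict(list)      # (s, a) -> [(r, s'), ...]

    def update(self, s, a, r, s_next):
        self.experience[(s, a)].append((r, s_next))

    def sample(self, s, a):
        """Draw (r, s') as if sampling from P-hat and R-hat."""
        return random.choice(self.experience[(s, a)])

    def expected_reward(self, s, a):
        """R-hat(s, a): mean of the observed rewards for (s, a)."""
        rewards = [r for r, _ in self.experience[(s, a)]]
        return sum(rewards) / len(rewards)
```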

Planning with a Model

  • Given a model $\mathcal{M}_\eta=\langle\mathcal{P}_\eta,\mathcal{R}_\eta\rangle$,
    solve the MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
    using planning algorithms such as:
    • Value iteration
    • Policy iteration
    • Tree search
    • ...
  • Sample-Based Planning
    Generate lots of samples with the model and apply RL to them! (A code sketch follows this list.)
    • Use the model only to generate samples:
      $S_{t+1} \sim \mathcal{P}_\eta(S_{t+1}|S_t,A_t)$
      $R_{t+1} = \mathcal{R}_\eta(R_{t+1}|S_t,A_t)$
    • Apply model-free RL to those samples, e.g.:
      • Monte-Carlo control
      • Sarsa
      • Q-learning
    • This works well as long as the model is accurate.
      Unseen transitions/rewards cannot be learned, so it is not good when the model is incomplete.
  • Planning with an inaccurate model
    • Given an imperfect model $\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle \neq \langle \mathcal{P}, \mathcal{R} \rangle$,
      the performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
    • When the model is inaccurate, the planning process will compute a suboptimal policy
    • Solutions:
      1) Use model-free RL
      2) Reason explicitly about model uncertainty (e.g. a Bayesian approach)
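
A minimal sketch of sample-based planning, assuming the `TableLookupModel.sample` interface sketched earlier and using Q-learning as the model-free learner (one of the options listed above). The `visited_pairs` argument is an assumption for illustration.

```python
import random
from collections import defaultdict

def sample_based_planning(model, visited_pairs, actions, n_samples,
                          alpha=0.1, gamma=0.9):
    """visited_pairs: (s, a) pairs the model has data for (an assumption)."""
    Q = defaultdict(float)
    for _ in range(n_samples):
        s, a = random.choice(visited_pairs)
        r, s_next = model.sample(s, a)                     # S' ~ P_eta, R = R_eta
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])          # Q-learning update
    return Q
```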

Integrated Architectures

Dyna

  • We consider two sources of experience
    • Real experience (sampled from the environment, the true MDP)
      $S' \sim \mathcal{P}^a_{s,s'}$
      $R=\mathcal{R}^a_s$
    • Simulated experience (sampled from the model, the approximate MDP)
      $S' \sim \mathcal{P}_\eta(S'|S,A)$
      $R = \mathcal{R}_\eta(R|S,A)$
  • Dyna, integrating learning and planning
    • Learn a model from real experience
    • Learn and plan value function (and/or policy) from real and simulated experience
      Image from: here
    • It combines the advantages of both approaches, making it more efficient!
  • Dyna-Q algorithm (a code sketch follows below)
    Image from: here
    Note: each real transition is used to update $Q$ and the model once each; then n simulated transitions are generated from the model to update $Q$ further.
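
A minimal tabular Dyna-Q sketch. The `env.reset()` / `env.step(s, a) -> (r, s_next, done)` interface and the deterministic table-lookup model are assumptions made for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes, n=10, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s'): table-lookup model
    visited = []                                 # (s, a) pairs seen in real experience
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action from the current Q
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            r, s_next, done = env.step(s, a)
            # (a) direct RL: one Q-learning update from the real transition
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (b) model learning: one update of the (deterministic) model
            if (s, a) not in model:
                visited.append((s, a))
            model[(s, a)] = (r, s_next)
            # (c) planning: n extra Q updates from simulated transitions
            for _ in range(n):
                ps, pa = random.choice(visited)
                pr, ps_next = model[(ps, pa)]
                ptarget = pr + gamma * max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```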

Forward Search

Planning via search when the model is known.

  • Select the best action by look-ahead
  • Build a search tree with the current state $s_t$ at the root,
    using a model of the MDP to look ahead
    (i.e., use the model to build a search tree rooted at the current state $s_t$.)
    Image from: here
  • By solving the sub-MDP starting from the current state, there is no need to solve the entire MDP
    (solving the entire MDP is generally harder and takes far more resources).

Simulation-Based Search

  • Use the model to generate simulated episodes starting from $s_t$, then apply model-free RL to them.
    • Sampling
      $\{s^k_t, A^k_t, R^k_t,...,S^k_T\}^K_{k=1} \sim \mathcal{M}_\nu$
    • Model-free RL
      • Monte-Carlo control $\to$ Monte-Carlo search
      • Sarsa $\to$ TD search

Simple Monte-Carlo Search

  • Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$
  • For each action $a\in\mathcal{A}$
    • Simulate K episodes from the current (real) state $s_t$ using the simulation policy $\pi$
      $\{s_t, a, R_{t+1}^k,S_{t+1}^k,A_{t+1}^k,...,S_T^k\}^K_{k=1}\sim\mathcal{M}_\nu,\pi$
    • Evaluate each action by its mean return (Monte-Carlo evaluation)
      $q_\pi(s_t,a) \approx Q(s_t, a)=\cfrac{1}{K}\sum\limits_{k=1}^K G_t$
      Note: $q_\pi$ is the real action-value function
  • Select the current (real) action with the maximum value (a code sketch follows)
    $a_t=\argmax\limits_{a\in\mathcal{A}}Q(s_t,a)$
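
A minimal sketch of the simple MC search above. `model.sample(s, a)`, `terminal(s)`, and `sim_policy` (standing in for the simulation policy $\pi$) are assumed interfaces, not part of the lecture.

```python
def simple_mc_search(model, s_t, actions, sim_policy, K=50, gamma=1.0,
                     max_depth=100, terminal=lambda s: False):
    def rollout(s, a):
        """Simulate one episode that starts with action a; return its return G_t."""
        G, discount = 0.0, 1.0
        for _ in range(max_depth):
            r, s = model.sample(s, a)
            G += discount * r
            discount *= gamma
            if terminal(s):
                break
            a = sim_policy(s)                    # continue with the simulation policy
        return G
    # Q(s_t, a): mean return over K simulated episodes per candidate action
    Q = {a: sum(rollout(s_t, a) for _ in range(K)) / K for a in actions}
    return max(Q, key=Q.get)                     # a_t = argmax_a Q(s_t, a)
```

For example, a uniformly random simulation policy would be `sim_policy = lambda s: random.choice(actions)`.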

Monte-Carlo Tree Search (MCTS)

(In the Simple MC search above, K episodes were simulated per action; here we build a tree instead.)

  • MCTS algorithm (a code sketch follows at the end of this section)
    • Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$
    • Simulate K episodes from the current (real) state $s_t$ using the simulation policy $\pi$
      $\{s_t, A_t^k, R_{t+1}^k,S_{t+1}^k,A_{t+1}^k,...,S_T^k\}^K_{k=1}\sim\mathcal{M}_\nu,\pi$
    • Build a search tree containing the states and actions visited so far
    • Evaluate $Q(s,a)$ by the mean return of episodes from $s, a$
      $q_\pi(s,a) \approx Q(s,a)=\cfrac{1}{N(s,a)}\sum\limits_{k=1}^K\sum\limits_{u=t}^T\mathbf{1}(S_u,A_u=s,a)G_u$
      where $G_u$ is the return from $S_u, A_u$ onwards (i.e., the return accumulated below the tree node corresponding to $(s,a)$).
      Note: in the Simple MC search above, $Q$ was only estimated for the current state $s_t$; here $Q$ is computed for every $(s,a)$ visited after $s_t$.
      Note 2: the tree acts as a kind of memory.
    • After the search is finished, select the current (real) action with the maximum value in the search tree
      $a_t=\argmax\limits_{a\in\mathcal{A}}Q(s_t,a)$
      Note: the reason for building a tree is that a state $S'$ (including $s_t$ itself) can reappear at steps after t, so $Q$ can be estimated more efficiently by pooling those visits.
  • In other words,
    • Each simulation consists of two phases (in-tree, out-of-tree)
      • In-tree: pick actions to maximize $Q(S, A)$ (tree policy, improves, exploitation)
      • Out-of-tree: pick actions randomly (default policy, fixed, exploration)
    • Simulate K episodes with the default policy
      $\Rightarrow$ Evaluate states $Q(S,A)$ by Monte-Carlo evaluation
      $\Rightarrow$ Improve the tree policy, e.g., by $\epsilon$-greedy
      Note: this converges on the optimal search tree, i.e., $Q(S,A) \to q_*(S,A)$
  • Advantages of MC Tree Search
    • Highly selective best-first search
    • Evaluates states dynamically (evaluates the current state,
      whereas DP evaluates the entire state space)
    • Uses sampling to break the curse of dimensionality
    • Works for "black-box" models (only requires samples)
    • Computationally efficient, anytime, parallelizable
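
A minimal MCTS sketch along the lines above, with an $\epsilon$-greedy tree policy and a uniformly random default policy (the lecture leaves the tree policy open; UCT is the usual choice in practice). `model.sample` and `terminal` are the same assumed interfaces as before, and the whole simulated path is added to the tree, matching the "containing visited states and actions" description above.

```python
import random
from collections import defaultdict

def mc_tree_search(model, s_t, actions, K=200, gamma=1.0, eps=0.1,
                   max_depth=100, terminal=lambda s: False):
    Q, N = defaultdict(float), defaultdict(int)
    tree = {s_t}                                  # states contained in the search tree
    for _ in range(K):
        s, path, rewards = s_t, [], []
        for _ in range(max_depth):
            if terminal(s):
                break
            if s in tree and random.random() > eps:
                a = max(actions, key=lambda b: Q[(s, b)])   # in-tree: tree policy
            else:
                a = random.choice(actions)                  # out-of-tree: default policy
            r, s_next = model.sample(s, a)
            path.append((s, a))
            rewards.append(r)
            tree.add(s_next)                      # grow the tree with visited states
            s = s_next
        # Monte-Carlo backup: running mean of returns below every visited (s, a) node
        G = 0.0
        for (s_u, a_u), r_u in zip(reversed(path), reversed(rewards)):
            G = r_u + gamma * G
            N[(s_u, a_u)] += 1
            Q[(s_u, a_u)] += (G - Q[(s_u, a_u)]) / N[(s_u, a_u)]
    return max(actions, key=lambda a: Q[(s_t, a)])          # real action a_t
```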

TD Search

Use TD (bootstrapping) instead of MC for control.

  • MC tree search applies MC control to the sub-MDP from now
    TD search applies Sarsa to the sub-MDP from now
  • TD search algorithm (a code sketch follows below)
    • Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$
    • Simulate K episodes from the current (real) state $s_t$ using the simulation policy $\pi$
      $\{s_t, A_t^k, R_{t+1}^k,S_{t+1}^k,A_{t+1}^k,...,S_T^k\}^K_{k=1}\sim\mathcal{M}_\nu,\pi$
    • Build a search tree containing the states and actions visited so far
    • For each step of a simulation, evaluate $Q(s,a)$ by Sarsa
      $Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha(R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t))$
    • Select actions based on the action-value function $Q$
      $a_t=\argmax\limits_{a\in\mathcal{A}}Q(s_t,a)$
  • TD has lower variance (but some bias), and is thus more efficient than MC
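
A minimal TD-search sketch: the same simulation loop as the MCTS sketch above, but with a Sarsa update at every simulated step and an $\epsilon$-greedy simulation policy over the current $Q$. Again, `model.sample` and `terminal` are assumed interfaces.

```python
import random
from collections import defaultdict

def td_search(model, s_t, actions, K=200, alpha=0.1, gamma=1.0, eps=0.1,
              max_depth=100, terminal=lambda s: False):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda b: Q[(s, b)])

    for _ in range(K):
        s, a = s_t, eps_greedy(s_t)
        for _ in range(max_depth):
            r, s_next = model.sample(s, a)
            a_next = eps_greedy(s_next)
            # Sarsa update on the simulated transition (bootstraps from Q(s', a'))
            bootstrap = 0.0 if terminal(s_next) else Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
            if terminal(s_next):
                break
            s, a = s_next, a_next
    return max(actions, key=lambda a: Q[(s_t, a)])   # real action a_t
```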

Dyna-2

  • In Dyna-2, the agent stores two sets of feature weights
    • Long-term memory:
      updated from real experience using TD learning;
      learns general domain knowledge that applies to any episode
    • Short-term (working) memory:
      updated from simulated experience using TD search;
      learns specific local knowledge about the current situation
    • See the Dyna-2 paper for details.

If you spot any typos or mistakes, please let me know in the comments!
