This post summarizes Lecture 8: Integrating Learning and Planning (Youtube)
from Professor David Silver's Introduction to Reinforcement Learning course (Website).
Introduction
In the previous lectures we covered model-free RL: the policy and value function were learned directly from experience (no model was needed).
In this lecture, we learn a model from experience and then obtain the policy and value function by planning with that model.
Model-Based and Model-Free RL
- Recall: a model in RL describes
1) how states transition to other states (under some action)
2) how states and actions lead to rewards
- Model-Free RL
- No model
- Learn value function (and/or policy) from experience
- Model-Based RL
- Learn a model from experience
- Plan value function (and/or policy) from model
Note: "plan" here means looking ahead with the model.
- "Learned model" can be regarded as a simulated environment
Model-Based Reinforcement Learning
What is a Model?
- A model $\mathcal{M}$ is a representation of an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, parametrized by $\eta$
  (the agent's view of the MDP)
- We will assume the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known
  (of course, these too can be learned with more sophisticated methods)
- A model $\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ represents state transitions $\mathcal{P}_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$:
  $$S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$$
  $$R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$$
- Typically we assume conditional independence between state transitions and rewards:
  $$\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\, \mathbb{P}[R_{t+1} \mid S_t, A_t]$$
- Examples of Models
- Table Lookup Model (hard to scale up)
- Linear Expectation Model
- Linear Gaussian Model
- Gaussian Process Model
- Deep Belief Network Model
- ...
(Any supervised learning method can be used to build a model)
Advantages of Model-Based RL
- Advantages:
- Can efficiently learn model by supervised learning methods
- When the (state × action) space is large and the value function changes sharply with the chosen action (e.g. chess), it is hard to learn the value function / policy directly.
- If the model (i.e. the rules of the game) is comparatively easy to learn (since model learning is supervised learning), then learning becomes feasible, because planning is relatively easier than model-free learning.
- Can reason about model uncertainty
- Since the agent gets information about what it knows and does not know, it can learn more efficiently (by focusing on what it does not know).
- Sometimes more compact
- Sometimes provide useful representation of the environment
- Disadvantages:
- First learn a model, then construct a value function
⟹ Two sources of approximation error
Model Learning
- Goal: estimate $\mathcal{M}_\eta$ from experience $\{S_1, A_1, R_2, \ldots, S_T\}$
- This is a supervised learning problem!
$$S_1, A_1 \to R_2, S_2$$
$$S_2, A_2 \to R_3, S_3$$
$$\vdots$$
$$S_{T-1}, A_{T-1} \to R_T, S_T$$
- Learning $s, a \to r$ is a regression problem (estimating an expectation, e.g. with an MSE loss).
  Learning $s, a \to s'$ is a density estimation problem (estimating a probability distribution, e.g. with a KL-divergence loss).
- Find the parameters $\eta$ that minimize the empirical loss (MSE / KL-divergence).
Table Lookup Model
- The model is an explicit MDP, with $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$:
  $$\hat{\mathcal{P}}^a_{s,s'} = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s')$$
  $$\hat{\mathcal{R}}^a_{s} = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t = s, a)\, R_t$$
  where $N(s,a)$ is the number of visits to each state-action pair $(s, a)$.
- Alternatively,
- At each time step $t$, record the experience tuple $\langle S_t, A_t, R_{t+1}, S_{t+1} \rangle$
- To sample from the model, randomly pick a tuple matching $\langle s, a, \cdot, \cdot \rangle$
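As a rough illustration, here is a minimal Python sketch of a table lookup model (the class and method names are mine, not from the lecture). It counts visits to estimate $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$ as above, and also stores experience tuples so it can be sampled as in the alternative formulation.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Table lookup model: counts visits and estimates P_hat and R_hat."""

    def __init__(self):
        self.counts = defaultdict(int)        # N(s, a)
        self.next_counts = defaultdict(int)   # N(s, a, s')
        self.reward_sum = defaultdict(float)  # sum of rewards observed for (s, a)
        self.experience = defaultdict(list)   # (s, a) -> list of (r, s') tuples

    def update(self, s, a, r, s_next):
        """Record one real transition <S_t, A_t, R_{t+1}, S_{t+1}>."""
        self.counts[(s, a)] += 1
        self.next_counts[(s, a, s_next)] += 1
        self.reward_sum[(s, a)] += r
        self.experience[(s, a)].append((r, s_next))

    def transition_prob(self, s, a, s_next):
        """P_hat(s' | s, a) = N(s, a, s') / N(s, a)."""
        n = self.counts[(s, a)]
        return self.next_counts[(s, a, s_next)] / n if n else 0.0

    def expected_reward(self, s, a):
        """R_hat(s, a) = average reward observed for (s, a)."""
        n = self.counts[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0

    def sample(self, s, a):
        """Alternative use: sample a stored tuple matching <s, a, ., .>."""
        return random.choice(self.experience[(s, a)])
```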
Planning with a Model
- Given a model $\mathcal{M}_\eta = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$,
  solve the MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
using planning algorithms such as:
- Value iteration
- Policy iteration
- Tree search
- ...
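For example, here is a small value-iteration sketch that plans over the learned table lookup model sketched above (it assumes the `transition_prob` / `expected_reward` interface from that sketch, which is my own naming, not the lecture's):

```python
def value_iteration(model, states, actions, gamma=0.9, theta=1e-6):
    """Solve the approximate MDP <S, A, P_eta, R_eta> by value iteration."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = [
                model.expected_reward(s, a)
                + gamma * sum(model.transition_prob(s, a, s2) * V[s2] for s2 in states)
                for a in actions
            ]
            v_new = max(q_values)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:       # stop once the value function has converged
            return V
```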
- Sample-Based Planning
Generate lots of samples with the model and apply RL to them!
- Use the model only to generate samples:
$$S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$$
$$R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$$
- Apply model-free RL to those samples, e.g.:
- Monte-Carlo control
- Sarsa
- Q-learning
- This is a good approach as long as the model is accurate.
  However, unseen transitions/rewards cannot be learned, so it does not work well when the model is incomplete.
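A minimal sketch of sample-based planning, assuming the model exposes a `sample(s, a)` method like the table lookup sketch above; Q-learning stands in here for any model-free method applied to the simulated experience.

```python
import random
from collections import defaultdict

def sample_based_planning(model, states, actions, n_updates=10000,
                          alpha=0.1, gamma=0.9, epsilon=0.1):
    """Plan by applying Q-learning to transitions sampled from the model.
    Assumes model.sample(s, a) returns (reward, next_state) and that every
    (s, a) pair we visit has been seen by the model at least once."""
    Q = defaultdict(float)                       # Q[(s, a)], defaults to 0
    s = random.choice(states)
    for _ in range(n_updates):
        # epsilon-greedy action w.r.t. the current Q estimates
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        # the learned model plays the role of a simulated environment
        r, s_next = model.sample(s, a)
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q
```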
- Planning with an inaccurate model
- Given an imperfect model $\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle \neq \langle \mathcal{P}, \mathcal{R} \rangle$,
  the performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
- When the model is inaccurate, the planning process will compute a suboptimal policy
- Solution
1) Use model-free RL
2) Reason explicitly about model uncertainty (e.g. a Bayesian approach)
Integrated Architectures
Dyna
- We consider two sources of experience
- Real experience (Sampled from environment, true MDP)
$$S' \sim \mathcal{P}^a_{s,s'}$$
$$R = \mathcal{R}^a_s$$
- Simulated experience (Sampled from model, approximate MDP)
$$S' \sim \mathcal{P}_\eta(S' \mid S, A)$$
$$R = \mathcal{R}_\eta(R \mid S, A)$$
- Dyna, integrating learning and planning
- Learn a model from real experience
- Learn and plan value function (and/or policy) from real and simulated experience
- Having the advantages of both approaches at once makes it more efficient!
- Dyna-Q algorithm
Note: for each step of real experience, Q and the model are each updated once; then the model is used to generate n simulated samples, on which Q is updated further.
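A compact sketch of Dyna-Q along these lines; the environment interface (`env.reset()` / `env.step(a)` returning `(next_state, reward, done)`) and the deterministic table model are simplifying assumptions of this sketch, not details from the lecture.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_planning=10, episodes=100,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-Q: one-step Q-learning on real experience plus n planning updates
    per step, using a deterministic table model of seen (s, a) pairs."""
    Q = defaultdict(float)                       # Q[(s, a)]
    model = {}                                   # model[(s, a)] = (r, s')

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)                  # (a) real experience
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])      # (b) direct RL update
            model[(s, a)] = (r, s_next)                    # (c) model learning
            for _ in range(n_planning):                    # (d) planning updates
                ps, pa = random.choice(list(model))        # previously seen (s, a)
                pr, ps_next = model[(ps, pa)]
                ptarget = pr + gamma * max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```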
Simulation-Based Search
Planning via search when the model is known.
Forward Search
- Select best action by look-ahead
- Build a search tree with the current state $s_t$ at the root,
  using a model of the MDP to look ahead
  (i.e., use the model to build a search tree that starts from the current state $s_t$)
- By solving only the sub-MDP that starts from the current state, we no longer need to solve the entire MDP.
  (Solving the entire MDP is generally harder and requires far more resources.)
Simulation-Based Search
- Use the model to generate simulated episodes starting from $s_t$, and apply model-free RL to them.
- Sampling
  $$\{ s_t, A_t^k, R_{t+1}^k, \ldots, S_T^k \}_{k=1}^{K} \sim \mathcal{M}_\nu$$
- Model-free RL
- Monte-Carlo control → Monte-Carlo search
- Sarsa → TD search
Simple Monte-Carlo Search
- Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$
- For each action $a \in \mathcal{A}$
  - Simulate $K$ episodes from the current (real) state $s_t$ using the simulation policy $\pi$
    $$\{ s_t, a, R_{t+1}^k, S_{t+1}^k, A_{t+1}^k, \ldots, S_T^k \}_{k=1}^{K} \sim \mathcal{M}_\nu, \pi$$
  - Evaluate actions by their mean return (Monte-Carlo evaluation)
    $$q_\pi(s_t, a) \approx Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} G_t$$
    Note: $q_\pi$ is the true action-value function
- Select the current (real) action with maximum value
  $$a_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q(s_t, a)$$
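A minimal sketch of simple Monte-Carlo search, assuming a model with a `sample(s, a)` method and a simulation policy given as a function `pi(s)` (both names are illustrative); episodes are truncated at a fixed depth rather than run to a terminal state.

```python
def simple_mc_search(model, pi, actions, s_t, K=100, gamma=1.0, max_depth=50):
    """Simple MC search: evaluate each action from s_t by the mean return of
    K simulated episodes, then pick the best action."""
    def rollout(s, a):
        # simulate one episode: take a first, then follow the simulation policy pi
        G, discount = 0.0, 1.0
        for _ in range(max_depth):
            r, s = model.sample(s, a)
            G += discount * r
            discount *= gamma
            a = pi(s)
        return G

    Q = {a: sum(rollout(s_t, a) for _ in range(K)) / K for a in actions}
    return max(Q, key=Q.get)        # a_t = argmax_a Q(s_t, a)
```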
Monte-Carlo Tree Search
(In the simple MC search above we generated K episodes per action; here we build a tree instead.)
- MCTS algorithm
- Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$
- Simulate $K$ episodes from the current (real) state $s_t$ using the simulation policy $\pi$
  $$\{ s_t, A_t^k, R_{t+1}^k, S_{t+1}^k, A_{t+1}^k, \ldots, S_T^k \}_{k=1}^{K} \sim \mathcal{M}_\nu, \pi$$
- Build a search tree containing visited states and actions so far
- Evaluate $Q(s,a)$ by the mean return of episodes from $(s, a)$
  $$q_\pi(s,a) \approx Q(s,a) = \frac{1}{N(s,a)} \sum_{k=1}^{K} \sum_{u=t}^{T} \mathbf{1}(S_u, A_u = s, a)\, G_u$$
  where $G_u$ is the return following $(S_u, A_u)$ (i.e., the return accumulated below the node corresponding to $(s, a)$ in the tree)
  Note: in the simple MC search above, $Q$ was obtained only for the current state $s_t$; here, $Q$ is computed for every $(s, a)$ visited after $s_t$.
  Note 2: the tree acts as a kind of memory.
- After the search is finished, select the current (real) action with maximum value in the search tree
  $$a_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q(s_t, a)$$
  Note: the reason for building a tree is that a state (say $S' = s_t$) can be encountered again at later steps, so its $Q$ estimate can be reused, making the estimation more efficient.
- In other words,
- Each simulation consists of two phases (in-tree, out-of-tree)
- out-of-tree: pick actions randomly (default policy; fixed; exploration)
- in-tree: pick actions to maximize $Q(S,A)$ (tree policy; improves; exploitation)
- Simulate K episodes with default policy
⇒ Evaluate state Q(S,A) by Monte-Carlo evaluation
⇒ Improve tree policy, e.g., by ϵ-greedy
Note: this converges to the optimal search tree, i.e., $Q(S,A) \to q_*(S,A)$
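A rough sketch of this two-phase structure (an ε-greedy tree policy inside the tree, a random default policy outside it). All names are illustrative, and practical refinements such as UCB exploration and terminal-state handling are omitted.

```python
import random
from collections import defaultdict

def mc_tree_search(model, actions, s_t, K=1000, gamma=1.0,
                   epsilon=0.1, max_depth=50):
    """MCTS sketch: epsilon-greedy tree policy, uniform random default policy,
    Monte-Carlo backups; episodes are truncated at max_depth."""
    Q = defaultdict(float)          # Q(s, a) estimates for visited pairs
    N = defaultdict(int)            # N(s, a) visit counts
    tree = set()                    # states contained in the search tree

    for _ in range(K):
        s, path, rewards, new_node = s_t, [], [], None
        for _ in range(max_depth):                        # one simulated episode
            if s in tree:
                # tree policy: epsilon-greedy w.r.t. the current Q (exploitation)
                a = (random.choice(actions) if random.random() < epsilon
                     else max(actions, key=lambda b: Q[(s, b)]))
            else:
                if new_node is None:
                    new_node = s                          # first out-of-tree state
                a = random.choice(actions)                # default policy (exploration)
            r, s_next = model.sample(s, a)
            path.append((s, a))
            rewards.append(r)
            s = s_next
        if new_node is not None:
            tree.add(new_node)                            # grow the tree by one node
        # Monte-Carlo backup: update Q(s, a) toward the return following (s, a)
        G = 0.0
        for (s_u, a_u), r_u in zip(reversed(path), reversed(rewards)):
            G = r_u + gamma * G
            N[(s_u, a_u)] += 1
            Q[(s_u, a_u)] += (G - Q[(s_u, a_u)]) / N[(s_u, a_u)]

    return max(actions, key=lambda a: Q[(s_t, a)])        # real action a_t
```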
- Advantages of MC Tree Search
- Highly selective best-first search
- Evaluate states dynamically (only states reachable from the current state are evaluated,
  unlike DP, which evaluates the entire state space)
- Uses sampling to break curse of dimensionality
- Works for "black-box" models (only requires samples)
- Computationally efficient, anytime, parallelizable
Temporal-Difference Search
Use TD (bootstrapping) instead of MC for control
- MC tree search applies MC control to sub-MDP from now
TD search applies Sarsa to sub-MDP from now
- TD search algorithm
- Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$
- Simulate $K$ episodes from the current (real) state $s_t$ using the simulation policy $\pi$
  $$\{ s_t, A_t^k, R_{t+1}^k, S_{t+1}^k, A_{t+1}^k, \ldots, S_T^k \}_{k=1}^{K} \sim \mathcal{M}_\nu, \pi$$
- Build a search tree containing the visited states and actions so far
- For each step of the simulation, update $Q(s,a)$ by Sarsa
  $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)$$
- Select actions based on the action-value function $Q$
  $$a_t = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q(s_t, a)$$
- TD has lower variance (but has bias), and thus more efficient than MC
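Under the same assumptions as the MCTS sketch above, TD search replaces the Monte-Carlo backup with a one-step Sarsa update applied at every simulated step:

```python
import random
from collections import defaultdict

def td_search(model, actions, s_t, K=1000, alpha=0.1, gamma=1.0,
              epsilon=0.1, max_depth=50):
    """TD search sketch: simulate episodes from s_t and update Q with Sarsa."""
    Q = defaultdict(float)

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda b: Q[(s, b)])

    for _ in range(K):
        s = s_t
        a = epsilon_greedy(s)
        for _ in range(max_depth):                     # one simulated episode
            r, s_next = model.sample(s, a)
            a_next = epsilon_greedy(s_next)
            # Sarsa update at every step of the simulation (bootstrapping)
            td_target = r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])
            s, a = s_next, a_next

    return max(actions, key=lambda b: Q[(s_t, b)])     # real action a_t
```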
Dyna-2
- In Dyna-2, the agent stores two sets of feature weights
- Long-term memory:
  updated from real experience using TD learning;
  learns general domain knowledge that applies to any episode
- Short-term (working) memory:
  updated from simulated experience using TD search;
  learns specific local knowledge about the current situation
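As a very rough sketch of the idea, assuming a linear value function in which the search value sums the two memories (roughly following the Dyna-2 paper; all names and details here are illustrative, and for simplicity both memories share one feature vector):

```python
import numpy as np

class Dyna2Values:
    """Sketch of Dyna-2's two memories with a linear value function."""

    def __init__(self, n_features, alpha=0.1, gamma=0.99):
        self.theta_long = np.zeros(n_features)    # long-term memory weights
        self.theta_short = np.zeros(n_features)   # short-term (working) memory weights
        self.alpha, self.gamma = alpha, gamma

    def q_long(self, phi_sa):
        """Permanent value: long-term memory only."""
        return phi_sa @ self.theta_long

    def q_search(self, phi_sa):
        """Search value: long-term plus short-term memory."""
        return phi_sa @ (self.theta_long + self.theta_short)

    def td_learning(self, phi_sa, r, phi_next):
        """Real experience: TD update of the long-term weights."""
        delta = r + self.gamma * self.q_long(phi_next) - self.q_long(phi_sa)
        self.theta_long += self.alpha * delta * phi_sa

    def td_search(self, phi_sa, r, phi_next):
        """Simulated experience: TD update of the short-term weights,
        evaluated with the combined search value."""
        delta = r + self.gamma * self.q_search(phi_next) - self.q_search(phi_sa)
        self.theta_short += self.alpha * delta * phi_sa

    def reset_short_term(self):
        """Short-term memory holds local knowledge about the current situation,
        so it can be cleared before searching from a new real state."""
        self.theta_short[:] = 0.0
```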
- See the paper for details.
If you notice any typos or mistakes, please let me know in the comments!