Silver RL (8) Integrating Learning and Planning

Sanghyeok Choi · January 18, 2022

This post summarizes Lecture 8: Integrating Learning and Planning (Youtube) from David Silver's Introduction to Reinforcement Learning (Website).

Introduction

The previous lectures covered model-free RL: the policy and value function were learned directly from experience, with no model required.
This lecture learns a model from experience and then uses that model for planning to obtain a policy and value function.

Model-Based and Model-Free RL

  • Recall, a model in RL describes
    1) how states transition to other states (under some action)
    2) how states & actions lead to rewards
  • Model-Free RL
    • No model
    • Learn value function (and/or policy) from experience
  • Model-Based RL
    • Learn a model from experience
    • Plan value function (and/or policy) from model
      Note: Plan ... look-ahead with model

Image from: here

  • "Learned model" can be regarded as a simulated environment

Model-Based Reinforcement Learning

Image from: here

What is a Model?

  • A model $\mathcal{M}$ is a representation of an MDP $\langle \mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R} \rangle$, parametrized by $\eta$
    Agent's view of the MDP
  • We will assume the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known
    (Of course, with more sophisticated methods these can be learned as well.)
    A model $\mathcal{M}=\langle\mathcal{P}_\eta,\mathcal{R}_\eta\rangle$ represents state transitions $\mathcal{P}_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$
    $S_{t+1} \sim \mathcal{P}_\eta(S_{t+1}|S_t,A_t)$
    $R_{t+1} = \mathcal{R}_\eta(R_{t+1}|S_t,A_t)$
  • Typically we assume conditional independence between state transitions and rewards
    $\mathbb{P}[S_{t+1}, R_{t+1}|S_t,A_t]=\mathbb{P}[S_{t+1}|S_t,A_t]\,\mathbb{P}[R_{t+1}|S_t,A_t]$
  • Examples of Models
    • Table Lookup Model (hard to scale up)
    • Linear Expectation Model
    • Linear Gaussian Model
    • Gaussian Process Model
    • Deep Belief Network Model
    • ...
      (Any supervised learning method can be used to build a model)

Advantages of Model-Based RL

  • Advantages:
    • Can efficiently learn model by supervised learning methods
      • When the (state $\times$ action) space is large and the value function changes drastically with the action (e.g. chess), it is hard to learn the value function / policy directly.
      • If the model (= the rules of the game) is comparatively easy to learn ($\because$ supervised learning), learning becomes feasible, since planning is comparatively easier than model-free learning.
    • Can reason about model uncertainty
      • Knowing what the agent does and does not know enables more efficient learning (focus learning on what is unknown!).
    • Sometimes more compact
    • Sometimes provide useful representation of the environment
  • Disadvantages:
    • First learn a model, then construct a value function
      $\implies$ Two sources of approximation error

Model Learning

  • Goal: estimate $\mathcal{M}_\eta$ from experience $\{S_1,A_1,R_2,...,S_T\}$
  • This is a supervised learning problem!
    $S_1, A_1 \to R_2, S_2$
    $S_2, A_2 \to R_3, S_3$
    $\vdots$
    $S_{T-1}, A_{T-1} \to R_T, S_T$
  • Learning $s, a \to r$ is a regression problem (expectation ... MSE)
    Learning $s, a \to s'$ is a density estimation problem (probability ... KL-divergence)
  • Find parameters $\eta$ that minimize the empirical loss function (MSE / KL-divergence).
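
As a concrete illustration, here is a minimal Python sketch of turning one episode into supervised-learning pairs. The `episode_to_pairs` helper and its list-based episode format are my own assumptions, not something from the lecture.

```python
# A minimal sketch of framing model learning as supervised learning:
# slice an episode S_1, A_1, R_2, S_2, ..., S_T into training pairs (s, a) -> (r, s').

def episode_to_pairs(states, actions, rewards):
    """states = [S_1, ..., S_T], actions = [A_1, ..., A_{T-1}], rewards = [R_2, ..., R_T]."""
    pairs = []
    for t in range(len(actions)):
        s, a = states[t], actions[t]
        r, s_next = rewards[t], states[t + 1]
        pairs.append(((s, a), (r, s_next)))  # input (s, a), targets (r, s')
    return pairs

# Fit any supervised learner on these pairs:
#   reward model   R_eta: (s, a) -> E[r]          (regression, MSE loss)
#   dynamics model P_eta: (s, a) -> p(s' | s, a)  (density estimation, KL loss)
```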

Table Lookup Model

  • The model is an explicit MDP, with $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$
    $\hat{\mathcal{P}}^{a}_{s,s'}=\cfrac{1}{N(s,a)}\sum\limits_{t=1}^T\mathbf{1}(S_t,A_t,S_{t+1}=s,a,s')$
    $\hat{\mathcal{R}}^{a}_s=\cfrac{1}{N(s,a)}\sum\limits_{t=1}^T\mathbf{1}(S_t,A_t=s,a)R_t$
    where $N(s,a)$ is the number of visits to each state-action pair.
  • Alternatively (see the sketch after the example below),
    • At each time step t, record the experience tuple
      $\langle S_t, A_t, R_{t+1}, S_{t+1} \rangle$
    • To sample from the model, randomly pick a tuple matching $\langle s, a, \cdot, \cdot \rangle$
  • Example:
    Image from: here
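
Below is a minimal sketch of the "alternatively" variant above: store experience tuples per $(s,a)$ and sample from them, which is equivalent to sampling from $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$. The class and method names are hypothetical.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Minimal table-lookup model: record tuples and sample one matching (s, a)."""

    def __init__(self):
        self.experience = defaultdict(list)      # (s, a) -> [(r, s'), ...]

    def update(self, s, a, r, s_next):
        self.experience[(s, a)].append((r, s_next))

    def sample(self, s, a):
        """Draw (r, s') as if sampling from P-hat and R-hat."""
        return random.choice(self.experience[(s, a)])

    def expected_reward(self, s, a):
        """R-hat(s, a): mean of the observed rewards for (s, a)."""
        rewards = [r for r, _ in self.experience[(s, a)]]
        return sum(rewards) / len(rewards)
```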

Planning with a Model

  • Given a model $\mathcal{M}_\eta=\langle\mathcal{P}_\eta,\mathcal{R}_\eta\rangle$,
    solve the MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
    using planning algorithms such as:
    • Value iteration
    • Policy iteration
    • Tree search
    • ...
  • Sample-Based Planning
    Generate lots of samples with the model and apply RL to them! (A code sketch follows this list.)
    • Use the model only to generate samples:
      $S_{t+1} \sim \mathcal{P}_\eta(S_{t+1}|S_t,A_t)$
      $R_{t+1} = \mathcal{R}_\eta(R_{t+1}|S_t,A_t)$
    • Apply model-free RL to those samples, e.g.:
      • Monte-Carlo control
      • Sarsa
      • Q-learning
    • This works well as long as the model is accurate.
      Unseen transitions/rewards cannot be learned, so it is not good when the model is incomplete.
  • Planning with an inaccurate model
    • Given an imperfect model $\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle \neq \langle \mathcal{P}, \mathcal{R} \rangle$,
      the performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
    • When the model is inaccurate, the planning process will compute a suboptimal policy
    • Solutions:
      1) Use model-free RL
      2) Reason explicitly about model uncertainty (e.g. a Bayesian approach)
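
A minimal sketch of sample-based planning, assuming the `TableLookupModel.sample` interface sketched earlier and using Q-learning as the model-free learner (one of the options listed above). The `visited_pairs` argument is an assumption for illustration.

```python
import random
from collections import defaultdict

def sample_based_planning(model, visited_pairs, actions, n_samples,
                          alpha=0.1, gamma=0.9):
    """visited_pairs: (s, a) pairs the model has data for (an assumption)."""
    Q = defaultdict(float)
    for _ in range(n_samples):
        s, a = random.choice(visited_pairs)
        r, s_next = model.sample(s, a)                     # S' ~ P_eta, R = R_eta
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])          # Q-learning update
    return Q
```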

Integrated Architectures

Dyna

  • We consider two sources of experience
    • Real experience (sampled from the environment, the true MDP)
      $S' \sim \mathcal{P}^a_{s,s'}$
      $R=\mathcal{R}^a_s$
    • Simulated experience (sampled from the model, the approximate MDP)
      $S' \sim \mathcal{P}_\eta(S'|S,A)$
      $R = \mathcal{R}_\eta(R|S,A)$
  • Dyna, integrating learning and planning
    • Learn a model from real experience
    • Learn and plan value function (and/or policy) from real and simulated experience
      Image from: here
    • It combines the advantages of both approaches, making it more efficient!
  • Dyna-Q algorithm (a code sketch follows below)
    Image from: here
    Note: each real transition is used to update $Q$ and the model once each; then n simulated transitions are generated from the model to update $Q$ further.
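
A minimal tabular Dyna-Q sketch. The `env.reset()` / `env.step(s, a) -> (r, s_next, done)` interface and the deterministic table-lookup model are assumptions made for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes, n=10, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s'): table-lookup model
    visited = []                                 # (s, a) pairs seen in real experience
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action from the current Q
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            r, s_next, done = env.step(s, a)
            # (a) direct RL: one Q-learning update from the real transition
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (b) model learning: one update of the (deterministic) model
            if (s, a) not in model:
                visited.append((s, a))
            model[(s, a)] = (r, s_next)
            # (c) planning: n extra Q updates from simulated transitions
            for _ in range(n):
                ps, pa = random.choice(visited)
                pr, ps_next = model[(ps, pa)]
                ptarget = pr + gamma * max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```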

Forward Search

Planning via search when the model is known.

  • Select the best action by look-ahead
  • Build a search tree with the current state $s_t$ at the root,
    using a model of the MDP to look ahead
    (i.e., use the model to build a search tree rooted at the current state $s_t$.)
    Image from: here
  • By solving the sub-MDP starting from the current state, there is no need to solve the entire MDP
    (solving the entire MDP is generally harder and takes far more resources).

Simulation-Based Search

  • Use the model to generate simulated episodes starting from $s_t$, then apply model-free RL to them.
    • Sampling
      $\{s^k_t, A^k_t, R^k_t,...,S^k_T\}^K_{k=1} \sim \mathcal{M}_\nu$
    • Model-free RL
      • Monte-Carlo control $\to$ Monte-Carlo search
      • Sarsa $\to$ TD search

Simple Monte-Carlo Search

  • Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$
  • For each action $a\in\mathcal{A}$
    • Simulate K episodes from the current (real) state $s_t$ using the simulation policy $\pi$
      $\{s_t, a, R_{t+1}^k,S_{t+1}^k,A_{t+1}^k,...,S_T^k\}^K_{k=1}\sim\mathcal{M}_\nu,\pi$
    • Evaluate each action by its mean return (Monte-Carlo evaluation)
      $q_\pi(s_t,a) \approx Q(s_t, a)=\cfrac{1}{K}\sum\limits_{k=1}^K G_t$
      Note: $q_\pi$ is the real action-value function
  • Select the current (real) action with the maximum value (a code sketch follows)
    $a_t=\argmax\limits_{a\in\mathcal{A}}Q(s_t,a)$
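
A minimal sketch of the simple MC search above. `model.sample(s, a)`, `terminal(s)`, and `sim_policy` (standing in for the simulation policy $\pi$) are assumed interfaces, not part of the lecture.

```python
def simple_mc_search(model, s_t, actions, sim_policy, K=50, gamma=1.0,
                     max_depth=100, terminal=lambda s: False):
    def rollout(s, a):
        """Simulate one episode that starts with action a; return its return G_t."""
        G, discount = 0.0, 1.0
        for _ in range(max_depth):
            r, s = model.sample(s, a)
            G += discount * r
            discount *= gamma
            if terminal(s):
                break
            a = sim_policy(s)                    # continue with the simulation policy
        return G
    # Q(s_t, a): mean return over K simulated episodes per candidate action
    Q = {a: sum(rollout(s_t, a) for _ in range(K)) / K for a in actions}
    return max(Q, key=Q.get)                     # a_t = argmax_a Q(s_t, a)
```

For example, a uniformly random simulation policy would be `sim_policy = lambda s: random.choice(actions)`.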

Monte-Carlo Tree Search (MCTS)

(In the Simple MC search above, K episodes were simulated per action; here we build a tree instead.)

  • MCTS algorithm (a code sketch follows at the end of this section)
    • Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$
    • Simulate K episodes from the current (real) state $s_t$ using the simulation policy $\pi$
      $\{s_t, A_t^k, R_{t+1}^k,S_{t+1}^k,A_{t+1}^k,...,S_T^k\}^K_{k=1}\sim\mathcal{M}_\nu,\pi$
    • Build a search tree containing the states and actions visited so far
    • Evaluate $Q(s,a)$ by the mean return of episodes from $s, a$
      $q_\pi(s,a) \approx Q(s,a)=\cfrac{1}{N(s,a)}\sum\limits_{k=1}^K\sum\limits_{u=t}^T\mathbf{1}(S_u,A_u=s,a)G_u$
      where $G_u$ is the return from $S_u, A_u$ onwards (i.e., the return accumulated below the tree node corresponding to $(s,a)$).
      Note: in the Simple MC search above, $Q$ was only estimated for the current state $s_t$; here $Q$ is computed for every $(s,a)$ visited after $s_t$.
      Note 2: the tree acts as a kind of memory.
    • After the search is finished, select the current (real) action with the maximum value in the search tree
      $a_t=\argmax\limits_{a\in\mathcal{A}}Q(s_t,a)$
      Note: the reason for building a tree is that a state $S'$ (including $s_t$ itself) can reappear at steps after t, so $Q$ can be estimated more efficiently by pooling those visits.
  • In other words,
    • Each simulation consists of two phases (in-tree, out-of-tree)
      • In-tree: pick actions to maximize $Q(S, A)$ (tree policy, improves, exploitation)
      • Out-of-tree: pick actions randomly (default policy, fixed, exploration)
    • Simulate K episodes with the default policy
      $\Rightarrow$ Evaluate states $Q(S,A)$ by Monte-Carlo evaluation
      $\Rightarrow$ Improve the tree policy, e.g., by $\epsilon$-greedy
      Note: this converges on the optimal search tree, i.e., $Q(S,A) \to q_*(S,A)$
  • Advantages of MC Tree Search
    • Highly selective best-first search
    • Evaluates states dynamically (evaluates the current state,
      whereas DP evaluates the entire state space)
    • Uses sampling to break the curse of dimensionality
    • Works for "black-box" models (only requires samples)
    • Computationally efficient, anytime, parallelizable
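
A minimal MCTS sketch along the lines above, with an $\epsilon$-greedy tree policy and a uniformly random default policy (the lecture leaves the tree policy open; UCT is the usual choice in practice). `model.sample` and `terminal` are the same assumed interfaces as before, and the whole simulated path is added to the tree, matching the "containing visited states and actions" description above.

```python
import random
from collections import defaultdict

def mc_tree_search(model, s_t, actions, K=200, gamma=1.0, eps=0.1,
                   max_depth=100, terminal=lambda s: False):
    Q, N = defaultdict(float), defaultdict(int)
    tree = {s_t}                                  # states contained in the search tree
    for _ in range(K):
        s, path, rewards = s_t, [], []
        for _ in range(max_depth):
            if terminal(s):
                break
            if s in tree and random.random() > eps:
                a = max(actions, key=lambda b: Q[(s, b)])   # in-tree: tree policy
            else:
                a = random.choice(actions)                  # out-of-tree: default policy
            r, s_next = model.sample(s, a)
            path.append((s, a))
            rewards.append(r)
            tree.add(s_next)                      # grow the tree with visited states
            s = s_next
        # Monte-Carlo backup: running mean of returns below every visited (s, a) node
        G = 0.0
        for (s_u, a_u), r_u in zip(reversed(path), reversed(rewards)):
            G = r_u + gamma * G
            N[(s_u, a_u)] += 1
            Q[(s_u, a_u)] += (G - Q[(s_u, a_u)]) / N[(s_u, a_u)]
    return max(actions, key=lambda a: Q[(s_t, a)])          # real action a_t
```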

TD Search

Use TD (bootstrapping) instead of MC for control.

  • MC tree search applies MC control to the sub-MDP from now
    TD search applies Sarsa to the sub-MDP from now
  • TD search algorithm (a code sketch follows below)
    • Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$
    • Simulate K episodes from the current (real) state $s_t$ using the simulation policy $\pi$
      $\{s_t, A_t^k, R_{t+1}^k,S_{t+1}^k,A_{t+1}^k,...,S_T^k\}^K_{k=1}\sim\mathcal{M}_\nu,\pi$
    • Build a search tree containing the states and actions visited so far
    • For each step of a simulation, evaluate $Q(s,a)$ by Sarsa
      $Q(S_t,A_t) \gets Q(S_t,A_t) + \alpha(R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t))$
    • Select actions based on the action-value function $Q$
      $a_t=\argmax\limits_{a\in\mathcal{A}}Q(s_t,a)$
  • TD has lower variance (but some bias), and is thus more efficient than MC
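
A minimal TD-search sketch: the same simulation loop as the MCTS sketch above, but with a Sarsa update at every simulated step and an $\epsilon$-greedy simulation policy over the current $Q$. Again, `model.sample` and `terminal` are assumed interfaces.

```python
import random
from collections import defaultdict

def td_search(model, s_t, actions, K=200, alpha=0.1, gamma=1.0, eps=0.1,
              max_depth=100, terminal=lambda s: False):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda b: Q[(s, b)])

    for _ in range(K):
        s, a = s_t, eps_greedy(s_t)
        for _ in range(max_depth):
            r, s_next = model.sample(s, a)
            a_next = eps_greedy(s_next)
            # Sarsa update on the simulated transition (bootstraps from Q(s', a'))
            bootstrap = 0.0 if terminal(s_next) else Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
            if terminal(s_next):
                break
            s, a = s_next, a_next
    return max(actions, key=lambda a: Q[(s_t, a)])   # real action a_t
```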

Dyna-2

  • In Dyna-2, the agent stores two sets of feature weights
    • Long-term memory:
      updated from real experience using TD learning;
      learns general domain knowledge that applies to any episode
    • Short-term (working) memory:
      updated from simulated experience using TD search;
      learns specific local knowledge about the current situation
    • See the Dyna-2 paper for details.

If you spot any typos or mistakes, please let me know in the comments!
