Silver RL (6) Value Function Approximation

Sanghyeok Choi · January 2, 2022 · Intro_to_RL series (6/9)

These are my notes on Lecture 6: Value Function Approximation (Youtube) from David Silver's Introduction to Reinforcement Learning (Website).

Introduction

Large-Scale Reinforcement Learning

  • RL can be used to solve large problems, e.g.
    • Backgammon: $10^{20}$ states
    • Computer Go: $10^{170}$ states
    • Helicopter: continuous state space
  • So far we have treated $Q(s,a)$ as a lookup table and updated each entry of that table.
    But when there are as many states as in the examples above, or when the state space is continuous, such a table is impossible to build or hopelessly inefficient.
  • How can we scale up the model-free methods for prediction and control from the last two lectures?
    => Value function approximation

Value Function Approximation

  • For large MDPs:
    • Estimate value function with function approximation
      • $\hat{v}(s,\bold{w}) \approx v_\pi(s)$
      • $\hat{q}(s,a,\bold{w}) \approx q_\pi(s,a)$
    • Generalize from seen states to unseen states
    • Update parameter $\bold{w}$ using MC or TD learning
  • Function Approximators
    • Linear combinations of features
    • Neural network
    • Decision tree
    • Nearest neighbor
    • ...
  • Note: We will only consider differentiable approximators, namely linear combinations of features and neural networks.

Incremental Methods

Gradient Descent

  • Let $J(\bold{w})$ be a differentiable function of the parameter vector $\bold{w}$
  • The gradient of $J(\bold{w})$ is:
    $\nabla_{\bold{w}}J(\bold{w}) = \begin{pmatrix} \cfrac{\partial{J(\bold{w})}}{\partial{w_1}}\\ \vdots\\ \cfrac{\partial{J(\bold{w})}}{\partial{w_n}} \end{pmatrix}$
  • To find a local minimum of $J(\bold{w})$, adjust $\bold{w}$ in the direction of $-\nabla_{\bold{w}}J(\bold{w})$
  • i.e., $\bold{w} \gets \bold{w} - \cfrac{1}{2}\alpha\nabla_{\bold{w}}J(\bold{w})$, where $\alpha$ is a step-size parameter (see the small sketch below)
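
To make the update rule concrete, here is a minimal, purely illustrative sketch that runs this gradient-descent rule on a toy quadratic objective $J(\bold{w}) = \|A\bold{w}-b\|^2$; the matrix `A`, vector `b`, step size, and iteration count are arbitrary choices, not anything from the lecture.

```python
import numpy as np

# Toy objective J(w) = ||A w - b||^2 (made up for illustration).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])

def J(w):
    return np.sum((A @ w - b) ** 2)

def grad_J(w):
    return 2 * A.T @ (A @ w - b)

w = np.zeros(2)
alpha = 0.1                       # step-size parameter
for _ in range(100):
    w -= 0.5 * alpha * grad_J(w)  # w <- w - (1/2) * alpha * grad J(w)

print(w, J(w))                    # w approaches the minimizer [0.5, -1.0]
```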

Value Function Approx. By Stochastic Gradient Descent

  • Assume for now that we know $v_\pi(s)$, and let $\hat{v}(s,\bold{w})$ be our value function approximator.
  • Let $J(\bold{w})=\mathbb{E}_\pi{[(v_\pi(S)-\hat{v}(S,\bold{w}))^2]}$
    (the MSE between $v_\pi(s)$ and $\hat{v}(s,\bold{w})$; minimizing $J$ brings $\hat{v}$ closer to $v_\pi$.)
  • Then,
    $\nabla_{\bold{w}}J(\bold{w})=-\mathbb{E}_\pi{[2(v_\pi(S)-\hat{v}(S,\bold{w}))\nabla_{\bold{w}}\hat{v}(S,\bold{w})]}$,
    and,
    $\Delta\bold{w}=-\cfrac{1}{2}\alpha\nabla_{\bold{w}}J(\bold{w})=\alpha\mathbb{E}_\pi{[(v_\pi(S)-\hat{v}(S,\bold{w}))\nabla_{\bold{w}}\hat{v}(S,\bold{w})]}$
  • Stochastic gradient descent samples the gradient:
    $\Delta\bold{w}=\alpha(v_\pi(S)-\hat{v}(S,\bold{w}))\nabla_{\bold{w}}\hat{v}(S,\bold{w})$

Linear Value Function Approximation

  • Represent state by a feature vector
    $\bold{x}(S)=\begin{pmatrix} x_1(S)\\ \vdots\\ x_n(S) \end{pmatrix}$
    • e.g., a robot's 3D coordinates (x/y/z distance from a reference point), or the prices of the stocks in a portfolio
  • Represent value function by a linear combination of features
    $\hat{v}(S,\bold{w})=\bold{x}(S)^T\bold{w}=\sum\limits_{j=1}^{n}x_j(S)w_j$
  • Objective function is quadratic in parameters $\bold{w}$
    $J(\bold{w})=\mathbb{E}_\pi{[(v_\pi(S)-\bold{x}(S)^T\bold{w})^2]}$
    (still assuming that we know $v_\pi$!)
  • Stochastic gradient descent converges to the global optimum
    ($\because$ convex optimization)
  • $\nabla_\bold{w}\hat{v}(S,\bold{w})=\bold{x}(S)$
    $\therefore \Delta\bold{w}=\alpha(v_\pi(S)-\hat{v}(S,\bold{w}))\nabla_{\bold{w}}\hat{v}(S,\bold{w})=\alpha(v_\pi(S)-\hat{v}(S,\bold{w}))\bold{x}(S)$
    Update = step-size $\times$ prediction error $\times$ feature value
    Note: the size of the update is proportional to the feature value, i.e., the weights of features with larger values change by more.
    Note 2: as we keep updating $\bold{w}$, $\hat{v}(S,\bold{w})=\bold{x}(S)^T\bold{w} \approx v_\pi(S)$ (a code sketch of this update follows below)
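
A minimal sketch of this linear SGD update, assuming (only for illustration) that the true values $v_\pi$ are available as supervised targets; the feature vectors and target values below are made up.

```python
import numpy as np

def sgd_update(w, x_s, v_target, alpha=0.1):
    """One linear SGD step: w += alpha * (target - x(S)^T w) * x(S)."""
    v_hat = x_s @ w
    return w + alpha * (v_target - v_hat) * x_s

# Toy usage: 3 features and two made-up samples of (feature vector, true value).
w = np.zeros(3)
samples = [(np.array([1.0, 0.0, 2.0]), 3.0),
           (np.array([0.0, 1.0, 1.0]), 1.5)]
for _ in range(200):
    for x_s, v_pi in samples:
        w = sgd_update(w, x_s, v_pi)
print(w)  # fits both samples: x_s @ w is close to the targets
```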

Table Lookup Features

  • Table lookup is a special case of linear value function approximation

  • $\bold{x}^{table}(S)=\begin{pmatrix} \bold{1}(S=s_1)\\ \vdots\\ \bold{1}(S=s_n) \end{pmatrix}$

  • $\hat{v}(S,\bold{w})=\begin{pmatrix}\bold{1}(S=s_1)\\ \vdots\\ \bold{1}(S=s_n) \end{pmatrix} \cdot \begin{pmatrix}w_1\\ \vdots\\ w_n\end{pmatrix}$

    $\hat{v}(s_k,\bold{w})=w_k$

  • $\therefore \bold{w}$ converges to $v_\pi$ (each $w_k$ plays the role of the table entry for $s_k$, as the short check below illustrates)
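
A quick check of this special case (illustrative, not from the slides): with one-hot "table lookup" features, the linear SGD update reduces to updating a single table entry $w_k$.

```python
import numpy as np

n_states = 4

def x_table(s):
    """One-hot feature vector: x_k(S) = 1(S == s_k)."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.zeros(n_states)          # w_k plays the role of the table entry V(s_k)
alpha, s, target = 0.5, 2, 10.0
w += alpha * (target - x_table(s) @ w) * x_table(s)  # only w[2] changes
print(w)                        # [0. 0. 5. 0.]
```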

Incremental Prediction Algorithm

  • Now assume that we do not know $v_\pi(S)$ (the more general situation).
    • No supervisor, but we have rewards!
  • In practice, we substitute a target for $v_\pi(s)$
    • Previously we adjusted $\hat{v}(s,\bold{w})$ to move toward $v_\pi(s)$:
      $J(\bold{w})=\mathbb{E}_\pi{[(v_\pi(S)-\hat{v}(S,\bold{w}))^2]}$
    • Now, since $v_\pi(s)$ is unavailable, we adjust $\hat{v}(s,\bold{w})$ to move toward a target instead.
      => $J(\bold{w})=\mathbb{E}_\pi{[(\text{target}-\hat{v}(S,\bold{w}))^2]}$
      => $\Delta\bold{w}=-\cfrac{1}{2}\alpha\nabla_{\bold{w}}J(\bold{w})=\alpha(\text{target}-\hat{v}(S,\bold{w}))\nabla_{\bold{w}}\hat{v}(S,\bold{w})$ ... (SGD)
      Note: the target is treated as a constant! It is the reference point we want to move toward, so even when the target contains $\bold{w}$ (as in TD), we do not differentiate through it. If we differentiated through the TD target as well, the reference point (the estimated value one step ahead) could no longer serve as a reference, and learning would break down. (See the lecture video, 45:55~.)
  • Targets
    • MC: $G_t$
    • TD(0): $R_{t+1}+\gamma{\hat{v}(S_{t+1},\bold{w})}$
    • TD($\lambda$): $G^\lambda_t$
  • MC with Value Function Approximation
    • The return $G_t$ is an unbiased but noisy sample of $v_\pi(S_t)$
    • Apply supervised learning to the "training data":
      $\langle S_1,G_1 \rangle, \langle S_2,G_2 \rangle, ..., \langle S_T,G_T \rangle$
    • Linear MC policy evaluation
      (with $\hat{v}$ defined as the linear combination of features from before)
      $\Delta\bold{w}=\alpha(G_t-\hat{v}(S_t,\bold{w}))\bold{x}(S_t)$
    • MC with a linear value function converges to the global optimum;
      MC with a non-linear value function converges to a local optimum
  • TD(0) with Value Function Approximation
    • The TD-target $R_{t+1}+\gamma \hat{v}(S_{t+1}, \bold{w})$ is a biased sample of the true value $v_\pi(S_t)$
    • Again, apply supervised learning to the "training data":
      $\langle S_1, R_2+ \gamma \hat{v}(S_2,\bold{w}) \rangle, \langle S_2, R_3+ \gamma \hat{v}(S_3,\bold{w}) \rangle, ..., \langle S_{T-1}, R_T\rangle$
    • Linear TD(0) policy evaluation
      $\Delta\bold{w}=\alpha(R_{t+1}+\gamma\hat{v}(S_{t+1},\bold{w})-\hat{v}(S_t,\bold{w}))\bold{x}(S_t)$
    • Linear TD(0) converges close to the global optimum, but it is not guaranteed to reach the global optimum itself
  • TD($\lambda$) with Value Function Approximation
    • The $\lambda$-return $G_t^\lambda$ is also a biased sample of the true value $v_\pi(s)$
    • Again, apply supervised learning to the "training data":
      $\langle S_1,G_1^\lambda \rangle, \langle S_2,G_2^\lambda \rangle, ..., \langle S_T,G_T^\lambda \rangle$
    • Forward view linear TD($\lambda$)
      $\Delta\bold{w}=\alpha(G_t^\lambda-\hat{v}(S_t,\bold{w}))\bold{x}(S_t)$
    • Backward view linear TD($\lambda$)
      $\delta_t=R_{t+1}+\gamma\hat{v}(S_{t+1},\bold{w})-\hat{v}(S_t,\bold{w})$
      $E_t=\gamma\lambda E_{t-1}+\bold{x}(S_t)$
      $\Delta\bold{w}=\alpha\delta_t E_t$
      Note: $\nabla_\bold{w}\hat{v}(S_t,\bold{w})=\bold{x}(S_t)$, i.e., each feature is given eligibility in proportion to how responsible it is for $\hat{v}$.
      Put differently, in the tabular case we incremented the eligibility of the visited state $S_t$ by 1; with linear FA we can think of giving each feature of $S_t$ eligibility in proportion to its magnitude.

      Note 2: because $E_t$ also keeps weight on the features of previously visited states, the update works in the backward view!
    • Forward view and backward view linear TD($\lambda$) are equivalent!
      (A short code sketch of the MC and TD(0) prediction updates follows this list.)
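
As a rough, illustrative sketch of the prediction updates above: the functions below apply linear MC and semi-gradient TD(0) updates to one episode; the episode format (a list of (feature vector, next reward) pairs) and all numbers are assumptions made for the example.

```python
import numpy as np

def mc_prediction_update(w, episode, alpha=0.05, gamma=1.0):
    """Linear MC: move v_hat(S_t, w) toward the return G_t.
    `episode` is a list of (x(S_t), R_{t+1}) pairs (assumed format)."""
    G, pairs = 0.0, []
    for x_t, r_next in reversed(episode):      # compute returns backwards
        G = r_next + gamma * G
        pairs.append((x_t, G))
    for x_t, G in reversed(pairs):
        w = w + alpha * (G - x_t @ w) * x_t    # dw = a (G_t - v_hat) x(S_t)
    return w

def td0_prediction_update(w, episode, alpha=0.05, gamma=1.0):
    """Linear semi-gradient TD(0): bootstrap from v_hat(S_{t+1}, w),
    treating the target as a constant."""
    for t, (x_t, r_next) in enumerate(episode):
        v_next = episode[t + 1][0] @ w if t + 1 < len(episode) else 0.0
        td_target = r_next + gamma * v_next
        w = w + alpha * (td_target - x_t @ w) * x_t
    return w

# Toy usage: a made-up 2-step episode with 2 features.
w = np.zeros(2)
episode = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), 0.0)]
for _ in range(200):
    w = td0_prediction_update(w, episode)
print(w)  # close to [1., 0.] for this toy episode
```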

Incremental Control Algorithm

  • Just as model-free control used $Q$ rather than $V$, here we apply FA to the action-value function $q$.
    • Policy Evaluation: approximate policy evaluation, $\hat{q}(\cdot,\cdot,\bold{w}) \approx q_\pi$
    • Policy Improvement: $\epsilon$-greedy policy improvement
  • Action-Value Function Approximation
    • We want:
      $\hat{q}(S,A,\bold{w}) \approx q_\pi(S,A)$
    • To achieve this,
      $J(\bold{w})=\mathbb{E}_\pi[(q_\pi(S,A)-\hat{q}(S,A,\bold{w}))^2]$ ... MSE
      Note: minimizing $J(\bold{w})$ brings $\hat{q}(S,A,\bold{w})$ closer to $q_\pi(S,A)$
    • Use SGD to find a local minimum
      - $\bold{w} \gets \bold{w} + \Delta\bold{w}$
      - $\Delta\bold{w}=-\frac{1}{2}\alpha\nabla_\bold{w}J(\bold{w})=\alpha(q_\pi(S,A)-\hat{q}(S,A,\bold{w}))\nabla_{\bold{w}}\hat{q}(S,A,\bold{w})$
  • Linear Action-Value Function Approximation
    • Represent state and action by a feature vector
      $\bold{x}(S,A)=\begin{pmatrix} x_1(S,A)\\ \vdots\\ x_n(S,A) \end{pmatrix}$
    • Represent action-value function by linear combination of features
      $\hat{q}(S,A,\bold{w})=\bold{x}(S,A)^T\bold{w}=\sum\limits_{j=1}^{n}x_j(S,A)w_j$
    • Stochastic gradient descent update
      $\nabla_\bold{w}\hat{q}(S,A,\bold{w})=\bold{x}(S,A)$
      $\therefore \Delta\bold{w}=\alpha(q_\pi(S,A)-\hat{q}(S,A,\bold{w}))\bold{x}(S,A)$
  • Linear Incremental Control Algorithms
    • Like prediction, we must substitute a target for $q_\pi(S,A)$
      • MC: $G_t$
      • TD(0): $R_{t+1}+\gamma{\hat{q}(S_{t+1},A_{t+1},\bold{w})}$
      • TD($\lambda$): $q^\lambda_t$
    • For MC, TD(0), or forward-view TD($\lambda$)
      • $\Delta\bold{w}=\alpha(\text{target}-\hat{q}(S,A,\bold{w}))\bold{x}(S,A)$
    • For backward-view TD($\lambda$), (see the SARSA($\lambda$)-style sketch after this list)
      • $\delta_t=R_{t+1}+\gamma\hat{q}(S_{t+1},A_{t+1},\bold{w})-\hat{q}(S_t,A_t,\bold{w})$
      • $E_t=\gamma\lambda E_{t-1}+\nabla_\bold{w}\hat{q}(S_t,A_t,\bold{w})=\gamma\lambda E_{t-1}+\bold{x}(S_t,A_t)$
      • $\Delta\bold{w}=\alpha\delta_t E_t$
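
A sketch of how these control updates fit together as backward-view linear SARSA($\lambda$) with an $\epsilon$-greedy policy; the state-action feature representation and the single-transition interface are assumptions made for illustration, not the exact algorithm from the slides.

```python
import numpy as np

def epsilon_greedy(w, x_per_action, eps=0.1):
    """Pick an action index: x_per_action[a] is the feature vector x(s, a)."""
    if np.random.rand() < eps:
        return np.random.randint(len(x_per_action))
    return int(np.argmax([x_sa @ w for x_sa in x_per_action]))

def sarsa_lambda_step(w, E, x_sa, r, x_next_sa, done,
                      alpha=0.1, gamma=0.99, lam=0.9):
    """One backward-view SARSA(lambda) update on a single transition.
    x_sa = x(S_t, A_t), x_next_sa = x(S_{t+1}, A_{t+1})."""
    q_next = 0.0 if done else x_next_sa @ w
    delta = r + gamma * q_next - x_sa @ w   # TD error delta_t
    E = gamma * lam * E + x_sa              # eligibility trace E_t
    w = w + alpha * delta * E               # dw = alpha * delta_t * E_t
    return w, E

# Toy usage with made-up feature vectors for one transition.
w, E = np.zeros(3), np.zeros(3)
w, E = sarsa_lambda_step(w, E, np.array([1.0, 0.0, 1.0]), 1.0,
                         np.array([0.0, 1.0, 0.0]), False)
print(w, E)
```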

MC vs. TD

  • Example: Mountain Car

    Image from: here
    • Recall: TD(1) $\equiv$ MC
    • MC fails to learn properly here ($\because$ high variance).
      (In general, bootstrapping tends to work better!)
  • Example 2: Baird's Counterexample (trained with TD(0))

    Image from: here
    • The parameter $\theta$ explodes
    • TD does not always guarantee convergence!
  • Convergence of Prediction Algorithms
    Image from: here
    • The reason TD can diverge with off-policy learning or non-linear FA is that the TD update is not the gradient of any objective function! (See the note in the Incremental Prediction Algorithm section above.)
    • Gradient TD follows the true gradient of the projected Bellman error and therefore converges! (Details omitted.)
  • Convergence of Control Algorithms
    Image from: here
    • Recall,
      Sarsa -> on-policy TD control
      Q-learning -> off-policy TD control

Batch Methods

Batch Reinforcement Learning

  • The algorithms discussed so far take samples one at a time and update the weights after each sample
  • Simple, but not sample efficient (it takes many updates)
  • Each sample is used for only one update (reusing each sample for several updates would be better)
  • Batch methods seek to find the best fitting value function given the agent's experience (= the batch),
    i.e., the value function that best fits all the samples in the batch.

Least Squares Prediction

  • Using linear value function approximation $\hat{v}(s,\bold{w})=\bold{x}(s)^T\bold{w}$
    and experience $\mathcal{D}=\{\langle s_1,v_1^\pi \rangle, \langle s_2,v_2^\pi \rangle, ..., \langle s_T,v_T^\pi \rangle\}$,
    which parameters $\bold{w}$ give the best fitting value function $\hat{v}(s,\bold{w})$?
  • Least squares algorithms minimize the sum-squared error between $\hat{v}(s_t,\bold{w})$ and the target values $v_t^\pi$:
    $LS(\bold{w})=\sum\limits_{t=1}^{T}(v_t^\pi-\hat{v}(s_t,\bold{w}))^2$
    • Because $\hat{v}$ is assumed to be a linear function of $\bold{w}$, $\argmin\limits_{\bold{w}}LS(\bold{w})$ has a closed-form solution.
      Let $\bold{w}^*=\argmin\limits_{\bold{w}}LS(\bold{w})$, then $\mathbb{E}_\mathcal{D}[\Delta\bold{w}^*]=0$
      $\iff \alpha\mathbb{E}_\mathcal{D}[\nabla_{\bold{w}}LS(\bold{w}^*)]=0$
      $\iff \alpha\sum\limits_{t=1}^{T}(v^\pi_t-\bold{x}(s_t)^T\bold{w}^*)\bold{x}(s_t)=0$ ... (!)
      $\iff \sum\limits_{t=1}^{T}v^\pi_t\bold{x}(s_t)=\sum\limits_{t=1}^{T}(\bold{x}(s_t)^T\bold{w}^*)\bold{x}(s_t)$
      $\iff \sum\limits_{t=1}^{T}v^\pi_t\bold{x}(s_t)=\sum\limits_{t=1}^{T}(\bold{x}(s_t)\bold{x}(s_t)^T)\bold{w}^*$
      $\iff \bold{w}^*=\left( \sum\limits_{t=1}^{T}\bold{x}(s_t)\bold{x}(s_t)^T \right)^{-1}\sum\limits_{t=1}^{T}v^\pi_t\bold{x}(s_t)$
      (For $N$ features, the direct solution takes $O(N^3)$ time)
      • In practice we do not know $v^\pi_t$, so:
        • 1) MC (LSMC)
          $\mathbb{E}_\mathcal{D}[\Delta\bold{w}^*]=0$ with $v_t^\pi=G_t$
          $\implies$ substitute $G_t$ for $v^\pi_t$ in (!)
          $\implies \bold{w}^* = \left( \sum\limits_{t=1}^T\bold{x}(s_t)\bold{x}(s_t)^T \right)^{-1}\sum\limits_{t=1}^{T}G_t\bold{x}(s_t)$
        • 2) TD (LSTD)
          $\mathbb{E}_\mathcal{D}[\Delta\bold{w}^*]=0$ with $v_t^\pi=R_{t+1}+\gamma\hat{v}(s_{t+1},\bold{w})$
          $\implies$ substitute $R_{t+1}+\gamma\hat{v}(s_{t+1},\bold{w})$ for $v^\pi_t$ in (!)
          $\implies \bold{w} = \left( \sum\limits_{t=1}^T\bold{x}(s_t)(\bold{x}(s_t)-\gamma\bold{x}(s_{t+1}))^T \right)^{-1}\sum\limits_{t=1}^{T}R_{t+1}\bold{x}(s_t)$
        • 3) TD($\lambda$) (LSTD($\lambda$))
          $\mathbb{E}_\mathcal{D}[\Delta\bold{w}^*]=0$ with backward-view TD($\lambda$)
          $\implies \sum\limits_{t=1}^{T}\alpha\delta_t E_t=\alpha\sum\limits_{t=1}^T E_t\left(R_{t+1}+\gamma\hat{v}(s_{t+1},\bold{w})-\hat{v}(s_t,\bold{w})\right)=0$
          $\implies \bold{w} = \left( \sum\limits_{t=1}^T E_t(\bold{x}(s_t)-\gamma\bold{x}(s_{t+1}))^T \right)^{-1}\sum\limits_{t=1}^{T}R_{t+1}E_t$
    • Convergence of Least Square Prediction Algorithms
      Image from: here
    • With linear FA, the least squares methods above always guarantee convergence of $\hat{v}$ toward $v_\pi$ (because a closed-form solution exists). (A small LSTD sketch follows this bullet.)
    • In contrast, the incremental prediction methods (even with linear FA) could not guarantee convergence for off-policy TD.
    • These closed-form solutions do not apply in the non-linear case.
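
A minimal LSTD sketch (illustration only): it solves the closed-form expression above from stored transitions. The transition format, the ridge term `reg` (added so the matrix stays invertible on small data sets), and the toy numbers are all assumptions.

```python
import numpy as np

def lstd(transitions, n_features, gamma=0.99, reg=1e-3):
    """LSTD: w = (sum_t x_t (x_t - gamma x_{t+1})^T)^{-1} sum_t R_{t+1} x_t.
    `transitions` is a list of (x_t, r_next, x_next, done) tuples."""
    A = reg * np.eye(n_features)     # small ridge term for invertibility
    b = np.zeros(n_features)
    for x_t, r_next, x_next, done in transitions:
        x_next = np.zeros(n_features) if done else x_next
        A += np.outer(x_t, x_t - gamma * x_next)
        b += r_next * x_t
    return np.linalg.solve(A, b)

# Toy usage: two made-up transitions with 2 features.
data = [(np.array([1.0, 0.0]), 1.0, np.array([0.0, 1.0]), False),
        (np.array([0.0, 1.0]), 0.0, np.array([0.0, 0.0]), True)]
print(lstd(data, n_features=2, gamma=1.0))  # close to [1., 0.]
```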
  • When the function approximator is non-linear or $N$ is large, solving for $\bold{w}$ directly is hard, so the method above cannot be applied.
    => Instead, similar to the incremental methods, use an iterative approach.
    => SGD with experience replay: sample repeatedly from the stored experience.
         (This does not solve the "many updates" problem, though.)
    • Repeat: (a tiny sketch of this loop follows below)
      • Sample a state and value from the experience
        $\langle s, v^\pi \rangle \sim \mathcal{D}$
      • Apply an SGD update
        $\Delta\bold{w}=-\frac{1}{2}\alpha\nabla_{\bold{w}}LS(\bold{w})=\alpha(v^\pi-\hat{v}(s,\bold{w}))\nabla_{\bold{w}}\hat{v}(s,\bold{w})$
        (here the loss is per-sample, so $LS(\bold{w})=(v^\pi-\hat{v}(s,\bold{w}))^2$)
    • Converges to the least squares solution (replay helps convergence!):
      $\bold{w}^\pi=\argmin\limits_{\bold{w}}LS(\bold{w})$
      $\hat{v}(s,\bold{w}^\pi) \approx v_\pi(s)$
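
A tiny sketch of this replay loop for the linear prediction case; the stored pairs of (feature vector, target value) and all constants are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up experience D: pairs <x(s), v^pi>.
D = [(rng.normal(size=3), float(v)) for v in range(5)]

w, alpha = np.zeros(3), 0.05
for _ in range(2000):
    x_s, v_pi = D[rng.integers(len(D))]     # sample <s, v^pi> ~ D
    w = w + alpha * (v_pi - x_s @ w) * x_s  # SGD step toward the LS solution
print(w)
```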
  • Experience Replay in Deep Q-Network (DQN)
    • DQN uses experience replay and fixed Q-targets
      • Experience replay decorrelates the trajectories (a stabilizing role)
      • Fixed Q-targets: we freeze an old copy of the network ($w^-$) and bootstrap towards that frozen target (also stabilizing; a moving target makes learning unstable!)
    • DQN Algorithm
      • Take action $a_t$ according to an $\epsilon$-greedy policy
      • Store the transition $(s_t, a_t, r_{t+1},s_{t+1})$ in replay memory $\mathcal{D}$
      • Repeat: (a toy sketch of this loop follows below)
        • Sample a random mini-batch of transitions $(s, a, r, s')$ from $\mathcal{D}$
        • Compute the Q-learning targets w.r.t. the old, fixed parameters $\bold{w}^-$
          Note: for a sample, the Q-learning target is $r+\gamma\max\limits_{a'}Q(s',a';\bold{w}^-)$
        • Optimize the MSE between the Q-network and the Q-learning targets using a variant of SGD
          $\mathcal{L}_i(\bold{w}_i)=\mathbb{E}_{s,a,r,s'\sim\mathcal{D}_i}\left[\left(r+\gamma\max\limits_{a'}Q(s',a';\bold{w}^-_i)-Q(s,a;\bold{w}_i)\right)^2\right]$
          Note: adjust $\bold{w}$ to best fit, on the current mini-batch, the one-step-ahead Q values obtained with $\bold{w}^-$
        • $\bold{w}^- \gets \bold{w}$
    • DQN in Atari
      Image from: here
      • Input (stack of 4 previous frames) = State
      • Output (linear output layer with 18 elements) = $Q(s,a)$ for 18 joystick/button positions
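
A toy sketch of the two stabilizing ingredients (replay memory + frozen target weights $w^-$), using a simple linear Q-function on random made-up features instead of a deep network; the feature sizes, constants, and data are all arbitrary assumptions for illustration.

```python
import random
from collections import deque
import numpy as np

n_features, n_actions = 4, 2
rng = np.random.default_rng(0)

def q_values(w, s_feat):
    """Linear stand-in for the Q-network: Q(s, .) = w @ x(s)."""
    return w @ s_feat                        # shape: (n_actions,)

w = np.zeros((n_actions, n_features))        # online weights
w_minus = w.copy()                           # frozen target weights w^-
replay = deque(maxlen=10_000)                # replay memory D

# Store made-up transitions (s, a, r, s', done) in D.
for _ in range(500):
    s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
    replay.append((s, int(rng.integers(n_actions)), float(rng.normal()),
                   s_next, False))

alpha, gamma = 0.01, 0.99
for step in range(1000):
    batch = random.sample(list(replay), 32)  # random mini-batch from D
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * np.max(q_values(w_minus, s_next))
        td_error = target - q_values(w, s)[a]
        w[a] += alpha * td_error * s         # SGD step on the squared error
    if step % 100 == 0:
        w_minus = w.copy()                   # periodically refresh w^-
```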

Least Squares Control

  • Least Squares Policy Iteration
    Image from: here
    • Policy Evaluation: Least squares Q-learning
      Policy Improvement: Greedy policy improvement
  • Least Squares Action-Value Function Approximation
    • Approximate the action-value function $q_\pi(s,a)$
      using a linear combination of features $\bold{x}(s,a)$:
      $\hat{q}(s,a,\bold{w})=\bold{x}(s,a)^T\bold{w} \approx q_\pi(s,a)$
    • Minimize the least squares error between $\hat{q}(s,a,\bold{w})$ and $q_\pi(s,a)$ using experience $\mathcal{D}$ generated by policy $\pi$:
      $\mathcal{D}=\{ \langle (s_1,a_1),v_1^\pi \rangle, \langle (s_2,a_2),v_2^\pi \rangle, ..., \langle (s_T,a_T),v_T^\pi \rangle \}$
  • Least Squares Q-learning
    We consider an off-policy setting (the old policy is the behaviour policy) and apply Q-learning:
    • Use experience generated by the old policy
      $S_t, A_t, R_{t+1}, S_{t+1} \sim \pi_{old}$
    • Consider the alternative successor action $A'=\pi_{new}(S_{t+1})$
    • Update $\hat{q}(S_t,A_t,\bold{w})$ towards the value of the alternative action,
      $R_{t+1}+\gamma\hat{q}(S_{t+1},A',\bold{w})$
    • Q-learning update:
      $\delta=R_{t+1}+\gamma\hat{q}(S_{t+1},A',\bold{w})-\hat{q}(S_t,A_t,\bold{w})$
      $\Delta\bold{w}=\alpha\delta\bold{x}(S_t,A_t)$
    • LSTDQ algorithm: set the total update to zero
      $\mathbb{E}_\mathcal{D}[\Delta\bold{w}]=0$
      $\iff \sum\limits_{t=1}^T\alpha\delta\bold{x}(S_t,A_t)=0$
      $\iff \bold{w}=\left( \sum\limits_{t=1}^T\bold{x}(S_t,A_t)(\bold{x}(S_t,A_t)-\gamma\bold{x}(S_{t+1},\pi(S_{t+1})))^T \right)^{-1}\sum\limits_{t=1}^T\bold{x}(S_{t},A_t)R_{t+1}$
  • LSPI-TD with LSTDQ Algorithm
    Image from: here
    • The experience $\mathcal{D}$ is evaluated repeatedly, each time with the most recently improved policy
      => which is why this is off-policy!
    • Compared with off-policy control by Q-learning in Lecture 5, the difference is that updates here are made per batch (= the whole experience), which is more efficient.
      (A rough sketch combining LSTDQ with greedy policy improvement follows below.)
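
A rough sketch of the LSPI loop built on LSTDQ; the state-action feature function `x`, the dataset format, and the greedy-policy construction are assumptions made for illustration, not the exact algorithm from the slides.

```python
import numpy as np

def lstdq(D, x, pi, n_features, gamma=0.99, reg=1e-3):
    """LSTDQ: w = (sum x(s,a)(x(s,a) - gamma x(s', pi(s')))^T)^{-1} sum r x(s,a).
    D: list of (s, a, r, s_next, done); x(s, a): assumed feature function."""
    A = reg * np.eye(n_features)   # small ridge term for invertibility
    b = np.zeros(n_features)
    for s, a, r, s_next, done in D:
        x_sa = x(s, a)
        x_next = np.zeros(n_features) if done else x(s_next, pi(s_next))
        A += np.outer(x_sa, x_sa - gamma * x_next)
        b += r * x_sa
    return np.linalg.solve(A, b)

def lspi(D, x, actions, n_features, n_iters=20, gamma=0.99):
    """LSPI: repeatedly re-evaluate the same batch D with the improved policy."""
    w = np.zeros(n_features)
    for _ in range(n_iters):
        # Greedy policy improvement w.r.t. the current q_hat(s,a,w) = x(s,a).w
        pi = lambda s, w=w: max(actions, key=lambda a: x(s, a) @ w)
        w = lstdq(D, x, pi, n_features, gamma)  # policy evaluation
    return w
```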

If you spot any typos or mistakes, please let me know in the comments!
