Silver RL (7) Policy Gradient

Sanghyeok Choi · January 16, 2022

These notes summarize Lecture 7: Policy Gradient (Youtube) of Professor David Silver's Introduction to Reinforcement Learning (Website).

Introduction

Policy-Based Reinforcement Learning

  • So far, we computed a (state or action) value function and derived the policy from that value function.
    (e.g. $\epsilon$-greedy)
  • However, depending on the problem, the value function can have a more complex structure than the policy itself. (e.g. the Atari game Breakout)
  • In this lecture, we will directly parametrise the policy
    $\pi_\theta(s,a)=\mathbb{P}[a|s,\theta]$
  • We assume a model-free RL setting.

Value-Based and Policy-Based RL

  • Value-Based
    • Learn a value function
    • Implicit policy (e.g. $\epsilon$-greedy)
  • Policy-Based
    • No value function
    • Learn the policy directly
  • Actor-Critic
    • Learn a value function
    • Learn the policy

Image from: here

Advantages of Policy-Based RL

  • Advantages
    • Better convergence properties
    • Effective in high-dimensional or continuous action spaces
      • Value-based methods need an $\argmax\limits_{a} Q(s,a)$ operation, which is expensive
    • Can learn stochastic policies
  • Disadvantages
    • Typically converge to a local rather than global optimum
      • Because we follow the gradient!
    • Evaluating a policy is typically inefficient and has high variance
      • Also a consequence of using gradients: smooth updates => stable but inefficient

Example: Aliased Gridworld

Image from: here

  • The agent cannot differentiate the gray states
    • If states are defined only by features of the immediate surroundings, the two gray squares cannot be distinguished from each other: both have empty cells on either side. (Partially observable MDP, POMDP)
  • An optimal deterministic policy will either
    1) Move left in gray states
    2) Move right in gray states
    • Either way, it can get stuck and never reach the money
    • Even an $\epsilon$-greedy policy is near-deterministic, so it is still undesirable.
      => A stochastic policy is better suited here!
  • An optimal stochastic policy will randomly move left or right in gray states
    $\pi_\theta(gray\text{-}state, \to)=0.5$
    $\pi_\theta(gray\text{-}state, \gets)=0.5$
    Image from: here
    • It will reach the goal state in a few steps with high probability
    • Policy-based RL can learn the optimal stochastic policy

Policy Objective Functions

  • Goal: given a policy $\pi_\theta(s,a)$ with parameters $\theta$, find the best $\theta$
  • Three options to measure the quality of $\pi_\theta$
    1. When the start state is fixed
      $J_1(\theta)=V^{\pi_\theta}(s_1)=\mathbb{E}_{\pi_\theta}[v_1]$
      Intuition: maximize the value obtained onward from $s_1$
      (The same idea works when a distribution over start states is given.)
    2. Continuing environments, average value
      $J_{avV}(\theta)=\sum\limits_{s}d^{\pi_\theta}(s)V^{\pi_\theta}(s)$
      Here, $d^{\pi_\theta}(s)$ is the (stationary) probability of being in state $s$ under policy $\pi_\theta$,
      and $V^{\pi_\theta}(s)$ is the value from that state onward.
    3. Continuing environments, average reward per time-step
      $J_{avR}(\theta)=\sum\limits_{s}d^{\pi_\theta}(s)\left[\sum\limits_a\pi_\theta(s,a)\mathcal{R}^a_s\right]$
      Here, $\sum\limits_a\pi_\theta(s,a)\mathcal{R}^a_s$ is the average immediate reward.
      Intuition: maximize the immediate reward at every time step

Policy Optimization

  • Find $\theta$ that maximizes $J(\theta)$
  • Some approaches do not use the gradient
    e.g. Hill climbing / Simplex / amoeba / Nelder-Mead / Genetic algorithms
  • Greater efficiency is often possible using the gradient
    e.g. Gradient descent / Conjugate gradient / Quasi-Newton
    Since we are maximizing, it is really gradient ascent!

Finite Difference Policy Gradient

Note: the policy $\pi_\theta$ is determined by the parameter $\theta$, and $J(\theta)$ is a policy objective function to maximize.

Computing Gradients by Finite Difference

  • To evaluate the policy gradient of $\pi_\theta(s,a)$
  • For each dimension $k\in[1,n]$
    • Estimate the $k$-th partial derivative of the objective function w.r.t. $\theta$ by perturbing $\theta$ by a small amount $\epsilon$ in the $k$-th dimension:
      $\cfrac{\partial J(\theta)}{\partial\theta_k}\approx\cfrac{J(\theta+\epsilon u_k)-J(\theta)}{\epsilon}$
      where $u_k$ is the unit vector with 1 in the $k$-th component and 0 elsewhere
    • Update $\theta$
      $\theta_k:=\theta_k+\alpha\cfrac{\partial J(\theta)}{\partial\theta_k}$
  • Uses $n$ evaluations to compute the policy gradient in $n$ dimensions
  • Simple, noisy, inefficient, but sometimes effective (a minimal code sketch follows below)
    Why inefficient? For high-dimensional problems this approach breaks down!
    Works for arbitrary policies, even if $J$ is not differentiable!
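Below is a minimal sketch of this estimator, assuming a hypothetical `evaluate_policy(theta)` function (not part of the lecture) that returns a noisy estimate of $J(\theta)$, e.g. the average return of a few rollouts under $\pi_\theta$:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate dJ/dtheta_k by perturbing theta by eps in each dimension k.

    evaluate_policy: callable theta -> (noisy) estimate of J(theta),
    e.g. the mean return of a few episodes run under pi_theta.
    """
    grad = np.zeros_like(theta, dtype=float)
    j_base = evaluate_policy(theta)             # J(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta, dtype=float)
        u_k[k] = 1.0                            # unit vector in the k-th dimension
        grad[k] = (evaluate_policy(theta + eps * u_k) - j_base) / eps
    return grad                                 # n evaluations for n dimensions

# gradient ascent: theta <- theta + alpha * finite_difference_gradient(J_hat, theta)
```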

Monte-Carlo Policy Gradient

Now compute the policy gradient analytically.
Assumptions:
1) $\pi_\theta$ is differentiable whenever it is non-zero,
2) we know the gradient $\nabla_\theta\pi_\theta(s,a)$

Likelihood Ratios and Score Function

  • Likelihood ratios:
    $\nabla_\theta\pi_\theta(s,a)=\pi_\theta(s,a)\cfrac{\nabla_\theta\pi_\theta(s,a)}{\pi_\theta(s,a)}=\pi_\theta(s,a)\nabla_\theta\log\pi_\theta(s,a)$
  • Here, $\nabla_\theta\log\pi_\theta(s,a)$ is called the score function
    Note: since $\pi_\theta\geq0$, updating $\theta$ in the direction of the score function increases $\pi_\theta(s,a)$, i.e. the probability of taking action $a$ in state $s$ goes up; updating in the opposite direction decreases it.
    Note 2: score = gradient of the log-likelihood. It indicates the sensitivity of the likelihood.

Softmax Policy and Gaussian Policy

  • Softmax Policy:
    • $\pi_\theta(s,a) = \cfrac{e^{\phi(s,a)^{\top}\theta}}{\sum\limits_{a'}e^{\phi(s,a')^\top\theta}}$
      where $\phi(s,a)$ is an action feature vector
      (the probability of an action is proportional to its exponentiated weight)
    • The softmax score function is
      $$\begin{aligned}\nabla_\theta\log\pi_\theta(s,a)&=\nabla_\theta\left[\log{e^{\phi(s,a)^{\top}\theta}} - \log\left(\sum\limits_{a'}e^{\phi(s,a')^{\top}\theta}\right)\right]\\&=\phi(s,a)-\sum\limits_{a'}\phi(s,a')\cfrac{e^{\phi(s,a')^\top\theta}}{\sum\limits_{a''}e^{\phi(s,a'')^\top\theta}}\\&=\phi(s,a)-\mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]\end{aligned}$$
      Note: $\phi(s,a)$ is the feature vector of the action we actually took.
      Note 2: $\mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]$ is the average feature vector over all the actions we might have taken.
      Recall: updating in the direction of the score function (e.g. when the reward is $\geq0$) increases $\pi_\theta(s,a)$, i.e. $\theta$ is updated to reinforce the action $a$ that we took.
  • Gaussian Policy:
    • The policy is Gaussian, $a\sim\mathcal{N}(\mu(s),\sigma^2)$
      where $\mu(s)=\phi(s)^\top\theta$ and $\sigma^2$ is fixed (it can also be parametrized)
    • The Gaussian score function is
      $\nabla_\theta\log{\pi_\theta(s,a)}=\cfrac{(a-\mu(s))\phi(s)}{\sigma^2}$
      Note: taking the log of the Gaussian PDF, the terms without $\theta$ are constants, so differentiating with respect to $\theta$ gives the expression above. (Both score functions are sketched in code below.)
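As a concrete (toy) illustration, here is a small numpy sketch of both score functions; the stacked feature matrix `phi_s` and the scalar-action Gaussian case are my own simplifying assumptions, not part of the lecture:

```python
import numpy as np

def softmax_score(phi_s, a, theta):
    """grad_theta log pi_theta(s, a) for a linear softmax policy.

    phi_s: (n_actions, n_features) matrix whose rows are phi(s, a') for each action a'.
    Returns phi(s, a) - E_pi[phi(s, .)].
    """
    logits = phi_s @ theta
    p = np.exp(logits - logits.max())           # subtract max for numerical stability
    p /= p.sum()
    return phi_s[a] - p @ phi_s                 # phi(s,a) - sum_a' pi(a'|s) phi(s,a')

def gaussian_score(phi_state, a, theta, sigma=1.0):
    """grad_theta log pi_theta(s, a) for a Gaussian policy with mean phi(s)^T theta."""
    mu = phi_state @ theta                      # phi_state: state feature vector phi(s)
    return (a - mu) * phi_state / sigma**2
```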

Policy Gradient Theorem

  • One-Step MDPs:
    Starting in state $s \sim d(s)$ and
    terminating after one time-step with reward $r=\mathcal{R}_{s,a}$
    • Compute the policy gradient
      $J(\theta)=\mathbb{E}_{\pi_\theta}[r]=\sum\limits_{s\in\mathcal{S}}d(s)\sum\limits_{a\in\mathcal{A}}\pi_\theta(s,a)\mathcal{R}_{s,a}$ ... the expected reward (value) from the start state (one step!)
      Writing $f_\theta(a|s)=\nabla_\theta\log\pi_\theta(s,a)\mathcal{R}_{s,a}$,
      $$\begin{aligned}\nabla_\theta J(\theta)&=\sum\limits_{s\in\mathcal{S}}d(s)\sum\limits_{a\in\mathcal{A}}\pi_\theta(s,a)\nabla_\theta\log{\pi_\theta(s,a)}\mathcal{R}_{s,a}\\&=\sum\limits_{s\in\mathcal{S}}d(s)\sum\limits_{a\in\mathcal{A}}\pi_\theta(s,a)f_\theta(a|s)\\&=\sum\limits_{s\in\mathcal{S}}d(s)\mathbb{E}_{a}[f_\theta(a|s)]\\&=\mathbb{E}_s\left[\mathbb{E}_{a}[f_\theta(a|s)]\right]\\&=\mathbb{E}_s\left[\mathbb{E}_{a}[\nabla_\theta\log\pi_\theta(s,a)\,r]\right]\\&=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log\pi_\theta(s,a)\,r\right]\end{aligned}$$
      Note: $\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log\pi_\theta(s,a)\,r\right]$ points in the direction in which $\pi_\theta(s,a)\times \mathcal{R}_{s,a}$ grows most steeply (in expectation), so updating $\theta$ in this direction makes $\pi_\theta$ place more probability on actions that yield large rewards. (A numerical check of this one-step identity is sketched after this list.)
  • Policy Gradient Theorem
    • Generalizes the one-step-MDP result to multi-step MDPs
    • Replaces the instantaneous reward $r$ with the long-term value $Q^\pi(s,a)$

      Theorem
      For any differentiable policy $\pi_\theta(s,a)$,
      for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma}J_{avV}$,
      the policy gradient is
      $\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s,a)\,Q^{\pi_\theta}(s,a)]$

      • Proof omitted!
    • In many cases the exact value of $Q^{\pi_\theta}(s,a)$ is hard to obtain.
      -> Use an estimate! (MC/TD)
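To make the one-step result concrete, here is a toy numerical check (my own example, not from the slides): for a single-state bandit with a tabular softmax policy, the sample average of $\nabla_\theta\log\pi_\theta(a)\,r$ matches the analytic gradient of $J(\theta)=\sum_a\pi_\theta(a)\mathcal{R}_a$.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([1.0, 0.0, 2.0])                 # R_{s,a} for one state, three actions
theta = rng.normal(size=3)                    # tabular softmax: one parameter per action

p = np.exp(theta - theta.max()); p /= p.sum() # pi_theta(a)

# Analytic gradient: dJ/dtheta_k = pi(k) * (R_k - sum_a pi(a) R_a)
analytic = p * (R - p @ R)

# Likelihood-ratio estimate: average of grad_theta log pi(a) * r over sampled actions
a = rng.choice(3, size=200_000, p=p)
score = np.eye(3)[a] - p                      # grad_theta log pi(a) = one_hot(a) - pi
estimate = (score * R[a, None]).mean(axis=0)

print(analytic, estimate)                     # the two vectors should be close
```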

Monte-Carlo Policy Gradient (REINFORCE)

  • Update parameters by stochastic gradient ascent,
    using the policy gradient theorem,
    using the return $G_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t,a_t)$:
    $\Delta\theta_t=\alpha\nabla_\theta\log\pi_\theta(s_t,a_t)G_t$
  • REINFORCE Algorithm:

    initialize $\theta$ arbitrarily
    for each episode $\{s_1,a_1,r_2,...,s_{T-1},a_{T-1},r_T\}\sim\pi_\theta$ do
         for $t=1$ to $T-1$ do
             $\theta\gets\theta+\alpha\nabla_\theta\log\pi_\theta(s_t,a_t)G_t$
         end for
    end for
    return $\theta$

  • Because REINFORCE is based on MC, it is very slow ($\because$ high variance).
    Actor-Critic improves efficiency by also using a value function approximator. (A minimal Python sketch of the REINFORCE loop is given below.)
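The following is a minimal Python sketch of the REINFORCE loop above for a linear softmax policy. The gymnasium-style `env` interface and the `features(s, a)` function are assumptions of this sketch, not part of the lecture:

```python
import numpy as np

def reinforce(env, features, n_actions, n_features,
              alpha=0.01, gamma=1.0, n_episodes=1000, seed=0):
    """REINFORCE: Monte-Carlo policy gradient with a linear softmax policy."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)

    def policy(s):
        phi = np.array([features(s, a) for a in range(n_actions)])  # (A, F)
        logits = phi @ theta
        p = np.exp(logits - logits.max()); p /= p.sum()
        return phi, p

    for _ in range(n_episodes):
        # Generate one episode {s_1, a_1, r_2, ..., s_{T-1}, a_{T-1}, r_T} under pi_theta
        s, _ = env.reset()
        trajectory, done = [], False
        while not done:
            phi, p = policy(s)
            a = rng.choice(n_actions, p=p)
            s, r, terminated, truncated, _ = env.step(a)
            trajectory.append((phi, p, a, r))
            done = terminated or truncated

        # Walk backwards to accumulate returns G_t, then apply the score-function update
        G = 0.0
        for phi, p, a, r in reversed(trajectory):
            G = r + gamma * G                        # return G_t
            score = phi[a] - p @ phi                 # grad_theta log pi_theta(s_t, a_t)
            theta += alpha * score * G
    return theta
```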

Actor-Critic Policy Gradient

Reducing Variance Using a Critic

  • Use a critic (a value function approximator) to estimate the action-value function:
    • $Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$
      i.e. the basic policy gradient approach described earlier + action-value function approximation
    • $\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta\log\pi_\theta(s,a)Q_w(s,a) \right]$
    • $\Delta\theta = \alpha\nabla_\theta\log\pi_\theta(s,a)Q_w(s,a)$
  • Now there are two sets of parameters to update!
    Actor: does something ... updates the policy parameters $\theta$ in the direction suggested by the critic
    Critic: evaluates the actor ... updates the action-value function parameters $w$
  • The critic is solving a familiar problem: policy evaluation (or prediction)
    c.f., Lec4 and Lec6
  • Simple actor-critic algorithm with a linear value function approximator,
    $Q_w(s,a)=\phi(s,a)^\top w$ (a sketch of one update step is given after this list)

    Image from: here

    • Actor-Critic is also a form of generalized policy iteration (Lec5):
      • Start off with a random policy
      • "Evaluate" it using the critic
      • "Improve" it using the policy gradient instead of a greedy step

Compatible Function Approximation

  • Approximating the policy gradient introduces bias.
    To avoid this bias, we need to choose a special type of value function approximator.

Theorem (Compatible Function Approximation Theorem)
If the following two conditions are satisfied:

  1. The value function approximator is compatible with the policy
    $\nabla_w Q_w(s,a)=\nabla_\theta\log\pi_\theta(s,a)$
  2. The value function parameters $w$ minimize the mean squared error
    $\varepsilon = \mathbb{E}_{\pi_\theta}\left[ (Q^{\pi_\theta}(s,a)-Q_w(s,a))^2 \right]$

Then the policy gradient is exact:
$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[ \nabla_\theta\log\pi_\theta(s,a)Q_w(s,a) \right]$

  • Proof omitted!

Advantage Function Critic

  • Recall:
    • One-step MDP policy gradient:
      $\theta \gets \theta + \alpha\nabla_\theta J(\theta)$,
      where $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(s,a) r \right]$
    • Multi-step MDP
      $Q^{\pi_\theta}$ in place of $r$
    • Actor-Critic
      $Q_w$ in place of $Q^{\pi_\theta}$
    • Problem: the variance is still large!
  • Reducing variance using a baseline
    • Subtract a baseline function $B(s)$ from the policy gradient
      Note: $B(s)$ is a function of $s$ only
      This reduces variance without changing the expectation!
    • Proof)
      $$\begin{aligned}\mathbb{E}_{\pi_\theta}\left[ \nabla_\theta\log\pi_\theta(s,a)B(s) \right]&=\sum\limits_{s\in\mathcal{S}}d^{\pi_\theta}(s)\sum\limits_{a}\nabla_\theta\pi_\theta(s,a)B(s)\\&=\sum\limits_{s\in\mathcal{S}}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum\limits_{a}\pi_\theta(s,a)\\&=\sum\limits_{s\in\mathcal{S}}d^{\pi_\theta}(s)B(s)\nabla_\theta1\\&=0\end{aligned}$$
      Note: subtracting the baseline leaves the expectation of the policy gradient unchanged!
    • A good baseline is the state value function,
      $B(s)=V^{\pi_\theta}(s)$
    • So we can rewrite the policy gradient using the Advantage Function,
      $A^{\pi_\theta}(s,a)=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)$
      $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta\log\pi_\theta(s,a)A^{\pi_\theta}(s,a) \right]$
      Note: $A^{\pi_\theta}(s,a)$ ... how much better than usual is it to take action $a$?
      In other words, the actor updates $\theta$ using $A^{\pi_\theta}$ instead of $Q^{\pi_\theta}$.
  • Estimating the Advantage Function (see the sketch after this list)
    • The critic should estimate both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s,a)$
    • Using two function approximators and two parameter vectors,
      $V_v(s) \approx V^{\pi_\theta}(s)$
      $Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$
      $A_{v,w}(s,a)=Q_w(s,a)-V_v(s)$
    • And update both value functions by e.g. TD learning
    • For the true value function $V^{\pi_\theta}(s)$, the TD error $\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$
      is an unbiased estimate of the advantage function:
      $$\begin{aligned}\mathbb{E}_{\pi_\theta}\left[ \delta^{\pi_\theta}|s,a \right] &= \mathbb{E}_{\pi_\theta}\left[ r+\gamma V^{\pi_\theta}(s')|s,a \right] - V^{\pi_\theta}(s)\\&=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)\\&=A^{\pi_\theta}(s,a)\end{aligned}$$
      So we can use the TD error to compute the policy gradient!
    • In practice we can use an approximate TD error
      $\delta_{v} = r + \gamma V_v(s') - V_v(s)$
      Note: we only need to estimate $V$! That is, the critic only needs to update $v$ (the parameters of $V_v$).
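A minimal sketch of this practical version: the approximate TD error $\delta_v$ stands in for the advantage, so the critic only learns the state-value parameters $v$. The linear $V_v(s)=x(s)^\top v$ and the softmax actor features are assumptions of this sketch.

```python
import numpy as np

def advantage_ac_step(theta, v, x_s, x_s_next, phi_s, a, r, done,
                      alpha_actor, alpha_critic, gamma=0.99):
    """One actor-critic step using the TD error as an estimate of the advantage.

    x_s, x_s_next: state feature vectors for V_v(s) = x(s)^T v.
    phi_s: (n_actions, n_features) action feature matrix for the softmax actor.
    """
    # Critic: approximate TD error delta_v = r + gamma * V_v(s') - V_v(s)
    v_s_next = 0.0 if done else x_s_next @ v
    delta = r + gamma * v_s_next - (x_s @ v)

    # Actor: score-function update weighted by the estimated advantage
    logits = phi_s @ theta
    p = np.exp(logits - logits.max()); p /= p.sum()
    score = phi_s[a] - p @ phi_s                   # grad_theta log pi_theta(s, a)

    theta = theta + alpha_actor * score * delta    # delta is the advantage estimate
    v = v + alpha_critic * delta * x_s             # TD(0) update of the critic
    return theta, v
```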

Eligibility Traces

  • The critic should estimate the value function $V_v$, and there are many target options at different time-scales (c.f., Lec6):
    • MC: the return $G_t$
      $\Delta{v}=\alpha(G_t-V_v(s))\nabla_v{V_v(s)}$
    • TD(0): the TD target $r+\gamma V_v(s')$
      $\Delta{v}=\alpha(r+\gamma V_v(s')-V_v(s))\nabla_v{V_v(s)}$
    • Forward-view TD($\lambda$): the $\lambda$-return $G_t^\lambda$
      $\Delta{v}=\alpha(G_t^{\lambda}-V_v(s))\nabla_vV_v(s)$
    • Backward-view TD($\lambda$):
      $\delta_t=r_{t+1}+\gamma V_v(s_{t+1})-V_v(s_t)$
      $E_t=\gamma\lambda E_{t-1}+\nabla_vV_v(s_t)$
      $\Delta{v}=\alpha\delta_t E_t$
  • The actor can also estimate the policy gradient at many time-scales:
    (Recall, $\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[ \nabla_\theta\log\pi_\theta(s,a)A^{\pi_\theta}(s,a) \right]$)
    • MC policy gradient uses the error from the complete return
      $\Delta\theta = \alpha\nabla_\theta\log\pi_\theta(s_t,a_t)(G_t-V_v(s_t))$
    • Actor-critic policy gradient uses the one-step TD error
      $\Delta\theta = \alpha\nabla_\theta\log\pi_\theta(s_t,a_t)(r+\gamma V_v (s_{t+1})-V_v(s_t))$
    • Just like forward-view TD($\lambda$), we can mix over time-scales
      $\Delta\theta=\alpha\nabla_\theta\log\pi_\theta(s_t,a_t)(G_t^\lambda-V_v(s_t))$
      Note: $G_t^\lambda-V_v(s_t)$ is a biased estimate of the advantage function
    • Like backward-view TD($\lambda$), we can also use eligibility traces (see the sketch below)
      $\delta_t=r_{t+1}+\gamma V_v(s_{t+1})-V_v(s_t)$
      $E_t=\lambda E_{t-1}+\nabla_\theta\log\pi_\theta(s_t,a_t)$
      $\Delta\theta=\alpha\delta_t E_t$
      Note: $\nabla_\theta\log\pi_\theta(s,a)$ ... the eligibility is assigned to the components of $\theta$ that the score function says are responsible.
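A sketch of the backward-view TD($\lambda$) actor-critic step, keeping one eligibility trace built from $\nabla_v V_v$ for the critic and one built from the score function for the actor, following the update rules above (the linear/softmax parameterizations are again my own assumptions):

```python
import numpy as np

def td_lambda_ac_step(theta, v, e_theta, e_v, x_s, x_s_next, phi_s, a, r,
                      alpha_actor, alpha_critic, gamma=0.99, lam=0.9):
    """One backward-view TD(lambda) actor-critic step (linear V_v, softmax actor)."""
    # TD error: delta_t = r_{t+1} + gamma * V_v(s_{t+1}) - V_v(s_t)
    delta = r + gamma * (x_s_next @ v) - (x_s @ v)

    # Critic trace: E_t = gamma * lambda * E_{t-1} + grad_v V_v(s_t), with grad_v V_v = x(s_t)
    e_v = gamma * lam * e_v + x_s
    v = v + alpha_critic * delta * e_v

    # Actor trace: E_t = lambda * E_{t-1} + grad_theta log pi_theta(s_t, a_t)
    logits = phi_s @ theta
    p = np.exp(logits - logits.max()); p /= p.sum()
    score = phi_s[a] - p @ phi_s
    e_theta = lam * e_theta + score                # no gamma here, as in the rule above
    theta = theta + alpha_actor * delta * e_theta

    return theta, v, e_theta, e_v
```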

Others

Deterministic Policy Gradient

  • So far we have estimated the policy gradient by sampling.
    (Taking expectations over our noisy stochastic policies)
  • But this is a bad idea for a Gaussian policy!
    It becomes hard to estimate the true policy gradient,
    $\because$ the variance of the estimates increases as we approach the optimal policy.
    (The updates become more and more dominated by the noise.)
  • In continuous action spaces, a deterministic approach often works better!
    See the DPG paper for details.

Value-based vs. Policy-based

  • Do they both guarantee the global optimum?
    • Value-based with a table look-up,
      and policy-based with a softmax parameterization for each state
      => both guarantee the global optimum
  • With a more general function approximator such as a neural network
    => neither value-based nor policy-based methods guarantee it

Summary of Policy Gradient Algorithms

  • The policy gradient has many equivalent forms,
    Image from: here
    all of which point in the same direction (unbiased?) but differ in variance.
  • The actor updates $\theta$ by a stochastic gradient ascent algorithm.
  • The critic estimates $Q^\pi(s,a)$, $A^\pi(s,a)$, or $V^\pi(s)$ using the policy evaluation methods from the earlier lectures.

If you spot a typo or an error, please let me know in the comments!
