Silver RL (7) Policy Gradient

Sanghyeok Choi · January 16, 2022

These notes summarize Lecture 7: Policy Gradient (Youtube) of Professor David Silver's Introduction to Reinforcement Learning (Website).

Introduction

Policy-Based Reinforcement Learning

  • So far, we computed a (state or action) value function and derived the policy from that value function.
    (e.g. $\epsilon$-greedy)
  • However, depending on the problem, the value function can have a more complex structure than the policy itself. (e.g. the Atari game Breakout)
  • In this lecture, we will directly parametrise the policy
    $\pi_\theta(s,a)=\mathbb{P}[a|s,\theta]$
  • We assume a model-free RL setting.

Value-Based and Policy-Based RL

  • Value-Based
    • Learn a value function
    • Implicit policy (e.g. $\epsilon$-greedy)
  • Policy-Based
    • No value function
    • Learn the policy directly
  • Actor-Critic
    • Learn a value function
    • Learn the policy

Image from: here

Advantages of Policy-Based RL

  • Advantages
    • Better convergence properties
    • Effective in high-dimensional or continuous action spaces
      • Value-based methods need an $\argmax\limits_{a} Q(s,a)$ operation, which is expensive
    • Can learn stochastic policies
  • Disadvantages
    • Typically converge to a local rather than global optimum
      • Because we follow the gradient!
    • Evaluating a policy is typically inefficient and has high variance
      • Also a consequence of using gradients: smooth updates => stable but inefficient

Example: Aliased Gridworld

Image from: here

  • The agent cannot differentiate the gray states
    • If states are defined only by features of the immediate surroundings, the two gray squares cannot be distinguished from each other: both have empty cells on either side. (Partially observable MDP, POMDP)
  • An optimal deterministic policy will either
    1) Move left in gray states
    2) Move right in gray states
    • Either way, it can get stuck and never reach the money
    • Even an $\epsilon$-greedy policy is near-deterministic, so it is still undesirable.
      => A stochastic policy is better suited here!
  • An optimal stochastic policy will randomly move left or right in gray states
    $\pi_\theta(gray\text{-}state, \to)=0.5$
    $\pi_\theta(gray\text{-}state, \gets)=0.5$
    Image from: here
    • It will reach the goal state in a few steps with high probability
    • Policy-based RL can learn the optimal stochastic policy

Policy Objective Functions

  • Goal: given a policy $\pi_\theta(s,a)$ with parameters $\theta$, find the best $\theta$
  • Three options to measure the quality of $\pi_\theta$
    1. When the start state is fixed
      $J_1(\theta)=V^{\pi_\theta}(s_1)=\mathbb{E}_{\pi_\theta}[v_1]$
      Intuition: maximize the value obtained onward from $s_1$
      (The same idea works when a distribution over start states is given.)
    2. Continuing environments, average value
      $J_{avV}(\theta)=\sum\limits_{s}d^{\pi_\theta}(s)V^{\pi_\theta}(s)$
      Here, $d^{\pi_\theta}(s)$ is the (stationary) probability of being in state $s$ under policy $\pi_\theta$,
      and $V^{\pi_\theta}(s)$ is the value from that state onward.
    3. Continuing environments, average reward per time-step
      $J_{avR}(\theta)=\sum\limits_{s}d^{\pi_\theta}(s)\left[\sum\limits_a\pi_\theta(s,a)\mathcal{R}^a_s\right]$
      Here, $\sum\limits_a\pi_\theta(s,a)\mathcal{R}^a_s$ is the average immediate reward.
      Intuition: maximize the immediate reward at every time step

Policy Optimization

  • Find $\theta$ that maximizes $J(\theta)$
  • Some approaches do not use the gradient
    e.g. Hill climbing / Simplex / amoeba / Nelder-Mead / Genetic algorithms
  • Greater efficiency is often possible using the gradient
    e.g. Gradient descent / Conjugate gradient / Quasi-Newton
    Since we are maximizing, it is really gradient ascent!

Finite Difference Policy Gradient

Note: the policy $\pi_\theta$ is determined by the parameter $\theta$, and $J(\theta)$ is a policy objective function to maximize.

Computing Gradients by Finite Difference

  • To evaluate the policy gradient of $\pi_\theta(s,a)$
  • For each dimension $k\in[1,n]$
    • Estimate the $k$-th partial derivative of the objective function w.r.t. $\theta$ by perturbing $\theta$ by a small amount $\epsilon$ in the $k$-th dimension:
      $\cfrac{\partial J(\theta)}{\partial\theta_k}\approx\cfrac{J(\theta+\epsilon u_k)-J(\theta)}{\epsilon}$
      where $u_k$ is the unit vector with 1 in the $k$-th component and 0 elsewhere
    • Update $\theta$
      $\theta_k:=\theta_k+\alpha\cfrac{\partial J(\theta)}{\partial\theta_k}$
  • Uses $n$ evaluations to compute the policy gradient in $n$ dimensions
  • Simple, noisy, inefficient, but sometimes effective (a minimal code sketch follows below)
    Why inefficient? For high-dimensional problems this approach breaks down!
    Works for arbitrary policies, even if $J$ is not differentiable!
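Below is a minimal sketch of this estimator, assuming a hypothetical `evaluate_policy(theta)` function (not part of the lecture) that returns a noisy estimate of $J(\theta)$, e.g. the average return of a few rollouts under $\pi_\theta$:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate dJ/dtheta_k by perturbing theta by eps in each dimension k.

    evaluate_policy: callable theta -> (noisy) estimate of J(theta),
    e.g. the mean return of a few episodes run under pi_theta.
    """
    grad = np.zeros_like(theta, dtype=float)
    j_base = evaluate_policy(theta)             # J(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta, dtype=float)
        u_k[k] = 1.0                            # unit vector in the k-th dimension
        grad[k] = (evaluate_policy(theta + eps * u_k) - j_base) / eps
    return grad                                 # n evaluations for n dimensions

# gradient ascent: theta <- theta + alpha * finite_difference_gradient(J_hat, theta)
```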

Monte-Carlo Policy Gradient

Now compute the policy gradient analytically.
Assumptions:
1) $\pi_\theta$ is differentiable whenever it is non-zero,
2) we know the gradient $\nabla_\theta\pi_\theta(s,a)$

Likelihood Ratios and Score Function

  • Likelihood ratios:
    $\nabla_\theta\pi_\theta(s,a)=\pi_\theta(s,a)\cfrac{\nabla_\theta\pi_\theta(s,a)}{\pi_\theta(s,a)}=\pi_\theta(s,a)\nabla_\theta\log\pi_\theta(s,a)$
  • Here, $\nabla_\theta\log\pi_\theta(s,a)$ is called the score function
    Note: since $\pi_\theta\geq0$, updating $\theta$ in the direction of the score function increases $\pi_\theta(s,a)$, i.e. the probability of taking action $a$ in state $s$ goes up; updating in the opposite direction decreases it.
    Note 2: score = gradient of the log-likelihood. It indicates the sensitivity of the likelihood.

Softmax Policy and Gaussian Policy

  • Softmax Policy:
    • $\pi_\theta(s,a) = \cfrac{e^{\phi(s,a)^{\top}\theta}}{\sum\limits_{a'}e^{\phi(s,a')^\top\theta}}$
      where $\phi(s,a)$ is an action feature vector
      (the probability of an action is proportional to its exponentiated weight)
    • The softmax score function is
      $$\begin{aligned}\nabla_\theta\log\pi_\theta(s,a)&=\nabla_\theta\left[\log{e^{\phi(s,a)^{\top}\theta}} - \log\left(\sum\limits_{a'}e^{\phi(s,a')^{\top}\theta}\right)\right]\\&=\phi(s,a)-\sum\limits_{a'}\phi(s,a')\cfrac{e^{\phi(s,a')^\top\theta}}{\sum\limits_{a''}e^{\phi(s,a'')^\top\theta}}\\&=\phi(s,a)-\mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]\end{aligned}$$
      Note: $\phi(s,a)$ is the feature vector of the action we actually took.
      Note 2: $\mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]$ is the average feature vector over all the actions we might have taken.
      Recall: updating in the direction of the score function (e.g. when the reward is $\geq0$) increases $\pi_\theta(s,a)$, i.e. $\theta$ is updated to reinforce the action $a$ that we took.
  • Gaussian Policy:
    • The policy is Gaussian, $a\sim\mathcal{N}(\mu(s),\sigma^2)$
      where $\mu(s)=\phi(s)^\top\theta$ and $\sigma^2$ is fixed (it can also be parametrized)
    • The Gaussian score function is
      $\nabla_\theta\log{\pi_\theta(s,a)}=\cfrac{(a-\mu(s))\phi(s)}{\sigma^2}$
      Note: taking the log of the Gaussian PDF, the terms without $\theta$ are constants, so differentiating with respect to $\theta$ gives the expression above. (Both score functions are sketched in code below.)
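As a concrete (toy) illustration, here is a small numpy sketch of both score functions; the stacked feature matrix `phi_s` and the scalar-action Gaussian case are my own simplifying assumptions, not part of the lecture:

```python
import numpy as np

def softmax_score(phi_s, a, theta):
    """grad_theta log pi_theta(s, a) for a linear softmax policy.

    phi_s: (n_actions, n_features) matrix whose rows are phi(s, a') for each action a'.
    Returns phi(s, a) - E_pi[phi(s, .)].
    """
    logits = phi_s @ theta
    p = np.exp(logits - logits.max())           # subtract max for numerical stability
    p /= p.sum()
    return phi_s[a] - p @ phi_s                 # phi(s,a) - sum_a' pi(a'|s) phi(s,a')

def gaussian_score(phi_state, a, theta, sigma=1.0):
    """grad_theta log pi_theta(s, a) for a Gaussian policy with mean phi(s)^T theta."""
    mu = phi_state @ theta                      # phi_state: state feature vector phi(s)
    return (a - mu) * phi_state / sigma**2
```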

Policy Gradient Theorem

  • One-Step MDPs:
    Starting in state $s \sim d(s)$ and
    terminating after one time-step with reward $r=\mathcal{R}_{s,a}$
    • Compute the policy gradient
      $J(\theta)=\mathbb{E}_{\pi_\theta}[r]=\sum\limits_{s\in\mathcal{S}}d(s)\sum\limits_{a\in\mathcal{A}}\pi_\theta(s,a)\mathcal{R}_{s,a}$ ... the expected reward (value) from the start state (one step!)
      Writing $f_\theta(a|s)=\nabla_\theta\log\pi_\theta(s,a)\mathcal{R}_{s,a}$,
      $$\begin{aligned}\nabla_\theta J(\theta)&=\sum\limits_{s\in\mathcal{S}}d(s)\sum\limits_{a\in\mathcal{A}}\pi_\theta(s,a)\nabla_\theta\log{\pi_\theta(s,a)}\mathcal{R}_{s,a}\\&=\sum\limits_{s\in\mathcal{S}}d(s)\sum\limits_{a\in\mathcal{A}}\pi_\theta(s,a)f_\theta(a|s)\\&=\sum\limits_{s\in\mathcal{S}}d(s)\mathbb{E}_{a}[f_\theta(a|s)]\\&=\mathbb{E}_s\left[\mathbb{E}_{a}[f_\theta(a|s)]\right]\\&=\mathbb{E}_s\left[\mathbb{E}_{a}[\nabla_\theta\log\pi_\theta(s,a)\,r]\right]\\&=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log\pi_\theta(s,a)\,r\right]\end{aligned}$$
      Note: $\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log\pi_\theta(s,a)\,r\right]$ points in the direction in which $\pi_\theta(s,a)\times \mathcal{R}_{s,a}$ grows most steeply (in expectation), so updating $\theta$ in this direction makes $\pi_\theta$ place more probability on actions that yield large rewards. (A numerical check of this one-step identity is sketched after this list.)
  • Policy Gradient Theorem
    • Generalizes the one-step-MDP result to multi-step MDPs
    • Replaces the instantaneous reward $r$ with the long-term value $Q^\pi(s,a)$

      Theorem
      For any differentiable policy $\pi_\theta(s,a)$,
      for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma}J_{avV}$,
      the policy gradient is
      $\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s,a)\,Q^{\pi_\theta}(s,a)]$

      • Proof omitted!
    • In many cases the exact value of $Q^{\pi_\theta}(s,a)$ is hard to obtain.
      -> Use an estimate! (MC/TD)
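To make the one-step result concrete, here is a toy numerical check (my own example, not from the slides): for a single-state bandit with a tabular softmax policy, the sample average of $\nabla_\theta\log\pi_\theta(a)\,r$ matches the analytic gradient of $J(\theta)=\sum_a\pi_\theta(a)\mathcal{R}_a$.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([1.0, 0.0, 2.0])                 # R_{s,a} for one state, three actions
theta = rng.normal(size=3)                    # tabular softmax: one parameter per action

p = np.exp(theta - theta.max()); p /= p.sum() # pi_theta(a)

# Analytic gradient: dJ/dtheta_k = pi(k) * (R_k - sum_a pi(a) R_a)
analytic = p * (R - p @ R)

# Likelihood-ratio estimate: average of grad_theta log pi(a) * r over sampled actions
a = rng.choice(3, size=200_000, p=p)
score = np.eye(3)[a] - p                      # grad_theta log pi(a) = one_hot(a) - pi
estimate = (score * R[a, None]).mean(axis=0)

print(analytic, estimate)                     # the two vectors should be close
```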

Monte-Carlo Policy Gradient (REINFORCE)

  • Update parameters by stochastic gradient ascent,
    using the policy gradient theorem,
    using the return $G_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t,a_t)$:
    $\Delta\theta_t=\alpha\nabla_\theta\log\pi_\theta(s_t,a_t)G_t$
  • REINFORCE Algorithm:

    initialize $\theta$ arbitrarily
    for each episode $\{s_1,a_1,r_2,...,s_{T-1},a_{T-1},r_T\}\sim\pi_\theta$ do
         for $t=1$ to $T-1$ do
             $\theta\gets\theta+\alpha\nabla_\theta\log\pi_\theta(s_t,a_t)G_t$
         end for
    end for
    return $\theta$

  • Because REINFORCE is based on MC, it is very slow ($\because$ high variance).
    Actor-Critic improves efficiency by also using a value function approximator. (A minimal Python sketch of the REINFORCE loop is given below.)
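The following is a minimal Python sketch of the REINFORCE loop above for a linear softmax policy. The gymnasium-style `env` interface and the `features(s, a)` function are assumptions of this sketch, not part of the lecture:

```python
import numpy as np

def reinforce(env, features, n_actions, n_features,
              alpha=0.01, gamma=1.0, n_episodes=1000, seed=0):
    """REINFORCE: Monte-Carlo policy gradient with a linear softmax policy."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)

    def policy(s):
        phi = np.array([features(s, a) for a in range(n_actions)])  # (A, F)
        logits = phi @ theta
        p = np.exp(logits - logits.max()); p /= p.sum()
        return phi, p

    for _ in range(n_episodes):
        # Generate one episode {s_1, a_1, r_2, ..., s_{T-1}, a_{T-1}, r_T} under pi_theta
        s, _ = env.reset()
        trajectory, done = [], False
        while not done:
            phi, p = policy(s)
            a = rng.choice(n_actions, p=p)
            s, r, terminated, truncated, _ = env.step(a)
            trajectory.append((phi, p, a, r))
            done = terminated or truncated

        # Walk backwards to accumulate returns G_t, then apply the score-function update
        G = 0.0
        for phi, p, a, r in reversed(trajectory):
            G = r + gamma * G                        # return G_t
            score = phi[a] - p @ phi                 # grad_theta log pi_theta(s_t, a_t)
            theta += alpha * score * G
    return theta
```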

Actor-Critic Policy Gradient

Reducing Variance Using a Critic

  • Use a critic (a value function approximator) to estimate the action-value function:
    • $Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$
      i.e. the basic policy gradient approach described earlier + action-value function approximation
    • $\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta\log\pi_\theta(s,a)Q_w(s,a) \right]$
    • $\Delta\theta = \alpha\nabla_\theta\log\pi_\theta(s,a)Q_w(s,a)$
  • Now there are two sets of parameters to update!
    Actor: does something ... updates the policy parameters $\theta$ in the direction suggested by the critic
    Critic: evaluates the actor ... updates the action-value function parameters $w$
  • The critic is solving a familiar problem: policy evaluation (or prediction)
    c.f., Lec4 and Lec6
  • Simple actor-critic algorithm with a linear value function approximator,
    $Q_w(s,a)=\phi(s,a)^\top w$ (a sketch of one update step is given after this list)

    Image from: here

    • Actor-Critic is also a form of generalized policy iteration (Lec5):
      • Start off with a random policy
      • "Evaluate" it using the critic
      • "Improve" it using the policy gradient instead of a greedy step

Compatible Function Approximation

  • Approximating the policy gradient introduces bias.
    To avoid this bias, we need to choose a special type of value function approximator.

Theorem (Compatible Function Approximation Theorem)
If the following two conditions are satisfied:

  1. The value function approximator is compatible with the policy
    $\nabla_w Q_w(s,a)=\nabla_\theta\log\pi_\theta(s,a)$
  2. The value function parameters $w$ minimize the mean squared error
    $\varepsilon = \mathbb{E}_{\pi_\theta}\left[ (Q^{\pi_\theta}(s,a)-Q_w(s,a))^2 \right]$

Then the policy gradient is exact:
$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[ \nabla_\theta\log\pi_\theta(s,a)Q_w(s,a) \right]$

  • Proof omitted!

Advantage Function Critic

  • Recall:
    • One-step MDP policy gradient:
      $\theta \gets \theta + \alpha\nabla_\theta J(\theta)$,
      where $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(s,a) r \right]$
    • Multi-step MDP
      $Q^{\pi_\theta}$ in place of $r$
    • Actor-Critic
      $Q_w$ in place of $Q^{\pi_\theta}$
    • Problem: the variance is still large!
  • Reducing variance using a baseline
    • Subtract a baseline function $B(s)$ from the policy gradient
      Note: $B(s)$ is a function of $s$ only
      This reduces variance without changing the expectation!
    • Proof)
      $$\begin{aligned}\mathbb{E}_{\pi_\theta}\left[ \nabla_\theta\log\pi_\theta(s,a)B(s) \right]&=\sum\limits_{s\in\mathcal{S}}d^{\pi_\theta}(s)\sum\limits_{a}\nabla_\theta\pi_\theta(s,a)B(s)\\&=\sum\limits_{s\in\mathcal{S}}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum\limits_{a}\pi_\theta(s,a)\\&=\sum\limits_{s\in\mathcal{S}}d^{\pi_\theta}(s)B(s)\nabla_\theta1\\&=0\end{aligned}$$
      Note: subtracting the baseline leaves the expectation of the policy gradient unchanged!
    • A good baseline is the state value function,
      $B(s)=V^{\pi_\theta}(s)$
    • So we can rewrite the policy gradient using the Advantage Function,
      $A^{\pi_\theta}(s,a)=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)$
      $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta\log\pi_\theta(s,a)A^{\pi_\theta}(s,a) \right]$
      Note: $A^{\pi_\theta}(s,a)$ ... how much better than usual is it to take action $a$?
      In other words, the actor updates $\theta$ using $A^{\pi_\theta}$ instead of $Q^{\pi_\theta}$.
  • Estimating the Advantage Function (see the sketch after this list)
    • The critic should estimate both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s,a)$
    • Using two function approximators and two parameter vectors,
      $V_v(s) \approx V^{\pi_\theta}(s)$
      $Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$
      $A_{v,w}(s,a)=Q_w(s,a)-V_v(s)$
    • And update both value functions by e.g. TD learning
    • For the true value function $V^{\pi_\theta}(s)$, the TD error $\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$
      is an unbiased estimate of the advantage function:
      $$\begin{aligned}\mathbb{E}_{\pi_\theta}\left[ \delta^{\pi_\theta}|s,a \right] &= \mathbb{E}_{\pi_\theta}\left[ r+\gamma V^{\pi_\theta}(s')|s,a \right] - V^{\pi_\theta}(s)\\&=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)\\&=A^{\pi_\theta}(s,a)\end{aligned}$$
      So we can use the TD error to compute the policy gradient!
    • In practice we can use an approximate TD error
      $\delta_{v} = r + \gamma V_v(s') - V_v(s)$
      Note: we only need to estimate $V$! That is, the critic only needs to update $v$ (the parameters of $V_v$).
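A minimal sketch of this practical version: the approximate TD error $\delta_v$ stands in for the advantage, so the critic only learns the state-value parameters $v$. The linear $V_v(s)=x(s)^\top v$ and the softmax actor features are assumptions of this sketch.

```python
import numpy as np

def advantage_ac_step(theta, v, x_s, x_s_next, phi_s, a, r, done,
                      alpha_actor, alpha_critic, gamma=0.99):
    """One actor-critic step using the TD error as an estimate of the advantage.

    x_s, x_s_next: state feature vectors for V_v(s) = x(s)^T v.
    phi_s: (n_actions, n_features) action feature matrix for the softmax actor.
    """
    # Critic: approximate TD error delta_v = r + gamma * V_v(s') - V_v(s)
    v_s_next = 0.0 if done else x_s_next @ v
    delta = r + gamma * v_s_next - (x_s @ v)

    # Actor: score-function update weighted by the estimated advantage
    logits = phi_s @ theta
    p = np.exp(logits - logits.max()); p /= p.sum()
    score = phi_s[a] - p @ phi_s                   # grad_theta log pi_theta(s, a)

    theta = theta + alpha_actor * score * delta    # delta is the advantage estimate
    v = v + alpha_critic * delta * x_s             # TD(0) update of the critic
    return theta, v
```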

Eligibility Traces

  • The critic should estimate the value function $V_v$, and there are many target options at different time-scales (c.f., Lec6):
    • MC: the return $G_t$
      $\Delta{v}=\alpha(G_t-V_v(s))\nabla_v{V_v(s)}$
    • TD(0): the TD target $r+\gamma V_v(s')$
      $\Delta{v}=\alpha(r+\gamma V_v(s')-V_v(s))\nabla_v{V_v(s)}$
    • Forward-view TD($\lambda$): the $\lambda$-return $G_t^\lambda$
      $\Delta{v}=\alpha(G_t^{\lambda}-V_v(s))\nabla_vV_v(s)$
    • Backward-view TD($\lambda$):
      $\delta_t=r_{t+1}+\gamma V_v(s_{t+1})-V_v(s_t)$
      $E_t=\gamma\lambda E_{t-1}+\nabla_vV_v(s_t)$
      $\Delta{v}=\alpha\delta_t E_t$
  • The actor can also estimate the policy gradient at many time-scales:
    (Recall, $\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[ \nabla_\theta\log\pi_\theta(s,a)A^{\pi_\theta}(s,a) \right]$)
    • MC policy gradient uses the error from the complete return
      $\Delta\theta = \alpha\nabla_\theta\log\pi_\theta(s_t,a_t)(G_t-V_v(s_t))$
    • Actor-critic policy gradient uses the one-step TD error
      $\Delta\theta = \alpha\nabla_\theta\log\pi_\theta(s_t,a_t)(r+\gamma V_v (s_{t+1})-V_v(s_t))$
    • Just like forward-view TD($\lambda$), we can mix over time-scales
      $\Delta\theta=\alpha\nabla_\theta\log\pi_\theta(s_t,a_t)(G_t^\lambda-V_v(s_t))$
      Note: $G_t^\lambda-V_v(s_t)$ is a biased estimate of the advantage function
    • Like backward-view TD($\lambda$), we can also use eligibility traces (see the sketch below)
      $\delta_t=r_{t+1}+\gamma V_v(s_{t+1})-V_v(s_t)$
      $E_t=\lambda E_{t-1}+\nabla_\theta\log\pi_\theta(s_t,a_t)$
      $\Delta\theta=\alpha\delta_t E_t$
      Note: $\nabla_\theta\log\pi_\theta(s,a)$ ... the eligibility is assigned to the components of $\theta$ that the score function says are responsible.
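A sketch of the backward-view TD($\lambda$) actor-critic step, keeping one eligibility trace built from $\nabla_v V_v$ for the critic and one built from the score function for the actor, following the update rules above (the linear/softmax parameterizations are again my own assumptions):

```python
import numpy as np

def td_lambda_ac_step(theta, v, e_theta, e_v, x_s, x_s_next, phi_s, a, r,
                      alpha_actor, alpha_critic, gamma=0.99, lam=0.9):
    """One backward-view TD(lambda) actor-critic step (linear V_v, softmax actor)."""
    # TD error: delta_t = r_{t+1} + gamma * V_v(s_{t+1}) - V_v(s_t)
    delta = r + gamma * (x_s_next @ v) - (x_s @ v)

    # Critic trace: E_t = gamma * lambda * E_{t-1} + grad_v V_v(s_t), with grad_v V_v = x(s_t)
    e_v = gamma * lam * e_v + x_s
    v = v + alpha_critic * delta * e_v

    # Actor trace: E_t = lambda * E_{t-1} + grad_theta log pi_theta(s_t, a_t)
    logits = phi_s @ theta
    p = np.exp(logits - logits.max()); p /= p.sum()
    score = phi_s[a] - p @ phi_s
    e_theta = lam * e_theta + score                # no gamma here, as in the rule above
    theta = theta + alpha_actor * delta * e_theta

    return theta, v, e_theta, e_v
```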

Others

Deterministic Policy Gradient

  • So far we have estimated the policy gradient by sampling.
    (Taking expectations over our noisy stochastic policies)
  • But this is a bad idea for a Gaussian policy!
    It becomes hard to estimate the true policy gradient,
    $\because$ the variance of the estimates increases as we approach the optimal policy.
    (The updates become more and more dominated by the noise.)
  • In continuous action spaces, a deterministic approach often works better!
    See the DPG paper for details.

Value-based vs. Policy-based

  • Do they both guarantee the global optimum?
    • Value-based with a table look-up,
      and policy-based with a softmax parameterization for each state
      => both guarantee the global optimum
  • With a more general function approximator such as a neural network
    => neither value-based nor policy-based methods guarantee it

Summary of Policy Gradient Algorithms

  • The policy gradient has many equivalent forms,
    Image from: here
    all of which point in the same direction (unbiased?) but differ in variance.
  • The actor updates $\theta$ by a stochastic gradient ascent algorithm.
  • The critic estimates $Q^\pi(s,a)$, $A^\pi(s,a)$, or $V^\pi(s)$ using the policy evaluation methods from the earlier lectures.

If you spot a typo or an error, please let me know in the comments!
