Silver RL (6) Value Function Approximation

Sanghyeok Choi · January 2, 2022

Intro_to_RL


Notes on Professor David Silver's Introduction to Reinforcement Learning (Website),
Lecture 6: Value Function Approximation (Youtube).

Introduction

Large-Scale Reinforcement Learning

RL can be used to solve large problems, e.g.
- Backgammon: $10^{20}$ states
- Computer Go: $10^{170}$ states
- Helicopter: continuous state space
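So far, we have treated $Q(s,a)$ as a lookup table and updated each entry of the table.
However, when there are as many states as in the examples above, or when the state space is continuous, a table either cannot be built or is far too inefficient.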
How can we scale up the model-free methods for prediction and control from the last two lectures?
=> Value function approximation

Value Function Approximation

For large MDPs:
- Estimate value function with function approximation
  - $\hat{v}(s,\bold{w}) \approx v_\pi(s)$
  - $\hat{q}(s,a,\bold{w}) \approx q_\pi(s,a)$
- Generalize from seen states to unseen states
- Update parameter $\bold{w}$ using MC or TD learning
Function Approximators
- Linear combinations of features
- Neural network
- Decision tree
- Nearest neighbor
- ...
Note: We will only consider differentiable approximators, the "Linear combinations of features" and the "Neural network".

Incremental Methods

Gradient Descent

Let $J(\bold{w})$ be a differentiable function of parameter vector $\bold{w}$
Gradient of $J(\bold{w})$ is:
$\nabla_{\bold{w}}J(\bold{w}) = \begin{pmatrix} \cfrac{\partial{J(\bold{w})}}{\partial{w_1}}\\ \vdots\\ \cfrac{\partial{J(\bold{w})}}{\partial{w_n}} \end{pmatrix}$
To find a local minimum of $J(\bold{w})$ , adjust $\bold{w}$ in direction of $-\nabla_{\bold{w}}J(\bold{w})$
i.e., $\bold{w} \gets \bold{w} - \cfrac{1}{2}\alpha\nabla_{\bold{w}}J(\bold{w})$ , where $\alpha$ is a step-size parameter
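The update rule above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the quadratic objective, `target`, and the hyperparameters are made up for the example and are not from the lecture.

```python
import numpy as np

def gradient_descent(grad_J, w0, alpha=0.1, steps=200):
    """Repeatedly apply w <- w - (1/2) * alpha * grad_J(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - 0.5 * alpha * grad_J(w)
    return w

# Hypothetical objective: J(w) = ||w - target||^2, so grad J(w) = 2 (w - target).
target = np.array([1.0, -2.0])
w_star = gradient_descent(lambda w: 2.0 * (w - target), w0=[0.0, 0.0])
```

With this objective the factor $\frac{1}{2}$ cancels the 2 from the squared error, which is the same cancellation used in the value function updates that follow.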

Value Function Approx. By Stochastic Gradient Descent

Assume for now that we know $v_\pi(s)$, and let our value function approximator be $\hat{v}(s,\bold{w})$.
Let $J(\bold{w})=\mathbb{E}_\pi{[(v_\pi(S)-\hat{v}(S,\bold{w}))^2]}$
(the MSE between $v_\pi(s)$ and $\hat{v}(s,\bold{w})$; minimizing $J$ brings $\hat{v}$ closer to $v_\pi$.)
Then,
$\nabla_{\bold{w}}J(\bold{w})=-\mathbb{E}_\pi{[2\times(v_\pi(S)-\hat{v}(S,\bold{w}))\nabla_{\bold{w}}\hat{v}(S,\bold{w})]}$ ,
and,
$\Delta\bold{w}=-\cfrac{1}{2}\alpha\nabla_{\bold{w}}J(\bold{w})\\ \space\space\space\space\space\space\space\space=\alpha\mathbb{E}_\pi{[(v_\pi(S)-\hat{v}(S,\bold{w}))\nabla_{\bold{w}}\hat{v}(S,\bold{w})]}$
Stochastic gradient descent samples the gradient
$\Delta\bold{w}=\alpha(v_\pi(S)-\hat{v}(S,\bold{w}))\nabla_{\bold{w}}\hat{v}(S,\bold{w})$

Linear Value Function Approximation

Represent state by a feature vector
$\bold{x}(S)=\begin{pmatrix} x_1(S)\\ \vdots\\ x_n(S) \end{pmatrix}$
- e.g. the 3D coordinates of a robot (xyz distances from a reference point), or the prices of the stocks in a portfolio
Represent value function by a linear combination of features
$\hat{v}(S,\bold{w})=\bold{x}(S)^T\bold{w}=\sum\limits_{j=1}^{n}x_j(S)w_j$
Objective function is quadratic in parameters $\bold{w}$
$J(\bold{w})=\mathbb{E}_\pi{[(v_\pi(S)-\bold{x}(S)^T\bold{w})^2]}$
(We are still assuming that we know $v_\pi$!)
Stochastic gradient descent converges on global optimum
( $\because$ convex optimization)
$\nabla_\bold{w}\hat{v}(S,\bold{w})=\bold{x}(S)$
$\therefore \Delta\bold{w}=\alpha(v_\pi(S)-\hat{v}(S,\bold{w}))\nabla_{\bold{w}}\hat{v}(S,\bold{w})\\ \space\space\space\space\space\space\space\space\space\space\space\space=\alpha(v_\pi(S)-\hat{v}(S,\bold{w}))\bold{x}(S)$
Update = step-size $\times$ prediction error $\times$ feature value
Note: the size of the update is proportional to the feature value, i.e., the weights of features with larger values change more.
Note2: as $\bold{w}$ keeps being updated, $\hat{v}(s,\bold{w})=\bold{x}(s)^T\bold{w} \approx v_\pi(s)$
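As a minimal sketch of "update = step-size × prediction error × feature value", the loop below runs SGD with an oracle $v_\pi$. The feature matrix, `true_w`, and the uniform state sampling are assumptions made only so that the result can be checked; in this toy setup $v_\pi$ is exactly linear in the features.

```python
import numpy as np

def sgd_linear_vfa(features, v_pi, alpha=0.1, steps=5000, seed=0):
    """SGD on (v_pi(S) - x(S)^T w)^2 with states sampled uniformly."""
    rng = np.random.default_rng(seed)
    n_states, n_features = features.shape
    w = np.zeros(n_features)
    for _ in range(steps):
        s = rng.integers(n_states)      # sample a state
        x = features[s]                 # feature vector x(S)
        error = v_pi[s] - x @ w         # prediction error
        w += alpha * error * x          # step-size * error * feature value
    return w

# Hypothetical problem: 4 states, 2 features (bias and state index).
features = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
true_w = np.array([0.5, -1.0])
v_pi = features @ true_w                # v_pi is exactly representable here
w = sgd_linear_vfa(features, v_pi)
```

Because the oracle values are exactly representable, the prediction error vanishes at `true_w` and the iterates settle there; with noisy or non-representable targets a constant step size would only hover near the optimum.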

Table Lookup Features

Table lookup is a special case of linear value function approximation
$\bold{x}^{table}(S)=\begin{pmatrix} \bold{1}(S=s_1)\\ \vdots\\ \bold{1}(S=s_n) \end{pmatrix}$

$\hat{v}(S,\bold{w})=\begin{pmatrix}\bold{1}(S=s_1)\\ \vdots\\ \bold{1}(S=s_n) \end{pmatrix} \cdot \begin{pmatrix}w_1\\ \vdots\\ w_n\end{pmatrix}$

$\hat{v}(s_k,\bold{w})=w_k$
$\therefore \bold{w}$ converges to $\mathrm{v}_\pi$
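A small check of the claim $\hat{v}(s_k,\bold{w})=w_k$: with indicator (one-hot) features, the dot product simply picks out one weight. The helper name and the weight values are hypothetical.

```python
import numpy as np

def table_lookup_features(state_index, n_states):
    """Indicator feature vector: 1 at the current state, 0 elsewhere."""
    x = np.zeros(n_states)
    x[state_index] = 1.0
    return x

n_states = 5
w = np.arange(n_states, dtype=float) * 10.0   # hypothetical weights
values = [table_lookup_features(k, n_states) @ w for k in range(n_states)]

# One SGD step at state 2 only touches w[2], exactly like a table update.
x2 = table_lookup_features(2, n_states)
w_after = w + 0.5 * (100.0 - x2 @ w) * x2
```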

Incremental Prediction Algorithm

Now assume that we do not know $v_\pi(S)$ (the more general setting).
- No supervisor, but we have rewards!
In practice, we substitute a target for $v_\pi(s)$
- Before, we adjusted $\hat{v}(s,\bold{w})$ to move closer to $v_\pi(s)$.
  $J(\bold{w})=\mathbb{E}_\pi{[(v_\pi(S)-\hat{v}(S,\bold{w}))^2]}$
- Now that $v_\pi(s)$ is unavailable, we adjust $\hat{v}(s,\bold{w})$ to move closer to a target instead.
  => $J(\bold{w})=\mathbb{E}_\pi{[(target-\hat{v}(S,\bold{w}))^2]}$
  => $\Delta\bold{w}=-\cfrac{1}{2}\alpha\nabla_{\bold{w}}J(\bold{w})\\ \space\space\space\space\space\space\space\space\space\space\space\space\space\space=\alpha(target-\hat{v}(S,\bold{w}))\nabla_{\bold{w}}\hat{v}(S,\bold{w})$ ... (SGD)
  Note: the target is treated as a constant here! The target is the direction we want to improve towards, so even when it contains a $\bold{w}$ term, as in TD, it is not differentiated. If the TD target were differentiated as well, the reference point (the estimated value one step ahead) could no longer serve as a reference point, and learning would not work properly. (See the lecture video from 45:55.)
Targets
- MC: $G_t$
- TD(0): $R_{t+1}+\gamma{\hat{v}(S_{t+1},\bold{w})}$
- TD( $\lambda$ ): $G^\lambda_t$

MC with Value Function Approximation
- Return $G_t$ is an unbiased but noisy sample of $v_\pi(S_t)$
- Apply supervised learning to "training data":
  $\lang S_1,G_1 \rang, \lang S_2,G_2 \rang, ..., \lang S_T,G_T \rang$
- Linear MC policy evaluation
  (with $\hat{v}$ defined as the linear combination of features introduced above)
  $\Delta\bold{w}=\alpha(G_t-\hat{v}(S_t,\bold{w}))\bold{x}(S_t)$
- MC with linear val. func. guarantees convergence to global optimum
  MC with non-linear val. func. guarantees convergence to local optimum

TD(0) with Value Function Approximation
- The TD-target $R_{t+1}+\gamma \hat{v}(S_{t+1}, \bold{w})$ is a biased sample of true value $v_\pi(S_t)$
- Again, apply supervised learning to "training data":
  $\lang S_1, R_2+ \gamma \hat{v}(S_2,\bold{w}) \rang, \lang S_2, R_3+ \gamma \hat{v}(S_3,\bold{w}) \rang, ..., \lang S_{T-1}, R_T\rang$
- Linear TD(0) policy evaluation
  $\Delta\bold{w}=\alpha(R_{t+1}+\gamma\hat{v}(S_{t+1},\bold{w})-\hat{v}(S_t,\bold{w}))\bold{x}(S_t)$
- Linear TD(0) converges close to the global optimum, but is not guaranteed to reach it exactly
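The linear MC and TD(0) updates above differ only in the target. The sketch below applies both to a hypothetical deterministic 3-state chain (reward 1 per step, $\gamma=0.9$) with one-hot features, so the true values $(2.71, 1.9, 1.0)$ are known and both methods can be checked against them. The episode format `(state, reward, next_state)` with `None` marking termination is an assumption of this example.

```python
import numpy as np

def x(s):
    """One-hot features for a hypothetical 3-state chain (table lookup case)."""
    return np.eye(3)[s]

def linear_mc_episode(w, episode, alpha, gamma):
    """Linear MC: move v_hat(S_t) towards the return G_t."""
    G = 0.0
    for s, r, _ in reversed(episode):      # walk backwards to accumulate G_t
        G = r + gamma * G
        w = w + alpha * (G - x(s) @ w) * x(s)
    return w

def linear_td0_episode(w, episode, alpha, gamma):
    """Linear TD(0): move v_hat(S_t) towards R_{t+1} + gamma * v_hat(S_{t+1})."""
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else x(s_next) @ w
        target = r + gamma * v_next        # treated as a constant, not differentiated
        w = w + alpha * (target - x(s) @ w) * x(s)
    return w

episode = [(0, 1.0, 1), (1, 1.0, 2), (2, 1.0, None)]   # 0 -> 1 -> 2 -> terminal
w_mc, w_td = np.zeros(3), np.zeros(3)
for _ in range(500):
    w_mc = linear_mc_episode(w_mc, episode, alpha=0.1, gamma=0.9)
    w_td = linear_td0_episode(w_td, episode, alpha=0.1, gamma=0.9)
```

In this noise-free tabular case both estimates agree; the bias/variance difference between the two targets only shows up with stochastic rewards or transitions.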

TD( $\lambda$ ) with Value Function Approximation
- The $\lambda$ -return $G_t^\lambda$ is also a biased sample of true value $v_\pi(s)$
- Again, apply supervised learning to "training data":
  $\lang S_1,G_1^\lambda \rang, \lang S_2,G_2^\lambda \rang, ..., \lang S_T,G_T^\lambda \rang$
- Forward view linear TD( $\lambda$ )
  $\Delta\bold{w}=\alpha(G_t^\lambda-\hat{v}(S_t,\bold{w}))\bold{x}(S_t)$
- Backward view linear TD( $\lambda$ )
  $\delta_t=R_{t+1}+\gamma\hat{v}(S_{t+1},\bold{w})-\hat{v}(S_t,\bold{w})$
  $E_t=\gamma\lambda E_{t-1}+\bold{x}(S_t)$
  $\Delta\bold{w}=\alpha\delta_t E_t$
  Note: $\nabla_\bold{w}\hat{v}(S_t,\bold{w})=\bold{x}(S_t)$, i.e., each feature receives eligibility in proportion to how responsible it is for $\hat{v}$.
  Put differently: previously we increased the eligibility of the visited state $S_t$ by 1, whereas with linear FA each feature of $S_t$ receives eligibility equal to its value.
  Note2: because $E_t$ also carries weight on the features of previously visited states, the update propagates backwards (backward view)!
- Forward view and backward view linear TD( $\lambda$ ) are equivalent!
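The backward-view equations for $\delta_t$, $E_t$ and $\Delta\bold{w}$ translate almost line by line into code. This is a sketch on the same kind of hypothetical 3-state chain with one-hot features (reward 1 per step, $\gamma=0.9$); the episode format and hyperparameters are assumptions of the example.

```python
import numpy as np

def x(s):
    """One-hot features for a hypothetical 3-state chain."""
    return np.eye(3)[s]

def linear_td_lambda_episode(w, episode, alpha, gamma, lam):
    """Backward-view linear TD(lambda) with an accumulating eligibility trace."""
    E = np.zeros_like(w)                       # eligibility trace, reset per episode
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else x(s_next) @ w
        delta = r + gamma * v_next - x(s) @ w  # TD error delta_t
        E = gamma * lam * E + x(s)             # E_t = gamma*lambda*E_{t-1} + x(S_t)
        w = w + alpha * delta * E              # Delta w = alpha * delta_t * E_t
    return w

episode = [(0, 1.0, 1), (1, 1.0, 2), (2, 1.0, None)]   # 0 -> 1 -> 2 -> terminal

# lambda = 0 reduces to TD(0): each state only receives its own TD error.
w_first = linear_td_lambda_episode(np.zeros(3), episode, alpha=0.1, gamma=0.9, lam=0.0)

w = np.zeros(3)
for _ in range(500):
    w = linear_td_lambda_episode(w, episode, alpha=0.1, gamma=0.9, lam=0.8)
```

With $\lambda>0$ the trace lets a TD error observed late in the episode also adjust the weights of earlier states within the same pass.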

Incremental Control Algorithm

Just as model-free control used $Q$ rather than $V$, here we also apply FA to the action-value function $q$.
- Policy Evaluation: Approximate policy evaluation $\hat{q}(\cdot,\cdot,\bold{w}) \approx q_\pi$
- Policy Improvement: $\epsilon$ -greedy policy improvement

Action-Value Function Approximation
- We want:
  $\hat{q}(S,A,\bold{w}) \approx q_\pi(S,A)$
- To achieve this,
  $J(\bold{w})=\mathbb{E}_\pi[(q_\pi(S,A)-\hat{q}(S,A,\bold{w}))^2]$ ... MSE
  Note: minimizing $J(\bold{w})$ => $\hat{q}(S,A,\bold{w})$ moves closer to $q_\pi(S,A)$
- Use SGD to find a local minimum
  - $\bold{w} \gets \bold{w} + \Delta\bold{w}$
  - $\Delta\bold{w}=-\frac{1}{2}\alpha\nabla_\bold{w}J(\bold{w})\\ \space\space\space\space\space\space\space\space\space\space=\alpha(q_\pi(S,A)-\hat{q}(S,A,\bold{w}))\nabla_{\bold{w}}\hat{q}(S,A,\bold{w})$

Linear Action-Value Function Approximation
- Represent state and action by a feature vector
  $\bold{x}(S,A)=\begin{pmatrix} x_1(S,A)\\ \vdots\\ x_n(S,A) \end{pmatrix}$
- Represent action-value function by linear combination of features
  $\hat{q}(S,A,\bold{w})=\bold{x}(S,A)^T\bold{w}=\sum\limits_{j=1}^{n}x_j(S,A)w_j$
- Stochastic gradient descent update
  $\nabla_\bold{w}\hat{q}(S,A,\bold{w})=\bold{x}(S,A)$
  $\therefore \Delta\bold{w}=\alpha(q_\pi(S,A)-\hat{q}(S,A,\bold{w}))\bold{x}(S,A)$

Linear Incremental Control Algorithms
- Like prediction, we must substitute a target for $q_\pi(S,A)$
  - MC: $G_t$
  - TD(0): $R_{t+1}+\gamma{\hat{q}(S_{t+1},A_{t+1},\bold{w})}$
  - TD( $\lambda$ ): $q^\lambda_t$
- For MC, TD(0), or forward-view TD( $\lambda$ )
  - $\Delta\bold{w}=\alpha(target-\hat{q}(S,A,\bold{w}))\bold{x}(S,A)$
- For backward-view TD( $\lambda$ ),
  - $\delta_t=R_{t+1}+\gamma\hat{q}(S_{t+1},A_{t+1},\bold{w})-\hat{q}(S_t,A_t,\bold{w})$
  - $E_t=\gamma\lambda E_{t-1}+\nabla_\bold{w}\hat{q}(S_t,A_t,\bold{w})\\ \space\space\space\space\space\space=\gamma\lambda E_{t-1}+\bold{x}(S_t,A_t)$
  - $\Delta\bold{w}=\alpha\delta_t E_t$
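To make the control loop concrete, here is a minimal sketch of Sarsa with the linear TD(0) target and $\epsilon$-greedy improvement. Everything about the environment is an assumption made for illustration: a 4-state corridor where state 3 is terminal, two actions, a reward of -1 per step, and one-hot features over (state, action) pairs.

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 2        # hypothetical corridor; state 3 is terminal
LEFT, RIGHT = 0, 1

def x(s, a):
    """One-hot feature vector x(S, A)."""
    phi = np.zeros(N_STATES * N_ACTIONS)
    phi[s * N_ACTIONS + a] = 1.0
    return phi

def q_hat(s, a, w):
    return x(s, a) @ w

def epsilon_greedy(s, w, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([q_hat(s, a, w) for a in range(N_ACTIONS)]))

def step(s, a):
    """Move left or right; every step costs -1; reaching state 3 ends the episode."""
    s_next = max(s - 1, 0) if a == LEFT else s + 1
    return -1.0, s_next, s_next == N_STATES - 1

def sarsa_update(w, s, a, r, s_next, a_next, done, alpha, gamma):
    q_next = 0.0 if done else q_hat(s_next, a_next, w)
    delta = r + gamma * q_next - q_hat(s, a, w)   # target minus current estimate
    return w + alpha * delta * x(s, a)

def train(episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(episodes):
        s, done = 0, False
        a = epsilon_greedy(s, w, epsilon, rng)
        while not done:
            r, s_next, done = step(s, a)
            a_next = epsilon_greedy(s_next, w, epsilon, rng)
            w = sarsa_update(w, s, a, r, s_next, a_next, done, alpha, gamma)
            s, a = s_next, a_next
    return w

w = train()
greedy_policy = [int(np.argmax([q_hat(s, a, w) for a in range(N_ACTIONS)]))
                 for s in range(N_STATES - 1)]
```

Evaluation and improvement are interleaved at every step here: the weights are nudged towards the Sarsa target and the very next action is already chosen $\epsilon$-greedily from the updated $\hat{q}$.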

MC vs. TD

Example: Mountain Car

Image from: here
- Recall, TD(1) $\equiv$ MC
- MC fails to learn properly here. ( $\because$ high variance)
  (In general, bootstrapping tends to work better!)

Example2: Baird's Counterexample (trained with TD(0))

Image from: here
- The parameter $\theta$ explodes
- TD does not always guarantee convergence!

Convergence of Prediction Algorithms
Image from: here
- TD can diverge with off-policy learning or non-linear FA because the TD update is not the gradient of any function! (see the Incremental Prediction Algorithm section above)
- Gradient TD follows true gradient of projected Bellman error, thus converges! (details omitted)

Convergence of Control Algorithms
Image from: here
- Recall,
  Sarsa => On-policy TD control
  Q-learning => Off-policy TD control

Batch Methods

Batch Reinforcement Learning

The algorithms discussed so far receive samples one at a time, in sequence, and update the weights after each one.
Simple but not sample efficient (many updates are needed).
Each sample is used for only one update (reusing it for several updates would be better).
Batch methods seek to find the best fitting value function given the agent's experience(=batch)
i.e., they find the value function that best fits all samples in the batch.

Least Squares Prediction

Using linear value function approximation $\hat{v}(s,\bold{w})=\bold{x}(s)^T\bold{w}$
and experience $\mathcal{D}=\{\lang s_1,v_1^\pi \rang, \lang s_2,v_2^\pi \rang, ..., \lang s_T,v_T^\pi \rang\}$ ,
which parameters $\bold{w}$ give the best fitting value function $\hat{v}(s,\bold{w})$ ?

Least square algorithm minimizes sum-squared error between $\hat{v}(s_t,\bold{w})$ and target values $v_t^\pi$
$LS(\bold{w})=\sum\limits_{t=1}^{T}(v_t^\pi-\hat{v}(s_t,\bold{w}))^2$
- Since $\hat{v}$ is assumed to be linear in $\bold{w}$, $\argmin\limits_{\bold{w}}LS(\bold{w})$ can be found directly as a closed-form solution.
  Let $\bold{w}^*=\argmin\limits_{\bold{w}}LS(\bold{w})$ , then $\mathbb{E}_\mathcal{D}[\Delta\bold{w}^*]=0$
  $\iff \alpha\mathbb{E}_\mathcal{D}[\nabla_{\bold{w}}LS(\bold{w}^*)]=0$
  $\iff \alpha\sum\limits_{t=1}^{T}(v^\pi_t-\bold{x}(s_t)^T\bold{w}^*)\bold{x}(s_t)=0$ ................................ (!)
  $\iff \sum\limits_{t=1}^{T}v^\pi_t\bold{x}(s_t)=\sum\limits_{t=1}^{T}(\bold{x}(s_t)^T\bold{w}^*)\bold{x}(s_t)$
  $\iff \sum\limits_{t=1}^{T}v^\pi_t\bold{x}(s_t)=\sum\limits_{t=1}^{T}(\bold{x}(s_t)\bold{x}(s_t)^T)\bold{w}^*$
  $\iff \bold{w}^*=\left( \sum\limits_{t=1}^{T}(\bold{x}(s_t)\bold{x}(s_t)^T) \right)^{-1}\cdot\sum\limits_{t=1}^{T}v^\pi_t\bold{x}(s_t)$
  (For N features, direct solution time is $O(N^3)$ )
  - In practice $v^\pi_t$ is unknown, so we substitute a target:
    - 1) MC
      $\mathbb{E}_\mathcal{D}[\Delta\bold{w}^*]=0$ with $v_t^\pi=G_t$
      $\implies$ substitute $G_t$ for $v^\pi_t$ in equation (!)
      $\implies \bold{w}^* = \left( \sum\limits_{t=1}^T\bold{x}(s_t)\bold{x}(s_t)^T \right)^{-1}\cdot\sum\limits_{t=1}^{T}G_t\bold{x}(s_t)$
    - 2) TD
      $\mathbb{E}_\mathcal{D}[\Delta\bold{w}^*]=0$ with $v_t^\pi=R_{t+1}+\gamma\hat{v}(s_{t+1},\bold{w})$
      $\implies$ substitute $R_{t+1}+\gamma\hat{v}(s_{t+1},\bold{w})$ for $v^\pi_t$ in equation (!)
      $\implies \bold{w} = \left( \sum\limits_{t=1}^T\bold{x}(s_t)(\bold{x}(s_t)-\gamma\bold{x}(s_{t+1}))^T \right)^{-1}\cdot\sum\limits_{t=1}^{T}R_{t+1}\bold{x}(s_t)$
    - 3) TD( $\lambda$ )
      $\mathbb{E}_\mathcal{D}[\Delta\bold{w}^*]=0$ with Backward view TD( $\lambda$ )
      $\implies \sum\limits_{t=1}^{T}\alpha\delta_t E_t=\alpha\sum\limits_{t=1}^T\left(R_{t+1}+\gamma\hat{v}(s_{t+1},\bold{w})-\hat{v}(s_t,\bold{w})\right)E_t=0$
      $\implies \bold{w} = \left( \sum\limits_{t=1}^T E_t(\bold{x}(s_t)-\gamma\bold{x}(s_{t+1}))^T \right)^{-1}\cdot\sum\limits_{t=1}^{T}R_{t+1}E_t$
- Convergence of Least Square Prediction Algorithms
  Image from: here
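- Linear FA + least squares always guarantees convergence $\hat{v} \to v_\pi$ (up to the best fit the features allow), because a closed-form solution exists!
- The incremental prediction methods above could not guarantee convergence for off-policy TD (even with linear FA).
- It cannot be applied in the non-linear case.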
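The closed-form MC and TD solutions derived above (often called LSMC and LSTD) can be sketched directly with `np.linalg.solve`. The data below is a hypothetical deterministic 3-state chain with one-hot features, reward 1 per step and $\gamma=0.9$; terminal successors are represented by an all-zero feature row, which is an assumption of this example.

```python
import numpy as np

def lsmc(X, G):
    """w = (sum_t x_t x_t^T)^{-1} sum_t G_t x_t, with rows of X being x(s_t)."""
    A = X.T @ X
    b = X.T @ G
    return np.linalg.solve(A, b)

def lstd(X, X_next, R, gamma):
    """w = (sum_t x_t (x_t - gamma x_{t+1})^T)^{-1} sum_t R_{t+1} x_t."""
    A = X.T @ (X - gamma * X_next)
    b = X.T @ R
    return np.linalg.solve(A, b)

gamma = 0.9
X = np.eye(3)                               # x(s_t) for the chain 0 -> 1 -> 2
X_next = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.0, 0.0, 0.0]])        # zero row: successor is terminal
R = np.ones(3)                              # R_{t+1} = 1 at every step
G = np.array([1 + gamma + gamma**2, 1 + gamma, 1.0])   # returns G_t

w_mc = lsmc(X, G)
w_td = lstd(X, X_next, R, gamma)
```

Solving the linear system is preferred over forming the inverse explicitly; the cost is still cubic in the number of features, as noted above.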

When the function is non-linear or N is large, $\bold{w}$ is hard to compute directly, so the method above cannot be applied.
=> Use an iterative method instead, similar to the incremental methods above.
=> SGD with experience replay ... sample repeatedly from a single stored experience.
However, this does not solve the problem of requiring many updates.
- Repeat:
  - Sample state, value from experience
    $\lang s, v^\pi \rang \sim \mathcal{D}$
  - Apply SGD update
    $\Delta\bold{w}=-\frac{1}{2}\alpha\nabla_{\bold{w}}LS(\bold{w})$ ... here the loss is per sample, so $LS(\bold{w})=(v^\pi-\hat{v}(s,\bold{w}))^2$
    $\space\space\space\space\space\space\space\space=\alpha(v^\pi-\hat{v}(s,\bold{w}))\nabla_{\bold{w}}\hat{v}(s,\bold{w})$
- Converges to least squares solution: (replay helps convergence!)
  $\bold{w}^\pi=\argmin\limits_{\bold{w}}LS(\bold{w})$
  $\hat{v}(s,\bold{w}^\pi) \approx v^\pi(s)$
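A minimal sketch of the repeat loop above: sample $\lang s, v^\pi \rang$ from a fixed experience set and apply one SGD step each time. The four data points, the step size, and the number of steps are made up for illustration; the result is compared against `np.linalg.lstsq` as the reference least squares solution.

```python
import numpy as np

def replay_sgd(X, v, alpha=0.01, steps=20000, seed=0):
    """SGD with experience replay on a fixed dataset D = {(x(s_i), v_i)}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(v))                   # sample <s, v> ~ D
        w += alpha * (v[i] - X[i] @ w) * X[i]      # sample-wise SGD update
    return w

# Hypothetical experience: features (1, s) and noisy targets roughly equal to s.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
v = np.array([0.1, 0.9, 2.1, 2.9])

w_sgd = replay_sgd(X, v)
w_ls = np.linalg.lstsq(X, v, rcond=None)[0]        # closed-form reference
```

Since the targets here are not exactly linear in the features, a constant step size leaves `w_sgd` hovering in a small neighbourhood of `w_ls` rather than matching it exactly; a decaying step size would be needed for exact convergence.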

Experience Replay in Deep Q-Network (DQN)
- DQN uses experience replay and fixed Q-targets
  - Experience replay decorrelates the trajectories (which stabilizes learning)
  - Fixed Q-targets means that we freeze the old network ( $w^-$ ) and bootstrap towards the frozen target (this also stabilizes learning, because a moving target makes training unstable!)
- DQN Algorithm
  - Take action $a_t$ (by $\epsilon$ -greedy)
  - Store transition $(s_t, a_t, r_{t+1},s_{t+1})$ in replay memory $\mathcal{D}$
  - Repeat
    - Sample random mini-batch of transitions $(s, a, r, s')$ from $\mathcal{D}$
    - Compute Q-learning targets w.r.t. old, fixed parameters $\bold{w}^-$
      Note: for a sample, Q-learning target $=r+\gamma\max\limits_{a'}Q(s',a';\bold{w}^-)$
    - Optimize MSE between Q-network and Q-learning targets using variant of SGD
      $\mathcal{L}_i(\bold{w}_i)=\mathbb{E}_{s,a,r,s'\sim\mathcal{D}_i}\left[\left(r+\gamma\max\limits_{a'}Q(s',a';\bold{w}^-_i)-Q(s,a;\bold{w}_i)\right)^2\right]$
      Note: adjust $\bold{w}$ so that, on the current mini-batch, it best fits the one-step-ahead Q obtained with $\bold{w}^-$
    - $\bold{w}^- \gets \bold{w}$
- DQN in Atari
  Image from: here
  - Input (Stack of 4 previous frames) = State
  - Output (linear output layer with 18 elements) = $Q(s,a)$ for 18 joystick/button positions

Least Squares Control

Least Squares Policy Iteration
Image from: here
- Policy Evaluation: Least squares Q-learning
  Policy Improvement: Greedy policy improvement

Least Squares Action-Value Function Approximation
- Approximate action-value function $q_\pi(s,a)$
  using linear combination of features $\bold{x}(s,a)$ ,
  $\hat{q}(s,a,\bold{w})=\bold{x}(s,a)^T\bold{w} \approx q_\pi(s,a)$
- Minimize least squares error between $\hat{q}(s,a,\bold{w})$ and $q_\pi(s,a)$ from experience $\mathcal{D}$ generated using policy $\pi$ ,
  $\mathcal{D}=\{ \lang (s_1,a_1),v_1^\pi \rang, \lang (s_2,a_2),v_2^\pi \rang,..., \lang (s_T,a_T),v_T^\pi \rang \}$

Least Squares Q-learning
We consider the off-policy setting (the old policy is the behaviour policy) and apply Q-learning.
- Use experience generated by old policy
  $S_t, A_t, R_{t+1}, S_{t+1} \sim \pi_{old}$
- Consider alternative successor action $A'=\pi_{new}(S_{t+1})$
- Update $\hat{q}(S_t,A_t,\bold{w})$ towards value of alternative action,
  $R_{t+1}+\gamma\hat{q}(S_{t+1},A',\bold{w})$
- Q-learning Update:
  $\delta=R_{t+1}+\gamma\hat{q}(S_{t+1},A',\bold{w})-\hat{q}(S_t,A_t,\bold{w})$
  $\Delta\bold{w}=\alpha\delta\bold{x}(S_t,A_t)$
- LSTDQ algorithm: Total update = 0
  $\mathbb{E}_\mathcal{D}[\Delta\bold{w}]=0$
  $\iff \sum\limits_{t=1}^T\alpha\delta\bold{x}(S_t,A_t)=0$
  $\iff \bold{w}=\left( \sum\limits_{t=1}^T\bold{x}(S_t,A_t)(\bold{x}(S_t,A_t)-\gamma\bold{x}(S_{t+1},\pi(S_{t+1})))^T \right)^{-1}\cdot\sum\limits_{t=1}^T\bold{x}(S_{t},A_t)R_{t+1}$

LSPI-TD with LSTDQ Algorithm
Image from: here
- The experience $\mathcal{D}$ is evaluated repeatedly! (each time with the updated policy)
  => hence off-policy!
- Compared with off-policy control by Q-learning in Lec. 5, the difference is that updates are made per batch (=experience). (more efficient)
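The LSTDQ solve and the surrounding policy iteration loop can be sketched as below. The environment is a hypothetical 4-state corridor (state 3 terminal, reward -1 per step, $\gamma=0.9$) with one-hot features over non-terminal (state, action) pairs, and the experience $\mathcal{D}$ contains each transition once; all of these are assumptions made so the result is easy to verify. Transitions into the terminal state use an all-zero successor feature vector.

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9   # hypothetical corridor; state 3 is terminal
N_FEATURES = (N_STATES - 1) * N_ACTIONS

def x(s, a):
    """One-hot feature vector x(S, A) over non-terminal states."""
    phi = np.zeros(N_FEATURES)
    phi[s * N_ACTIONS + a] = 1.0
    return phi

def lstdq(D, policy):
    """Solve sum x(s,a) (x(s,a) - gamma x(s', pi(s')))^T w = sum x(s,a) r."""
    A, b = np.zeros((N_FEATURES, N_FEATURES)), np.zeros(N_FEATURES)
    for s, a, r, s_next, done in D:
        x_sa = x(s, a)
        x_next = np.zeros(N_FEATURES) if done else x(s_next, policy[s_next])
        A += np.outer(x_sa, x_sa - GAMMA * x_next)
        b += r * x_sa
    return np.linalg.solve(A, b)

def greedy(w):
    return [int(np.argmax([x(s, a) @ w for a in range(N_ACTIONS)]))
            for s in range(N_STATES - 1)]

def lspi(D, iterations=10):
    """Re-evaluate the same experience D with each improved policy."""
    policy = [0] * (N_STATES - 1)            # start with "always left"
    for _ in range(iterations):
        w = lstdq(D, policy)                 # policy evaluation
        policy = greedy(w)                   # greedy policy improvement
    return policy, w

# Experience: every (state, action) pair once; action 0 = left, 1 = right.
D = []
for s in range(N_STATES - 1):
    for a in range(N_ACTIONS):
        s_next = max(s - 1, 0) if a == 0 else s + 1
        D.append((s, a, -1.0, s_next, s_next == N_STATES - 1))

policy, w = lspi(D)
```

Note that `D` is collected once and never regenerated: only the successor action $\pi(S_{t+1})$ inside `lstdq` changes between iterations, which is what makes the procedure off-policy.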