Temporal difference
Recall the Bellman equation:
$$
\begin{aligned}
Q(s_t, a_t) &= \int_{s_{t+1}:a_\infty} G_t \, p(s_{t+1}, a_{t+1}, \dots \mid s_t, a_t) \, d(s_{t+1}{:}a_\infty) \\
&= \int_{s_{t+1}, a_{t+1}} \big( R_t + \gamma Q(s_{t+1}, a_{t+1}) \big) \, p(s_{t+1}, a_{t+1} \mid s_t, a_t) \, d(s_{t+1}, a_{t+1}) \\
&\approx \frac{1}{N} \sum_{i=1}^{N} \Big( R_t^{(i)} + \gamma Q\big(s_{t+1}^{(i)}, a_{t+1}^{(i)}\big) \Big)
\end{aligned}
$$

Let's call this sample-average estimate $Q_N$.
The difference between temporal difference and Monte Carlo is that the temporal difference method includes the next action-value function in its target. And, as the equation shows, $R_t^{(i)}$ is also a sampled value. In deterministic search problems there is no variation, but in interactive scenarios like video games or self-driving vehicles, the agent can encounter entirely different situations independent of its actions. As a result, the reward is a random variable.
Since the temporal difference method includes the next action-value function, we only need to look one step ahead. This is called 1-step temporal difference.
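As a quick numerical sketch of the sample-average form above (the variable names `rewards`, `next_q_values`, and the values below are illustrative, not from the text):

```python
# Sample-average estimate Q_N of Q(s_t, a_t), assuming N sampled transitions
# from (s_t, a_t): a reward R_t^(i) and a bootstrapped value Q(s_{t+1}^(i), a_{t+1}^(i)).
gamma = 0.99                      # discount factor (illustrative value)
rewards = [1.0, 0.0, 1.0]         # R_t^(i) for i = 1..N (made-up samples)
next_q_values = [0.5, 0.7, 0.6]   # Q(s_{t+1}^(i), a_{t+1}^(i)) for i = 1..N

N = len(rewards)
Q_N = sum(r + gamma * q for r, q in zip(rewards, next_q_values)) / N
```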
Temporal difference is flexible: it has many variants (2-step, 3-step, …).
If we use 2-step TD, the equation is:
$$
Q(s_t, a_t) = R_t + \gamma R_{t+1} + \gamma^2 Q(s_{t+2}, a_{t+2})
$$
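A small sketch of how such an n-step target could be computed; the helper `n_step_target` and its arguments are hypothetical names, not defined anywhere earlier:

```python
def n_step_target(rewards, q_bootstrap, gamma):
    """n-step TD target: R_t + gamma*R_{t+1} + ... + gamma^n * Q(s_{t+n}, a_{t+n})."""
    discounted_rewards = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted_rewards + gamma**len(rewards) * q_bootstrap

# 2-step example with made-up numbers: rewards R_t, R_{t+1} and estimate Q(s_{t+2}, a_{t+2})
two_step = n_step_target(rewards=[1.0, 0.5], q_bootstrap=0.8, gamma=0.99)
```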
Now, let's expand $Q_N$:
$$
Q_N = \frac{1}{N} \Big( (N-1)\,Q_{N-1} + R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) \Big)
$$
This expansion means that we already have $(N-1)\,Q_{N-1}$ (the sum of the first $N-1$ sampled targets) and simply add the $N$-th term.
Let's continue the expansion.
$$
Q_N = Q_{N-1} + \frac{1}{N} \Big( R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) - Q_{N-1} \Big)
$$
This form is also called the incremental Monte Carlo update.
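A minimal sketch of this incremental update, assuming we receive the sampled targets one at a time (all names and values here are illustrative):

```python
def incremental_update(q_prev, target, n):
    """Q_N = Q_{N-1} + (1/N) * (target - Q_{N-1})"""
    return q_prev + (target - q_prev) / n

# Sampled targets R_t^(i) + gamma * Q(s_{t+1}^(i), a_{t+1}^(i)) (made-up values)
targets = [1.2, 0.9, 1.1, 1.0]
q = 0.0
for i, target in enumerate(targets, start=1):
    q = incremental_update(q, target, i)
# q now equals sum(targets) / len(targets), without storing all the samples
```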
If we take $\frac{1}{N}$ as a constant step size $\alpha$, an amazing thing happens.
$$
Q_N = Q_{N-1} + \alpha \Big( R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) - Q_{N-1} \Big) = (1-\alpha)\, Q_{N-1} + \alpha \Big( R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) \Big)
$$
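A tiny numerical check that the two forms above are the same update (all values here are made up for illustration):

```python
alpha, gamma = 0.1, 0.99
q_prev = 0.5                      # Q_{N-1}
reward, q_next = 1.0, 0.7         # R_t^(N) and Q(s_{t+1}^(N), a_{t+1}^(N))

target = reward + gamma * q_next
q_incremental = q_prev + alpha * (target - q_prev)     # error-correction form
q_convex      = (1 - alpha) * q_prev + alpha * target  # convex-combination form
assert abs(q_incremental - q_convex) < 1e-9
```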
We have seen this update equation in Q-learning. But it is slightly different, because Q-learning bootstraps from the maximum over the next actions, $\max_{a} Q(s_{t+1}, a)$, rather than the value of the action actually taken.
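The contrast can be sketched like this, assuming a toy tabular Q (the states, actions, and values are illustrative):

```python
gamma = 0.99
Q = {("s1", "left"): 0.4, ("s1", "right"): 0.9}   # toy Q-values for the next state
reward, next_state, next_action = 1.0, "s1", "left"

# SARSA bootstraps from the action actually taken next ...
sarsa_target = reward + gamma * Q[(next_state, next_action)]
# ... while Q-learning bootstraps from the greedy (maximum) next action.
q_learning_target = reward + gamma * max(Q[(next_state, a)] for a in ("left", "right"))
```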
Temporal difference terminology
TD error: the difference between the 1-step look-ahead estimate and the current value. (It can also be a 2-step or 3-step look-ahead, …)
$$
R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) - Q_{N-1}
$$
TD target: the sampled estimate used to form the TD error.
$$
R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big)
$$
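In code, the two quantities could be computed like this (a sketch with made-up values):

```python
gamma = 0.99
q_current = 0.5                   # Q_{N-1}, the current estimate
reward, q_next = 1.0, 0.7         # R_t^(N) and Q(s_{t+1}^(N), a_{t+1}^(N))

td_target = reward + gamma * q_next     # the sampled look-ahead estimate
td_error  = td_target - q_current       # how far off the current estimate is
```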
SARSA
SARSA stands for:
S -> current state
A -> current action
R -> reward corresponding to current state & action
S -> next state
A -> next action
$$
Q_N = (1-\alpha)\, Q_{N-1} + \alpha \Big( R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) \Big)
$$
This equation is exactly SARSA: $Q_{N-1}$ corresponds to the current state & action, $R_t^{(N)}$ to the reward, and $Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big)$ to the next state & action.
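Putting the pieces together, here is a minimal tabular SARSA sketch. It assumes a hypothetical environment `env` with gym-style `reset()` / `step()` methods and a small discrete action set; none of these names come from the text above.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, eps=0.1):
    state = env.reset()                                          # S
    action = epsilon_greedy(Q, state, actions, eps)              # A
    done = False
    while not done:
        next_state, reward, done = env.step(action)              # R, S'
        next_action = epsilon_greedy(Q, next_state, actions, eps)  # A'
        target = reward + gamma * Q[(next_state, next_action)] * (not done)
        Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target
        state, action = next_state, next_action

Q = defaultdict(float)   # tabular action-value estimates, default 0.0
```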
An important point to note is that both methods, Monte Carlo and temporal difference, rely on a decaying $\epsilon$-greedy strategy. This ensures that the action-value function and its corresponding policy converge to the optimal ones!
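For completeness, one common way to decay $\epsilon$ over episodes could look like this (the exponential schedule and the constants are illustrative assumptions, not prescribed by the text):

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Shrink epsilon each episode so the policy becomes greedy in the limit."""
    return max(eps_min, eps_start * decay**episode)

# e.g. pass decayed_epsilon(episode) as eps when running each Monte Carlo or SARSA episode
```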