3.3 Temporal difference (TD) & SARSA

Tommy Kim · September 12, 2023

Temporal difference

Recall the Bellman equation:

$$
\begin{aligned}
Q(s_t,a_t) &= \int_{s_{t+1}:a_\infty} G_t \, p(s_{t+1}, a_{t+1}, \dots \mid s_t, a_t)\, ds_{t+1}{:}a_\infty \\
&= \int_{s_{t+1},a_{t+1}} \left(R_t + \gamma Q(s_{t+1}, a_{t+1})\right) p(s_{t+1},a_{t+1}\mid s_t,a_t)\, ds_{t+1},a_{t+1} \\
&\approx \frac{1}{N} \sum_{i=1}^N \left(R_t^{(i)} + \gamma Q(s_{t+1}^{(i)}, a_{t+1}^{(i)})\right)
\end{aligned}
$$

Let's call this last estimate $\overline{Q}_N$.
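As a rough sketch of that last approximation step (the function and the `samples` layout are my own illustration, not part of the derivation), $\overline{Q}_N$ is just a sample average of one-step targets:

```python
def estimate_q(samples, gamma=0.99):
    """Sample-average estimate of Q(s_t, a_t) from N sampled one-step targets.

    Each element of `samples` is a pair (r, q_next): a sampled reward R_t and
    the current estimate of Q(s_{t+1}, a_{t+1}) for that sample.
    """
    return sum(r + gamma * q_next for r, q_next in samples) / len(samples)

# Three sampled targets for the same (s_t, a_t)
print(estimate_q([(1.0, 0.5), (0.0, 0.7), (1.0, 0.6)]))
```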

The difference between temporal difference and Monte Carlo is that the temporal difference method includes the next action value function in its target. As the equation shows, $R_t^{(i)}$ is also a sampled value. In deterministic search problems there is no such variation, but in interactive scenarios like video games or self-driving vehicles the agent can encounter entirely different situations independent of its actions. As a result, the reward becomes a random variable.

Since the temporal difference method includes the next action value function, we only need to look one step ahead. This is called 1-step temporal difference.

Temporal difference is flexible: it has many variants (2-step, 3-step, …).
If we use 2-step TD, the equation is:
$$Q(s_t,a_t) = R_t + \gamma R_{t+1} + \gamma^2 Q(s_{t+2}, a_{t+2})$$
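To make the n-step idea concrete, here is a small sketch (the function name and arguments are hypothetical) that builds the n-step TD target from sampled rewards and the tail value estimate:

```python
def n_step_td_target(rewards, q_tail, gamma=0.99):
    """n-step TD target: R_t + γR_{t+1} + ... + γ^{n-1}R_{t+n-1} + γ^n Q(s_{t+n}, a_{t+n}).

    `rewards` holds the n sampled rewards R_t ... R_{t+n-1};
    `q_tail` is the current estimate Q(s_{t+n}, a_{t+n}).
    """
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    return target + (gamma ** len(rewards)) * q_tail

# 1-step TD:  R_t + γ Q(s_{t+1}, a_{t+1})
print(n_step_td_target([1.0], q_tail=0.6))
# 2-step TD:  R_t + γ R_{t+1} + γ² Q(s_{t+2}, a_{t+2})
print(n_step_td_target([1.0, 0.5], q_tail=0.6))
```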

Now, let's expand $\overline{Q}_N$:

$$\overline{Q}_N = \frac{1}{N} \left((N-1)\,\overline{Q}_{N-1} + R_t^{(N)} + \gamma Q(s_{t+1}^{(N)}, a_{t+1}^{(N)})\right)$$
This expansion means that we already have $(N-1)\,\overline{Q}_{N-1}$ and add the $N$-th term.
Let's continue the expansion.
$$\overline{Q}_N = \overline{Q}_{N-1} + \frac{1}{N}\left(R_t^{(N)} + \gamma Q(s_{t+1}^{(N)}, a_{t+1}^{(N)}) - \overline{Q}_{N-1}\right)$$
This form is also known as the incremental Monte Carlo update.

If we replace $\frac{1}{N}$ with a constant step size $\alpha$, something remarkable happens.

$$
\begin{aligned}
\overline{Q}_N &= \overline{Q}_{N-1} + \frac{1}{N}\left(R_t^{(N)} + \gamma Q(s_{t+1}^{(N)}, a_{t+1}^{(N)}) - \overline{Q}_{N-1}\right)\\
&= (1-\alpha)\, \overline{Q}_{N-1} + \alpha \left(R_t^{(N)} + \gamma Q(s_{t+1}^{(N)}, a_{t+1}^{(N)})\right)
\end{aligned}
$$
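A minimal sketch of this constant-$\alpha$ update, assuming a plain Python dictionary as the Q table (the names `td_target` and `td_error` anticipate the terminology below):

```python
def td_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One constant-α TD update of the table entry Q[(s, a)].

    Algebraically the same as (1 - α) * Q[(s, a)] + α * (r + γ * Q[(s', a')]).
    """
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)   # R + γ Q(s', a')
    td_error = td_target - Q.get((s, a), 0.0)              # target minus current estimate
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q[(s, a)]

Q = {}
td_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
print(Q[(0, 1)])   # 0.1 * 1.0 = 0.1 when starting from an empty table
```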

We have already seen this equation in Q-learning. It is slightly different, though, because Q-learning takes the next-state Q term as a maximum over actions.
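To see the difference side by side, here is a hedged sketch of the two targets (the Q-table layout and the `actions` list are assumptions): SARSA uses the action the agent actually takes next, while Q-learning uses the greedy maximum.

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    # On-policy: uses the action the agent will actually take next.
    return r + gamma * Q.get((s_next, a_next), 0.0)

def q_learning_target(Q, r, s_next, actions, gamma=0.99):
    # Off-policy: uses the best available action value in the next state.
    return r + gamma * max(Q.get((s_next, a), 0.0) for a in actions)
```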

Temporal difference terminology

TD-error: the difference between the 1-step target and the current estimate (the target can also look 2 steps or 3 steps ahead, …).
$$R_t^{(N)} + \gamma Q(s_{t+1}^{(N)}, a_{t+1}^{(N)}) - \overline{Q}_{N-1}$$

TD-target: the sampled value used inside the TD-error.
$$R_t^{(N)} + \gamma Q(s_{t+1}^{(N)}, a_{t+1}^{(N)})$$

SARSA

SARSA stands for:
S -> current state
A -> current action
R -> reward for the current state & action
S -> next state
A -> next action
$$(1-\alpha)\, \overline{Q}_{N-1} + \alpha \left(R_t^{(N)} + \gamma Q(s_{t+1}^{(N)}, a_{t+1}^{(N)})\right)$$
This equation shows SARSA: $\overline{Q}_{N-1}$ corresponds to the current state & action, $R_t^{(N)}$ to the reward, and $Q(s_{t+1}^{(N)}, a_{t+1}^{(N)})$ to the next state & action.
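Putting the pieces together, a minimal SARSA loop could look like the sketch below. The `env.reset()` / `env.step()` interface and the `epsilon_greedy` helper are assumptions for illustration, not code from this post.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability ε pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One episode of SARSA: each update uses the tuple (S, A, R, S', A')."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, epsilon)                 # S, A
    done = False
    while not done:
        s_next, r, done = env.step(a)                          # R, S'
        a_next = epsilon_greedy(Q, s_next, actions, epsilon)   # A'
        target = r + gamma * (0.0 if done else Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])              # = (1-α)Q̄ + α(R + γQ')
        s, a = s_next, a_next

Q = defaultdict(float)  # action-value table, entries default to 0.0
```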

An important point to note is that both methods, Monte Carlo and temporal difference, rely on a decaying $\epsilon$-greedy strategy. This ensures that the action value function and its corresponding policy converge to the optimum!
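A simple decay schedule, as a sketch (the exponential schedule and its constants are illustrative assumptions; what matters is that $\epsilon$ shrinks toward zero as training progresses):

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Exponentially decaying ε: lots of exploration early, nearly greedy later."""
    return max(eps_min, eps_start * decay ** episode)

for episode in (0, 100, 1000):
    print(episode, round(decayed_epsilon(episode), 4))
```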
