Temporal difference
Recall the Bellman equation:
$$
\begin{aligned}
Q(s_t, a_t) &= \int_{s_{t+1}:a_\infty} G_t \, p(s_{t+1}, a_{t+1}, \dots \mid s_t, a_t) \, d(s_{t+1}{:}a_\infty) \\
&= \int_{s_{t+1}, a_{t+1}} \big( R_t + \gamma Q(s_{t+1}, a_{t+1}) \big) \, p(s_{t+1}, a_{t+1} \mid s_t, a_t) \, d(s_{t+1}, a_{t+1}) \\
&\approx \frac{1}{N} \sum_{i=1}^{N} \Big( R_t^{(i)} + \gamma Q\big(s_{t+1}^{(i)}, a_{t+1}^{(i)}\big) \Big)
\end{aligned}
$$

Let's call this sample-average estimate $Q_N$.
The difference between temporal difference and Monte Carlo is that the temporal difference method includes the next action-value function in its target. And, as the equation shows, $R_t^{(i)}$ is also a sampled value. In deterministic search problems there is no variation, but in interactive scenarios like video games or self-driving vehicles, the agent can encounter entirely different situations independent of its actions. As a result, the reward is a random variable.
Since the temporal difference method includes the next action-value function, we only need to look one step ahead. This is called 1-step temporal difference.
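As a quick numerical sketch of the sample-average form above (the variable names `rewards`, `next_q_values`, and the values below are illustrative, not from the text):

```python
# Sample-average estimate Q_N of Q(s_t, a_t), assuming N sampled transitions
# from (s_t, a_t): a reward R_t^(i) and a bootstrapped value Q(s_{t+1}^(i), a_{t+1}^(i)).
gamma = 0.99                      # discount factor (illustrative value)
rewards = [1.0, 0.0, 1.0]         # R_t^(i) for i = 1..N (made-up samples)
next_q_values = [0.5, 0.7, 0.6]   # Q(s_{t+1}^(i), a_{t+1}^(i)) for i = 1..N

N = len(rewards)
Q_N = sum(r + gamma * q for r, q in zip(rewards, next_q_values)) / N
```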
Temporal difference is flexible: it has many variants (2-step, 3-step, …).
If we use 2-step TD, the equation is:
$$
Q(s_t, a_t) = R_t + \gamma R_{t+1} + \gamma^2 Q(s_{t+2}, a_{t+2})
$$
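A small sketch of how such an n-step target could be computed; the helper `n_step_target` and its arguments are hypothetical names, not defined anywhere earlier:

```python
def n_step_target(rewards, q_bootstrap, gamma):
    """n-step TD target: R_t + gamma*R_{t+1} + ... + gamma^n * Q(s_{t+n}, a_{t+n})."""
    discounted_rewards = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted_rewards + gamma**len(rewards) * q_bootstrap

# 2-step example with made-up numbers: rewards R_t, R_{t+1} and estimate Q(s_{t+2}, a_{t+2})
two_step = n_step_target(rewards=[1.0, 0.5], q_bootstrap=0.8, gamma=0.99)
```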
Now, let's expand $Q_N$:
$$
Q_N = \frac{1}{N} \Big( (N-1)\,Q_{N-1} + R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) \Big)
$$
This expansion means that we already have $(N-1)\,Q_{N-1}$ (the sum of the first $N-1$ sampled targets) and simply add the $N$-th term.
Let's continue the expansion.
$$
Q_N = Q_{N-1} + \frac{1}{N} \Big( R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) - Q_{N-1} \Big)
$$
This form is also called the incremental Monte Carlo update.
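A minimal sketch of this incremental update, assuming we receive the sampled targets one at a time (all names and values here are illustrative):

```python
def incremental_update(q_prev, target, n):
    """Q_N = Q_{N-1} + (1/N) * (target - Q_{N-1})"""
    return q_prev + (target - q_prev) / n

# Sampled targets R_t^(i) + gamma * Q(s_{t+1}^(i), a_{t+1}^(i)) (made-up values)
targets = [1.2, 0.9, 1.1, 1.0]
q = 0.0
for i, target in enumerate(targets, start=1):
    q = incremental_update(q, target, i)
# q now equals sum(targets) / len(targets), without storing all the samples
```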
If we take $\frac{1}{N}$ as a constant step size $\alpha$, an amazing thing happens.
$$
Q_N = Q_{N-1} + \alpha \Big( R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) - Q_{N-1} \Big) = (1-\alpha)\, Q_{N-1} + \alpha \Big( R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) \Big)
$$
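A tiny numerical check that the two forms above are the same update (all values here are made up for illustration):

```python
alpha, gamma = 0.1, 0.99
q_prev = 0.5                      # Q_{N-1}
reward, q_next = 1.0, 0.7         # R_t^(N) and Q(s_{t+1}^(N), a_{t+1}^(N))

target = reward + gamma * q_next
q_incremental = q_prev + alpha * (target - q_prev)     # error-correction form
q_convex      = (1 - alpha) * q_prev + alpha * target  # convex-combination form
assert abs(q_incremental - q_convex) < 1e-9
```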
We have seen this update equation in Q-learning. But it is slightly different, because Q-learning bootstraps from the maximum over the next actions, $\max_{a} Q(s_{t+1}, a)$, rather than the value of the action actually taken.
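The contrast can be sketched like this, assuming a toy tabular Q (the states, actions, and values are illustrative):

```python
gamma = 0.99
Q = {("s1", "left"): 0.4, ("s1", "right"): 0.9}   # toy Q-values for the next state
reward, next_state, next_action = 1.0, "s1", "left"

# SARSA bootstraps from the action actually taken next ...
sarsa_target = reward + gamma * Q[(next_state, next_action)]
# ... while Q-learning bootstraps from the greedy (maximum) next action.
q_learning_target = reward + gamma * max(Q[(next_state, a)] for a in ("left", "right"))
```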
Temporal difference terminology
TD error: the difference between the 1-step look-ahead estimate and the current value. (It can also be a 2-step or 3-step look-ahead, …)
$$
R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) - Q_{N-1}
$$
TD target: the sampled estimate used to form the TD error.
$$
R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big)
$$
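In code, the two quantities could be computed like this (a sketch with made-up values):

```python
gamma = 0.99
q_current = 0.5                   # Q_{N-1}, the current estimate
reward, q_next = 1.0, 0.7         # R_t^(N) and Q(s_{t+1}^(N), a_{t+1}^(N))

td_target = reward + gamma * q_next     # the sampled look-ahead estimate
td_error  = td_target - q_current       # how far off the current estimate is
```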
SARSA
SARSA stands for:
S -> current state
A -> current action
R -> reward corresponding to current state & action
S -> next state
A -> next action
$$
Q_N = (1-\alpha)\, Q_{N-1} + \alpha \Big( R_t^{(N)} + \gamma Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big) \Big)
$$
This equation is exactly SARSA: $Q_{N-1}$ corresponds to the current state & action, $R_t^{(N)}$ to the reward, and $Q\big(s_{t+1}^{(N)}, a_{t+1}^{(N)}\big)$ to the next state & action.
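Putting the pieces together, here is a minimal tabular SARSA sketch. It assumes a hypothetical environment `env` with gym-style `reset()` / `step()` methods and a small discrete action set; none of these names come from the text above.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, eps=0.1):
    state = env.reset()                                          # S
    action = epsilon_greedy(Q, state, actions, eps)              # A
    done = False
    while not done:
        next_state, reward, done = env.step(action)              # R, S'
        next_action = epsilon_greedy(Q, next_state, actions, eps)  # A'
        target = reward + gamma * Q[(next_state, next_action)] * (not done)
        Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target
        state, action = next_state, next_action

Q = defaultdict(float)   # tabular action-value estimates, default 0.0
```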
An important point to note is that both methods, Monte Carlo and temporal difference, rely on a decaying $\epsilon$-greedy strategy. This ensures that the action-value function and its corresponding policy converge to the optimal ones!
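For completeness, one common way to decay $\epsilon$ over episodes could look like this (the exponential schedule and the constants are illustrative assumptions, not prescribed by the text):

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Shrink epsilon each episode so the policy becomes greedy in the limit."""
    return max(eps_min, eps_start * decay**episode)

# e.g. pass decayed_epsilon(episode) as eps when running each Monte Carlo or SARSA episode
```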