2-step TD
Recall the equation for $G_t$ (the return, i.e. the cumulative discounted reward):
$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = R_t + \gamma R_{t+1} + \gamma^2 G_{t+2}$$
We will use this form of the equation to build 2-step TD.
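As a quick sanity check of this identity, here is a minimal numerical sketch (the reward sequence, discount factor, and the `full_return` helper are hypothetical, chosen only for illustration):

```python
import numpy as np

def full_return(rewards, gamma):
    """G_t as the full discounted sum: sum_k gamma^k * R_{t+k}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Hypothetical reward sequence R_t, R_{t+1}, R_{t+2}, ...
rewards = np.array([1.0, 0.5, -0.2, 0.8, 0.3])
gamma = 0.9

# Left-hand side: the full discounted sum starting at t.
G_t = full_return(rewards, gamma)

# Right-hand side: keep the first two rewards, bootstrap the rest via G_{t+2}.
G_t2 = full_return(rewards[2:], gamma)
two_step_form = rewards[0] + gamma * rewards[1] + gamma**2 * G_t2

print(G_t, two_step_form)  # the two values match
```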
Let’s apply the Bellman equation to the action-value function.
$$
\begin{aligned}
Q(s_t, a_t) &= \int_{s_{t+1}, a_{t+1}} \big(R_t + \gamma Q(s_{t+1}, a_{t+1})\big)\, p(a_{t+1} \mid s_{t+1})\, p(s_{t+1} \mid s_t, a_t)\, ds_{t+1}\, da_{t+1} \\
&= \int_{s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}} \big(R_t + \gamma R_{t+1} + \gamma^2 Q(s_{t+2}, a_{t+2})\big)\, p(a_{t+2} \mid s_{t+2})\, p(s_{t+2} \mid s_{t+1}, a_{t+1})\, p(a_{t+1} \mid s_{t+1})\, p(s_{t+1} \mid s_t, a_t)\, ds_{t+1}\, da_{t+1}\, ds_{t+2}\, da_{t+2}
\end{aligned}
$$
As we learned in the off-policy setting, we must apply the behavior policy until the state we are interested in appears. In 2-step TD, we are interested in $s_{t+2}$, so we have to apply $q(a_t \mid s_t)$ and $q(a_{t+1} \mid s_{t+1})$ to drive the transition pdfs (which generate the next states): $p(s_{t+1} \mid s_t, a_t)$ and $p(s_{t+2} \mid s_{t+1}, a_{t+1})$.
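As a rough sketch of that two-step rollout (assuming a hypothetical discrete-action environment with an `env.step(s, a)` transition function and a `behavior_policy(s)` that returns the probability vector $q(\cdot \mid s)$; none of these names come from the text):

```python
import numpy as np

def sample_two_step(env, s_t, behavior_policy, rng):
    """Roll out two steps with the behavior policy q to reach s_{t+2}.

    Assumptions (not from the text): `env.step(s, a)` returns (next_state, reward),
    and `behavior_policy(s)` returns the probability vector q(. | s) over a
    discrete action set.
    """
    q_t = behavior_policy(s_t)
    a_t = rng.choice(len(q_t), p=q_t)            # a_t ~ q(. | s_t)
    s_t1, r_t = env.step(s_t, a_t)               # s_{t+1} ~ p(. | s_t, a_t)

    q_t1 = behavior_policy(s_t1)
    a_t1 = rng.choice(len(q_t1), p=q_t1)         # a_{t+1} ~ q(. | s_{t+1})
    s_t2, r_t1 = env.step(s_t1, a_t1)            # s_{t+2} ~ p(. | s_{t+1}, a_{t+1})

    return a_t, r_t, s_t1, a_t1, r_t1, s_t2
```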
Here is the problem: $p(a_{t+2} \mid s_{t+2})$ is the target policy, so it is fine. But what about $p(a_{t+1} \mid s_{t+1})$? Since we have to sample with the behavior policy $q$, $p(a_{t+1} \mid s_{t+1})$ becomes a problem. We use importance sampling to solve it.
Off-policy with importance sampling
We obtain importance sampling from the sampling idea of Monte Carlo.
$$
\begin{aligned}
\mathbb{E}[x] &= \int_x x\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} x_i, \quad x_i \sim p(x) \\
\mathbb{E}[f(x)] &= \int_x f(x)\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \quad x_i \sim p(x) \\
\Rightarrow \quad & \int_x x\, \frac{p(x)}{q(x)}\, q(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} x_i\, \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q(x)
\end{aligned}
$$
If $N$ is large enough, we can regard $\frac{1}{N}\sum_{i=1}^{N} x_i$ with $x_i \sim p(x)$ and $\frac{1}{N}\sum_{i=1}^{N} x_i\,\frac{p(x_i)}{q(x_i)}$ with $x_i \sim q(x)$ as almost the same. With this trick, we can obtain essentially the same result while sampling from a different probability distribution!
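Here is a minimal numerical sketch of this trick, assuming a hypothetical 1-D example where the target distribution $p$ is $\mathcal{N}(1, 1)$ and the behavior distribution $q$ is a wider $\mathcal{N}(0, 2)$; both estimators approach the same expectation:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
N = 200_000

# Hypothetical target distribution p = N(1, 1) and behavior distribution q = N(0, 2).
p = lambda x: gaussian_pdf(x, mu=1.0, sigma=1.0)
q = lambda x: gaussian_pdf(x, mu=0.0, sigma=2.0)

# Direct Monte Carlo: x_i ~ p
x_p = rng.normal(1.0, 1.0, size=N)
direct_estimate = x_p.mean()

# Importance sampling: x_i ~ q, weighted by p(x_i) / q(x_i)
x_q = rng.normal(0.0, 2.0, size=N)
is_estimate = np.mean(x_q * p(x_q) / q(x_q))

print(direct_estimate, is_estimate)  # both approach E_p[x] = 1
```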
Let’s apply this trick to the action-value function:
$$Q(s_t, a_t) = \int_{s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}} \big(R_t + \gamma R_{t+1} + \gamma^2 Q(s_{t+2}, a_{t+2})\big)\, p(a_{t+2} \mid s_{t+2})\, p(s_{t+2} \mid s_{t+1}, a_{t+1})\, \frac{p(a_{t+1} \mid s_{t+1})}{q(a_{t+1} \mid s_{t+1})}\, q(a_{t+1} \mid s_{t+1})\, p(s_{t+1} \mid s_t, a_t)\, ds_{t+1}\, da_{t+1}\, ds_{t+2}\, da_{t+2}$$
Now it is OK to sample from the behavior policy $q(a_{t+1} \mid s_{t+1})$.
And, since we know that $p(a_{t+2} \mid s_{t+2})$ is the target policy, we define it as a delta function (one that selects the action maximizing $Q(s_{t+2}, a_{t+2})$): $\delta(a_{t+2} - a_{t+2}^{*})$.
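Concretely, by the sifting property of the delta function, the integral over $a_{t+2}$ then picks out the maximizing action $a_{t+2}^{*} = \arg\max_{a_{t+2}} Q(s_{t+2}, a_{t+2})$:

$$\int_{a_{t+2}} Q(s_{t+2}, a_{t+2})\, \delta(a_{t+2} - a_{t+2}^{*})\, da_{t+2} = Q(s_{t+2}, a_{t+2}^{*}) = \max_{a_{t+2}} Q(s_{t+2}, a_{t+2})$$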
Then, the TD-target becomes:
$$\frac{p(a_{t+1}^{(N)} \mid s_{t+1})}{q(a_{t+1}^{(N)} \mid s_{t+1})}\left(R_t + \gamma R_{t+1} + \gamma^2 \max_{a_{t+2}} Q\big(s_{t+2}^{(N)}, a_{t+2}\big)\right)$$
$R_t$ is determined by the sampled $a_t$, and $\gamma R_{t+1}$ is determined by the sampled $a_{t+1}$. The important thing is that both samples come from the behavior policy!
$Q(s_{t+2}^{(N)}, a_{t+2})$ is determined by $a_{t+2} \sim p$.
One important point about the ratio $\frac{p(a_{t+1}^{(N)} \mid s_{t+1})}{q(a_{t+1}^{(N)} \mid s_{t+1})}$ is that $q$ is a probability density function (pdf): it assigns probability to intervals, not to individual points, so the probability of the exact sample $a_{t+1}^{(N)}$ is not a single well-defined value. On the other hand, $p$, being the target policy, may take the form of a delta function; in that case its value at a specific point is infinite, which makes the ratio hard to evaluate precisely. Consequently, this term is often omitted from the TD target in practice.
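To tie everything together, here is a minimal sketch of this TD target in a tabular, discrete-action setting (the function `two_step_td_target`, the tiny Q-table, and the near-greedy target policy are all hypothetical). With discrete policies the ratio $p/q$ is an ordinary probability ratio and is well defined; setting `use_ratio=False` reproduces the common simplification of dropping it:

```python
import numpy as np

def two_step_td_target(r_t, r_t1, s_t2, a_t1, s_t1, Q, gamma,
                       target_probs, behavior_probs, use_ratio=True):
    """Hypothetical 2-step off-policy TD target for a tabular, discrete-action setting.

    `Q` is a (num_states, num_actions) table; `target_probs(s)` and `behavior_probs(s)`
    return p(. | s) and q(. | s) as probability vectors, so the ratio is well defined.
    """
    rho = 1.0
    if use_ratio:
        # Importance ratio p(a_{t+1} | s_{t+1}) / q(a_{t+1} | s_{t+1})
        rho = target_probs(s_t1)[a_t1] / behavior_probs(s_t1)[a_t1]
    return rho * (r_t + gamma * r_t1 + gamma**2 * np.max(Q[s_t2]))

# Usage sketch with a tiny random Q-table, a uniform behavior policy q,
# and a near-greedy target policy p.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 3))
behavior = lambda s: np.full(3, 1.0 / 3.0)                   # q: uniform over 3 actions
target = lambda s: np.eye(3)[np.argmax(Q[s])] * 0.94 + 0.02  # p: 0.96 on the greedy action

target_value = two_step_td_target(r_t=1.0, r_t1=0.5, s_t2=4, a_t1=2, s_t1=3,
                                  Q=Q, gamma=0.9,
                                  target_probs=target, behavior_probs=behavior)

alpha = 0.1
s_t, a_t = 0, 1
Q[s_t, a_t] += alpha * (target_value - Q[s_t, a_t])          # TD update toward the target
```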