4.4 n-step TD vs n-step Q-learning

Tommy Kim · September 18, 2023

2-step TD

Recall the equation for $G_t$ (the return):

$$
\begin{aligned}
G_t &= R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots\\
&= R_t + \gamma R_{t+1} + \gamma^2 G_{t+2}
\end{aligned}
$$
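As a quick numerical sanity check of this recursion (my own snippet, not from the post; the discount factor and reward values are made up):

```python
import numpy as np

gamma = 0.9
rewards = [1.0, 0.5, -0.2, 2.0, 0.3]    # R_t, R_{t+1}, R_{t+2}, ... (made-up values)

def discounted_return(rewards, gamma):
    """G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

G_t  = discounted_return(rewards, gamma)
G_t2 = discounted_return(rewards[2:], gamma)   # G_{t+2}: the return starting two steps later

# Both forms agree: G_t = R_t + gamma*R_{t+1} + gamma^2 * G_{t+2}
assert np.isclose(G_t, rewards[0] + gamma * rewards[1] + gamma**2 * G_t2)
```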

We will use this equation form to build 2-step TD.
Let’s apply the Bellman equation to the action-value function.

$$
\begin{aligned}
Q(s_t,a_t) &= \int\limits_{s_{t+1},a_{t+1}} \big(R_t + \gamma Q(s_{t+1}, a_{t+1})\big)\, p(a_{t+1}|s_{t+1})\, p(s_{t+1}|s_t,a_t)\, ds_{t+1}\, da_{t+1}\\
&= \int\limits_{s_{t+1},a_{t+1},s_{t+2},a_{t+2}} \big(R_t + \gamma R_{t+1} + \gamma^2 Q(s_{t+2}, a_{t+2})\big)\, p(a_{t+2}|s_{t+2})\, p(s_{t+2}|s_{t+1},a_{t+1})\, p(a_{t+1}|s_{t+1})\, p(s_{t+1}|s_t,a_t)\, ds_{t+1}\, da_{t+1}\, ds_{t+2}\, da_{t+2}
\end{aligned}
$$
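To make the 1-step and 2-step forms concrete, here is a minimal sketch (my own illustration; the tabular Q array, states, actions, and rewards are all made up) of the sample targets that correspond to the two integrands above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
gamma = 0.9
Q = rng.random((n_states, n_actions))      # tabular action-value estimates

# One sampled fragment: (s_t, a_t, R_t), (s_{t+1}, a_{t+1}, R_{t+1}), (s_{t+2}, a_{t+2})
s_t, a_t, R_t = 0, 1, 1.0                  # s_t, a_t index the Q entry these targets would update
s_t1, a_t1, R_t1 = 2, 0, 0.5
s_t2, a_t2 = 3, 1

# 1-step sample of the first integrand: R_t + gamma * Q(s_{t+1}, a_{t+1})
target_1step = R_t + gamma * Q[s_t1, a_t1]

# 2-step sample of the second integrand: R_t + gamma*R_{t+1} + gamma^2 * Q(s_{t+2}, a_{t+2})
target_2step = R_t + gamma * R_t1 + gamma**2 * Q[s_t2, a_t2]

print(target_1step, target_2step)
```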

As we learned in the off-policy setting, we must follow the behavior policy until the state we are interested in appears. In the 2-step case, we are interested in $s_{t+2}$. So we have to apply $q(a_t|s_t)$ and $q(a_{t+1}|s_{t+1})$ when drawing the next states from the transition pdfs $p(s_{t+1}|s_t,a_t)$ and $p(s_{t+2}|s_{t+1},a_{t+1})$.

Here is the problem: $p(a_{t+2}|s_{t+2})$ becomes the target policy, so it is fine. But how about $p(a_{t+1}|s_{t+1})$? Since $a_{t+1}$ has to be sampled from the behavior policy $q$, $p(a_{t+1}|s_{t+1})$ becomes a problem. We use importance sampling to solve this.

Off-policy with importance sampling

Importance sampling builds on the sampling idea from Monte Carlo estimation.

$$
\begin{aligned}
E[x] &= \int\limits_x x\, p(x)\, dx \approx \frac{1}{N} \sum\limits_{i=1}^N x_i, \quad x_i \sim p(x)\\
E[f(x)] &= \int\limits_x f(x)\, p(x)\, dx \approx \frac{1}{N} \sum\limits_{i=1}^N f(x_i), \quad x_i \sim p(x)\\
E[x] &= \int\limits_x x\, \frac{p(x)}{q(x)}\, q(x)\, dx \approx \frac{1}{N} \sum\limits_{i=1}^N x_i\, \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q(x)
\end{aligned}
$$

If $N$ gets large enough, we can treat $\frac{1}{N}\sum_{i=1}^N x_i$ with $x_i \sim p(x)$ and $\frac{1}{N}\sum_{i=1}^N x_i \frac{p(x_i)}{q(x_i)}$ with $x_i \sim q(x)$ as almost the same. With this trick, we can get a similar result while sampling from a different probability distribution!
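Here is a minimal NumPy sketch of this trick (my own example, not from the post): estimating $E_p[x]$ for a target distribution $p$ using samples drawn from a different proposal $q$. The particular Gaussians are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Target p: Normal(1, 0.5^2); proposal q: Normal(0, 1). Chosen only for illustration.
def p_pdf(x):
    return np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))

def q_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

# Direct Monte Carlo: sample x_i ~ p and average.
x_p = rng.normal(1.0, 0.5, size=N)
direct_estimate = x_p.mean()

# Importance sampling: sample x_i ~ q and reweight each sample by p(x_i)/q(x_i).
x_q = rng.normal(0.0, 1.0, size=N)
is_estimate = np.mean(x_q * p_pdf(x_q) / q_pdf(x_q))

print(direct_estimate, is_estimate)    # both approach E_p[x] = 1.0 as N grows
```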

Let’s apply this trick to the action-value function:

$$
\begin{aligned}
Q(s_t,a_t) &= \int\limits_{s_{t+1},a_{t+1},s_{t+2},a_{t+2}} \big(R_t + \gamma R_{t+1} + \gamma^2 Q(s_{t+2}, a_{t+2})\big)\, p(a_{t+2}|s_{t+2})\, p(s_{t+2}|s_{t+1},a_{t+1})\, \frac{p(a_{t+1}|s_{t+1})}{q(a_{t+1}|s_{t+1})}\, q(a_{t+1}|s_{t+1})\, p(s_{t+1}|s_t,a_t)\, ds_{t+1}\, da_{t+1}\, ds_{t+2}\, da_{t+2}
\end{aligned}
$$

Now it is OK to sample $a_{t+1}$ from the behavior policy $q(a_{t+1}|s_{t+1})$.
And since $p(a_{t+2}|s_{t+2})$ is the target policy, we define it as a delta function that picks the action maximizing $Q(s_{t+2}, \cdot)$: $\delta(a_{t+2} - a_{t+2}^*)$, where $a_{t+2}^* = \arg\max_{a_{t+2}} Q(s_{t+2}, a_{t+2})$.
Then, the TD-target becomes:
$$
\frac{p(a_{t+1}^{(N)}|s_{t+1})}{q(a_{t+1}^{(N)}|s_{t+1})} \left(R_t + \gamma R_{t+1} + \gamma^2 \max\limits_{a_{t+2}} Q(s_{t+2}^{(N)}, a_{t+2})\right)
$$

$R_t$ is decided by the $a_t$ sample, and $\gamma R_{t+1}$ is decided by the $a_{t+1}$ sample. The important point is that both samples come from the behavior policy!
$Q(s_{t+2}^{(N)}, a_{t+2})$ is decided by $a_{t+2} \sim p$.

One important point about the ratio $\frac{p(a_{t+1}^{(N)}|s_{t+1})}{q(a_{t+1}^{(N)}|s_{t+1})}$ is that $q$ is a probability density function (pdf): the probability of the single point $a_{t+1}^{(N)}$ does not exist on its own, because a density assigns probability to intervals rather than to individual points. On the other hand, $p$, being the target policy, can take the form of a delta function; if it does, it has an infinite value at a single point, making it hard to assign the ratio a precise value. Consequently, this term is often omitted from the TD-target in practice.
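Putting the pieces together, here is a minimal tabular sketch of a 2-step Q-learning update using this TD-target (my own illustration, not code from the post). The $\varepsilon$-greedy behavior policy, the learning rate, and the trajectory format are assumptions; the importance ratio sits behind a flag so it can be dropped, as discussed above.

```python
import numpy as np

def two_step_q_update(Q, frag, gamma=0.99, alpha=0.1, epsilon=0.1, use_ratio=True):
    """One 2-step Q-learning update from a sampled fragment
    frag = (s_t, a_t, R_t, s_t1, a_t1, R_t1, s_t2).
    Behavior policy q: epsilon-greedy w.r.t. Q. Target policy p: greedy (the delta function above)."""
    s_t, a_t, R_t, s_t1, a_t1, R_t1, s_t2 = frag
    n_actions = Q.shape[1]
    greedy = np.argmax(Q[s_t1])

    if use_ratio:
        q_prob = epsilon / n_actions + (1.0 - epsilon) * (a_t1 == greedy)  # q(a_{t+1}|s_{t+1})
        p_prob = 1.0 if a_t1 == greedy else 0.0                            # p(a_{t+1}|s_{t+1}), greedy target
        rho = p_prob / q_prob
    else:
        rho = 1.0   # importance ratio omitted, as is often done in practice

    td_target = rho * (R_t + gamma * R_t1 + gamma**2 * np.max(Q[s_t2]))
    Q[s_t, a_t] += alpha * (td_target - Q[s_t, a_t])
    return Q

# Example usage with a toy table and one hand-made trajectory fragment.
Q = np.zeros((4, 2))
Q = two_step_q_update(Q, frag=(0, 1, 1.0, 2, 0, 0.5, 3))
```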
