4.4 n-step TD vs n-step Q-learning

Tommy Kim · September 18, 2023

2-step TD

Recall the equation for $G_t$ (the return):

$$
\begin{aligned}
G_t &= R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots\\
&= R_t + \gamma R_{t+1} + \gamma^2 G_{t+2}
\end{aligned}
$$
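As a quick numerical sanity check of this recursion (my own snippet, not from the post; the discount factor and reward values are made up):

```python
import numpy as np

gamma = 0.9
rewards = [1.0, 0.5, -0.2, 2.0, 0.3]    # R_t, R_{t+1}, R_{t+2}, ... (made-up values)

def discounted_return(rewards, gamma):
    """G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

G_t  = discounted_return(rewards, gamma)
G_t2 = discounted_return(rewards[2:], gamma)   # G_{t+2}: the return starting two steps later

# Both forms agree: G_t = R_t + gamma*R_{t+1} + gamma^2 * G_{t+2}
assert np.isclose(G_t, rewards[0] + gamma * rewards[1] + gamma**2 * G_t2)
```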

We will use this equation form to build 2-step TD.
Let’s apply the Bellman equation to the action-value function.

$$
\begin{aligned}
Q(s_t,a_t) &= \int\limits_{s_{t+1},a_{t+1}} \big(R_t + \gamma Q(s_{t+1}, a_{t+1})\big)\, p(a_{t+1}|s_{t+1})\, p(s_{t+1}|s_t,a_t)\, ds_{t+1}\, da_{t+1}\\
&= \int\limits_{s_{t+1},a_{t+1},s_{t+2},a_{t+2}} \big(R_t + \gamma R_{t+1} + \gamma^2 Q(s_{t+2}, a_{t+2})\big)\, p(a_{t+2}|s_{t+2})\, p(s_{t+2}|s_{t+1},a_{t+1})\, p(a_{t+1}|s_{t+1})\, p(s_{t+1}|s_t,a_t)\, ds_{t+1}\, da_{t+1}\, ds_{t+2}\, da_{t+2}
\end{aligned}
$$
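To make the 1-step and 2-step forms concrete, here is a minimal sketch (my own illustration; the tabular Q array, states, actions, and rewards are all made up) of the sample targets that correspond to the two integrands above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
gamma = 0.9
Q = rng.random((n_states, n_actions))      # tabular action-value estimates

# One sampled fragment: (s_t, a_t, R_t), (s_{t+1}, a_{t+1}, R_{t+1}), (s_{t+2}, a_{t+2})
s_t, a_t, R_t = 0, 1, 1.0                  # s_t, a_t index the Q entry these targets would update
s_t1, a_t1, R_t1 = 2, 0, 0.5
s_t2, a_t2 = 3, 1

# 1-step sample of the first integrand: R_t + gamma * Q(s_{t+1}, a_{t+1})
target_1step = R_t + gamma * Q[s_t1, a_t1]

# 2-step sample of the second integrand: R_t + gamma*R_{t+1} + gamma^2 * Q(s_{t+2}, a_{t+2})
target_2step = R_t + gamma * R_t1 + gamma**2 * Q[s_t2, a_t2]

print(target_1step, target_2step)
```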

As we learned in the off-policy setting, we must follow the behavior policy until the state we are interested in appears. In the 2-step case, we are interested in $s_{t+2}$. So we have to apply $q(a_t|s_t)$ and $q(a_{t+1}|s_{t+1})$ when drawing the next states from the transition pdfs $p(s_{t+1}|s_t,a_t)$ and $p(s_{t+2}|s_{t+1},a_{t+1})$.

Here is the problem: $p(a_{t+2}|s_{t+2})$ becomes the target policy, so it is fine. But how about $p(a_{t+1}|s_{t+1})$? Since $a_{t+1}$ has to be sampled from the behavior policy $q$, $p(a_{t+1}|s_{t+1})$ becomes a problem. We use importance sampling to solve this.

Off-policy with importance sampling

Importance sampling builds on the sampling idea from Monte Carlo estimation.

$$
\begin{aligned}
E[x] &= \int\limits_x x\, p(x)\, dx \approx \frac{1}{N} \sum\limits_{i=1}^N x_i, \quad x_i \sim p(x)\\
E[f(x)] &= \int\limits_x f(x)\, p(x)\, dx \approx \frac{1}{N} \sum\limits_{i=1}^N f(x_i), \quad x_i \sim p(x)\\
E[x] &= \int\limits_x x\, \frac{p(x)}{q(x)}\, q(x)\, dx \approx \frac{1}{N} \sum\limits_{i=1}^N x_i\, \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q(x)
\end{aligned}
$$

If $N$ gets large enough, we can treat $\frac{1}{N}\sum_{i=1}^N x_i$ with $x_i \sim p(x)$ and $\frac{1}{N}\sum_{i=1}^N x_i \frac{p(x_i)}{q(x_i)}$ with $x_i \sim q(x)$ as almost the same. With this trick, we can get a similar result while sampling from a different probability distribution!
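Here is a minimal NumPy sketch of this trick (my own example, not from the post): estimating $E_p[x]$ for a target distribution $p$ using samples drawn from a different proposal $q$. The particular Gaussians are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Target p: Normal(1, 0.5^2); proposal q: Normal(0, 1). Chosen only for illustration.
def p_pdf(x):
    return np.exp(-0.5 * ((x - 1.0) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))

def q_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

# Direct Monte Carlo: sample x_i ~ p and average.
x_p = rng.normal(1.0, 0.5, size=N)
direct_estimate = x_p.mean()

# Importance sampling: sample x_i ~ q and reweight each sample by p(x_i)/q(x_i).
x_q = rng.normal(0.0, 1.0, size=N)
is_estimate = np.mean(x_q * p_pdf(x_q) / q_pdf(x_q))

print(direct_estimate, is_estimate)    # both approach E_p[x] = 1.0 as N grows
```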

Let’s apply this trick to the action-value function:

$$
\begin{aligned}
Q(s_t,a_t) &= \int\limits_{s_{t+1},a_{t+1},s_{t+2},a_{t+2}} \big(R_t + \gamma R_{t+1} + \gamma^2 Q(s_{t+2}, a_{t+2})\big)\, p(a_{t+2}|s_{t+2})\, p(s_{t+2}|s_{t+1},a_{t+1})\, \frac{p(a_{t+1}|s_{t+1})}{q(a_{t+1}|s_{t+1})}\, q(a_{t+1}|s_{t+1})\, p(s_{t+1}|s_t,a_t)\, ds_{t+1}\, da_{t+1}\, ds_{t+2}\, da_{t+2}
\end{aligned}
$$

Now it is OK to sample $a_{t+1}$ from the behavior policy $q(a_{t+1}|s_{t+1})$.
And since $p(a_{t+2}|s_{t+2})$ is the target policy, we define it as a delta function that picks the action maximizing $Q(s_{t+2}, \cdot)$: $\delta(a_{t+2} - a_{t+2}^*)$, where $a_{t+2}^* = \arg\max_{a_{t+2}} Q(s_{t+2}, a_{t+2})$.
Then, the TD-target becomes:
$$
\frac{p(a_{t+1}^{(N)}|s_{t+1})}{q(a_{t+1}^{(N)}|s_{t+1})} \left(R_t + \gamma R_{t+1} + \gamma^2 \max\limits_{a_{t+2}} Q(s_{t+2}^{(N)}, a_{t+2})\right)
$$

$R_t$ is decided by the $a_t$ sample, and $\gamma R_{t+1}$ is decided by the $a_{t+1}$ sample. The important point is that both samples come from the behavior policy!
$Q(s_{t+2}^{(N)}, a_{t+2})$ is decided by $a_{t+2} \sim p$.

One important point about the ratio $\frac{p(a_{t+1}^{(N)}|s_{t+1})}{q(a_{t+1}^{(N)}|s_{t+1})}$ is that $q$ is a probability density function (pdf): the probability of the single point $a_{t+1}^{(N)}$ does not exist on its own, because a density assigns probability to intervals rather than to individual points. On the other hand, $p$, being the target policy, can take the form of a delta function; if it does, it has an infinite value at a single point, making it hard to assign the ratio a precise value. Consequently, this term is often omitted from the TD-target in practice.
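Putting the pieces together, here is a minimal tabular sketch of a 2-step Q-learning update using this TD-target (my own illustration, not code from the post). The $\varepsilon$-greedy behavior policy, the learning rate, and the trajectory format are assumptions; the importance ratio sits behind a flag so it can be dropped, as discussed above.

```python
import numpy as np

def two_step_q_update(Q, frag, gamma=0.99, alpha=0.1, epsilon=0.1, use_ratio=True):
    """One 2-step Q-learning update from a sampled fragment
    frag = (s_t, a_t, R_t, s_t1, a_t1, R_t1, s_t2).
    Behavior policy q: epsilon-greedy w.r.t. Q. Target policy p: greedy (the delta function above)."""
    s_t, a_t, R_t, s_t1, a_t1, R_t1, s_t2 = frag
    n_actions = Q.shape[1]
    greedy = np.argmax(Q[s_t1])

    if use_ratio:
        q_prob = epsilon / n_actions + (1.0 - epsilon) * (a_t1 == greedy)  # q(a_{t+1}|s_{t+1})
        p_prob = 1.0 if a_t1 == greedy else 0.0                            # p(a_{t+1}|s_{t+1}), greedy target
        rho = p_prob / q_prob
    else:
        rho = 1.0   # importance ratio omitted, as is often done in practice

    td_target = rho * (R_t + gamma * R_t1 + gamma**2 * np.max(Q[s_t2]))
    Q[s_t, a_t] += alpha * (td_target - Q[s_t, a_t])
    return Q

# Example usage with a toy table and one hand-made trajectory fragment.
Q = np.zeros((4, 2))
Q = two_step_q_update(Q, frag=(0, 1, 1.0, 2, 0, 0.5, 3))
```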
