3.4 MC vs TD

Tommy Kim · September 14, 2023

Reinforcement Learning - hyukppenheim youtube



Temporal-difference (TD) learning has a problem that Monte Carlo (MC) does not.

Let's review SARSA (derived from the Bellman equation) and MC.
SARSA:

$$
\begin{aligned}
Q(s_t,a_t) &= \int\limits_{s_{t+1}:a_\infty} G_t \, p(s_{t+1}, a_{t+1}, \ldots \mid s_t, a_t) \, d(s_{t+1}{:}a_\infty) \\
&= \int\limits_{s_{t+1},a_{t+1}} \bigl(R_t + \gamma Q(s_{t+1}, a_{t+1})\bigr) \, p(s_{t+1}, a_{t+1} \mid s_t, a_t) \, d(s_{t+1}, a_{t+1}) \\
&\approx \frac{1}{N} \sum\limits_{i=1}^N \bigl(R_t^{(i)} + \gamma Q(s_{t+1}^{(i)}, a_{t+1}^{(i)})\bigr)
\end{aligned}
$$

Monte Carlo:
$Q(s_t, a_t) \approx \frac {1}{N} \sum\limits_{i=1}^N G_t^{(i)}$
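
To make the two sample averages concrete, here is a minimal Python sketch. It is only an illustration: the episode layout (a list of `(state, action, reward)` steps), the dictionary-based Q-table, and the function names `discounted_return`, `mc_estimate`, and `sarsa_estimate` are all assumptions made for this example, not part of the original lecture.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: an episode is a list of (state, action, reward) steps,
# and every episode starts at the same pair (s_t, a_t).
Step = Tuple[int, int, float]


def discounted_return(rewards: List[float], gamma: float) -> float:
    """G_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mc_estimate(episodes: List[List[Step]], gamma: float) -> float:
    """Monte Carlo: average the full return G_t over the N episodes."""
    returns = [discounted_return([r for _, _, r in ep], gamma) for ep in episodes]
    return sum(returns) / len(returns)


def sarsa_estimate(
    episodes: List[List[Step]],
    q: Dict[Tuple[int, int], float],
    gamma: float,
) -> float:
    """SARSA: average the one-step target R_t + gamma * Q(s_{t+1}, a_{t+1})."""
    targets = []
    for ep in episodes:
        _, _, r_t = ep[0]
        if len(ep) > 1:
            s_next, a_next, _ = ep[1]
            targets.append(r_t + gamma * q[(s_next, a_next)])
        else:
            # Episode ended after one step, so there is nothing to bootstrap from.
            targets.append(r_t)
    return sum(targets) / len(targets)


# Two toy episodes that both start from (state 0, action 0).
episodes = [
    [(0, 0, 1.0), (1, 0, 2.0)],
    [(0, 0, 1.0), (2, 0, 0.0)],
]
exact_q = {(1, 0): 2.0, (2, 0): 0.0}      # next-step values that happen to be exact
untrained_q = {(1, 0): 0.0, (2, 0): 0.0}  # next-step values before any learning

print(mc_estimate(episodes, 0.5))
print(sarsa_estimate(episodes, exact_q, 0.5))
print(sarsa_estimate(episodes, untrained_q, 0.5))
```

The MC average depends only on the sampled rewards, whereas the SARSA average also depends on whatever Q-table is passed in, so the two agree only when the next-step Q-values are accurate.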

The problem is that the samples SARSA uses ($Q(s_t, a_t)$, $Q(s_{t+1}, a_{t+1})$, ...) are not perfect. The two methods sample differently. Monte Carlo samples complete paths, so as $N$ increases its average covers all possible paths. SARSA instead samples only one step and substitutes the Q-value of the next step for the rest of the return. That Q-value is itself only an estimate of the average (expected) return over all steps up to $a_\infty$, so it is not a perfect sample. Because every update relies on next-step values that are themselves imperfect estimates, the error carries over from one update to the next. This introduces bias.
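
The bias can be seen in a small simulation. The sketch below assumes a made-up two-step problem: the first reward is always 1, the second is 0 or 2 with equal probability, and $\gamma = 0.9$, so the true $Q(s_t, a_t)$ is 1.9. The constants and the function names `sample_episode`, `mc_average`, and `sarsa_average` are hypothetical and chosen only for this illustration.

```python
import random

GAMMA = 0.9
TRUE_Q_NEXT = 1.0                    # true Q(s_{t+1}, a_{t+1}): mean of the second reward
TRUE_Q = 1.0 + GAMMA * TRUE_Q_NEXT   # true Q(s_t, a_t) = 1.9 in this toy problem


def sample_episode(rng: random.Random):
    """Sample one two-step episode and return (R_t, R_{t+1})."""
    r_t = 1.0
    r_next = rng.choice([0.0, 2.0])
    return r_t, r_next


def mc_average(n: int, rng: random.Random) -> float:
    """Average of n full returns G_t = R_t + gamma * R_{t+1}."""
    total = 0.0
    for _ in range(n):
        r_t, r_next = sample_episode(rng)
        total += r_t + GAMMA * r_next
    return total / n


def sarsa_average(n: int, q_next_estimate: float, rng: random.Random) -> float:
    """Average of n one-step targets R_t + gamma * Q_hat(s_{t+1}, a_{t+1})."""
    total = 0.0
    for _ in range(n):
        r_t, _ = sample_episode(rng)
        total += r_t + GAMMA * q_next_estimate
    return total / n


rng = random.Random(0)
for n in (10, 1000, 100000):
    mc = mc_average(n, rng)
    td = sarsa_average(n, 0.5, rng)  # 0.5 is a deliberately wrong next-step estimate
    print(n, "MC error:", mc - TRUE_Q, "SARSA error:", td - TRUE_Q)
```

In this setup the SARSA error stays at about $0.9 \times (0.5 - 1.0) = -0.45$ however large $N$ gets, because more samples cannot correct a wrong next-step estimate. The MC error shrinks toward zero as $N$ grows, though it fluctuates more for small $N$.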