Temporal difference (TD) learning has a problem.
Let's review SARSA (derived from the Bellman equation) and Monte Carlo (MC).
Monte Carlo:
The problem is that the sample SARSA uses, built from Q(s_t, a_t) and Q(s_{t+1}, a_{t+1}), is not a perfect sample. The two methods sample differently. Monte Carlo samples complete paths, and as the number of samples N increases it covers all possible paths. SARSA samples only a single step and then relies on the Q-value of the next step. That next-step Q-value is itself only an estimate of the average (expectation) over everything that follows, out to infinitely many steps, so it is not a perfect sample either. Because the method keeps updating from next-step values that are themselves imperfect, the imperfection carries into every update. This introduces bias.
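For reference, the two update rules can be written side by side. This is a sketch in standard textbook notation: the step size α, the discount factor γ, and the return G_t are assumed symbols, not taken from the text above.

```latex
\begin{align*}
\text{Monte Carlo:}\quad & Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ G_t - Q(s_t, a_t) \right],
\qquad G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \\
\text{SARSA:}\quad & Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
\end{align*}
```

The only difference is the target inside the brackets. MC uses the full sampled return G_t, while SARSA replaces everything after the first reward with the current estimate Q(s_{t+1}, a_{t+1}). If that estimate is off on average, the SARSA target is off on average as well, scaled by γ.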
MC vs TD
MC:
- unbiased
- high variance (∵ the return sums over many possible paths)
TD:
- biased
- low variance (∵ it depends only on the next step)
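The bias and variance contrast can be checked numerically. The following is a minimal sketch on a toy problem, not a full SARSA implementation: every step gives a noisy reward with mean 1, and the next-step Q estimate is deliberately made wrong by a fixed amount. The names `GAMMA`, `HORIZON`, `sample_reward`, `q_next_estimate`, and `estimate_error` are all hypothetical and chosen only for illustration.

```python
import random
import statistics

GAMMA = 0.9        # discount factor (assumed)
HORIZON = 10       # number of rewards per episode (assumed)
N_SAMPLES = 20000  # number of sampled targets per method


def sample_reward(rng):
    # Toy reward: mean 1, standard deviation 1 at every step.
    return rng.gauss(1.0, 1.0)


def true_value(steps_left):
    # Expected discounted return when `steps_left` rewards remain.
    return sum(GAMMA ** k for k in range(steps_left))


def mc_target(rng):
    # Monte Carlo: roll out the whole episode and sum the discounted rewards.
    return sum(GAMMA ** k * sample_reward(rng) for k in range(HORIZON))


def td_target(rng, q_next_estimate):
    # SARSA-style TD: one sampled reward plus the current next-step estimate.
    return sample_reward(rng) + GAMMA * q_next_estimate


def compare(seed=0, estimate_error=-2.0):
    rng = random.Random(seed)
    # The next-step estimate is deliberately off by `estimate_error`.
    q_next_estimate = true_value(HORIZON - 1) + estimate_error
    mc = [mc_target(rng) for _ in range(N_SAMPLES)]
    td = [td_target(rng, q_next_estimate) for _ in range(N_SAMPLES)]
    v = true_value(HORIZON)
    return {
        "mc_bias": statistics.mean(mc) - v,
        "td_bias": statistics.mean(td) - v,
        "mc_var": statistics.pvariance(mc),
        "td_var": statistics.pvariance(td),
    }


if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.3f}")
```

In this setup the MC bias should come out close to zero, while the TD bias should sit near `GAMMA * estimate_error`. The MC variance should be several times the TD variance, because the MC target adds up the noise of all ten rewards and the TD target contains the noise of only one.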