Temporal difference (TD) learning has a problem.
Let's review SARSA (derived from the Bellman equation) and Monte Carlo (MC).
SARSA:
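$$
\begin{aligned}
Q(s_t, a_t) &= \int_{s_{t+1}:a_\infty} G_t \, p(s_{t+1}, a_{t+1}, \dots \mid s_t, a_t) \, d s_{t+1}:a_\infty \\
&= \int_{s_{t+1}, a_{t+1}} \big(R_t + \gamma Q(s_{t+1}, a_{t+1})\big) \, p(s_{t+1}, a_{t+1} \mid s_t, a_t) \, d s_{t+1}, a_{t+1} \\
&\approx \frac{1}{N} \sum_{i=1}^{N} \Big(R_t^{(i)} + \gamma Q\big(s_{t+1}^{(i)}, a_{t+1}^{(i)}\big)\Big)
\end{aligned}
$$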
Monte Carlo:
$$
Q(s_t, a_t) \approx \frac{1}{N} \sum_{i=1}^{N} G_t^{(i)}
$$
The problem is that the samples SARSA averages over are not perfect. The two methods sample differently. Monte Carlo samples complete trajectories, so each $G_t^{(i)}$ is an actual return, and the average covers more of the possible paths as $N$ increases. SARSA samples only one step and plugs in the Q-value of the next step, $Q(s_{t+1}, a_{t+1})$. That Q-value stands in for the expectation over everything from $s_{t+2}$ up to $a_\infty$, but in practice it is only the current estimate of that expectation, so it is not a perfect sample either. The method therefore keeps updating its estimates with targets built from later estimates that are themselves imperfect (bootstrapping), and this introduces bias.
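To make the source of the bias explicit, here is a short worked equation. The symbol $\hat{Q}$ is an assumption introduced only for this illustration: it denotes the current, imperfect estimate, while $Q$ is the true value. Taking the expectation of the SARSA target and subtracting the Bellman equation for $Q(s_t, a_t)$ gives:

$$
\mathbb{E}\big[R_t + \gamma \hat{Q}(s_{t+1}, a_{t+1}) \mid s_t, a_t\big] - Q(s_t, a_t)
= \gamma \, \mathbb{E}\big[\hat{Q}(s_{t+1}, a_{t+1}) - Q(s_{t+1}, a_{t+1}) \mid s_t, a_t\big]
$$

The right-hand side is zero only when the next-step estimate is already correct on average. The Monte Carlo target has no such term, because $\mathbb{E}[G_t \mid s_t, a_t] = Q(s_t, a_t)$ by definition.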
MC vs TD
MC
- unbiased
- high variance (∵ a full return can come from so many possible paths)
TD
- biased
- low variance (∵ the target depends only on the next step)
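The comparison above can be checked numerically with a minimal sketch. Everything here is an assumption chosen for illustration, not part of the original notes: a toy chain of `HORIZON` steps where every reward is drawn from a normal distribution with mean 1 and standard deviation 1, a discount `GAMMA`, and a next-step Q estimate that is deliberately off by `Q_ERROR`. The script samples MC returns and one-step TD (SARSA-style) targets for the first step and reports the empirical bias and variance of each.

```python
import random
import statistics

GAMMA = 0.9    # discount factor (assumed for this toy example)
HORIZON = 10   # episode length in steps (assumed)
Q_ERROR = 0.5  # deliberate error in the next-step Q estimate (assumed)


def true_q(steps_left: int) -> float:
    """Exact Q-value when every reward has mean 1: sum of GAMMA^k for k < steps_left."""
    return sum(GAMMA ** k for k in range(steps_left))


def sample_targets(n_samples: int, seed: int = 0):
    """Draw n_samples MC returns and TD targets for the first step of the toy chain."""
    rng = random.Random(seed)
    # Imperfect estimate standing in for Q(s_{t+1}, a_{t+1}).
    q_next_estimate = true_q(HORIZON - 1) + Q_ERROR
    mc_targets, td_targets = [], []
    for _ in range(n_samples):
        rewards = [rng.gauss(1.0, 1.0) for _ in range(HORIZON)]
        # MC target: the full discounted return G_t of one sampled trajectory.
        mc_targets.append(sum(GAMMA ** k * r for k, r in enumerate(rewards)))
        # TD target: one sampled reward plus the discounted bootstrapped estimate.
        td_targets.append(rewards[0] + GAMMA * q_next_estimate)
    return mc_targets, td_targets


mc, td = sample_targets(20_000)
q_true = true_q(HORIZON)
mc_bias = statistics.fmean(mc) - q_true
td_bias = statistics.fmean(td) - q_true
mc_var = statistics.variance(mc)
td_var = statistics.variance(td)
print(f"MC: bias={mc_bias:+.3f}, variance={mc_var:.3f}")
print(f"TD: bias={td_bias:+.3f}, variance={td_var:.3f}")
```

In this setup the expected TD bias is `GAMMA * Q_ERROR` (0.45), while the MC bias should hover near zero. The MC variance accumulates the noise of all ten rewards (analytically the sum of `GAMMA ** (2 * k)`, about 4.6), whereas the TD target carries the noise of a single reward (variance 1). Setting `Q_ERROR = 0` removes the TD bias, which mirrors the point that the bias comes entirely from the imperfect next-step estimate.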