3.4 MC vs TD

Tommy Kim · September 14, 2023

Temporal difference (TD) learning has a problem.

Let’s review SARSA (derived from the Bellman equation) and Monte Carlo (MC).
SARSA:

$$
\begin{aligned}
Q(s_t,a_t) &= \int\limits_{s_{t+1}:a_\infty} G_t \, p(s_{t+1}, a_{t+1}, \ldots \mid s_t,a_t) \, ds_{t+1}{:}a_\infty \\
&= \int\limits_{s_{t+1},a_{t+1}} \left(R_t + \gamma Q(s_{t+1}, a_{t+1})\right) p(s_{t+1},a_{t+1} \mid s_t,a_t) \, ds_{t+1},a_{t+1} \\
&\approx \frac{1}{N} \sum\limits_{i=1}^N \left(R_t^{(i)} + \gamma Q(s_{t+1}^{(i)}, a_{t+1}^{(i)})\right)
\end{aligned}
$$

Monte Carlo:
$$
Q(s_t, a_t) \approx \frac{1}{N} \sum\limits_{i=1}^N G_t^{(i)}
$$

The problem is that the samples used in SARSA ($Q(s_t, a_t)$, $Q(s_{t+1}, a_{t+1})$, $\ldots$) are not perfect. This is because the two methods sample differently. Monte Carlo samples complete returns $G_t$, covering more and more of the possible paths as $N$ increases, whereas SARSA samples only one step and relies on the Q-value of the next step. That Q-value is itself only an estimate of the average (expectation) over everything up to $a_\infty$, so it is not a perfect sample. Because the method keeps updating with next-step values that are themselves imperfect estimates, it introduces bias.
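To make the source of the bias concrete, here is a minimal Python sketch under stated assumptions. The toy chain (a fixed 5-step episode with a single action and Gaussian rewards of mean 1), the discount `GAMMA = 0.9`, and the all-zero table `q_est` are hypothetical choices for illustration, not part of the derivation above. The sketch averages the Monte Carlo target $G_t$ and the SARSA target $R_t + \gamma Q(s_{t+1}, a_{t+1})$ over many sampled episodes and compares both with the exact value.

```python
import random

GAMMA = 0.9    # discount factor (assumed value)
HORIZON = 5    # episode length of the toy chain (hypothetical)

def sample_rewards(rng):
    """One episode of the toy chain: every step pays a noisy reward with mean 1."""
    return [rng.gauss(1.0, 1.0) for _ in range(HORIZON)]

def true_q(t):
    """Exact Q at step t: discounted sum of the mean rewards still to come."""
    return sum(GAMMA ** k for k in range(HORIZON - t))

def mc_target(rewards):
    """Monte Carlo target: the full sampled return G_0."""
    return sum(GAMMA ** k * r for k, r in enumerate(rewards))

def td_target(rewards, q_est):
    """SARSA/TD target: one sampled reward plus the current estimate of the next Q."""
    return rewards[0] + GAMMA * q_est[1]

rng = random.Random(0)
q_est = [0.0] * (HORIZON + 1)   # imperfect estimate: an untrained, all-zero table
episodes = [sample_rewards(rng) for _ in range(20000)]

mc_mean = sum(mc_target(ep) for ep in episodes) / len(episodes)
td_mean = sum(td_target(ep, q_est) for ep in episodes) / len(episodes)

print(f"true Q(s_0)    = {true_q(0):.3f}")
print(f"mean MC target = {mc_mean:.3f}")
print(f"mean TD target = {td_mean:.3f}")
```

Because `q_est` is all zeros here, the TD target averages to about 1 no matter how many episodes are sampled, while the MC average approaches the true value of about 4.1. The gap closes only as `q_est` itself improves.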

MC vs TD

MC

  1. unbiased
  2. high variance ($\because$ each sample is an entire path, and there are many possible paths)

TD

  1. biased
  2. low variance ($\because$ it only looks at the next step)
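The variance side of this comparison can be checked numerically with a small Python sketch. Everything specific in it is an assumption made for illustration: a toy chain with a fixed 20-step episode, Gaussian rewards with mean 1 and standard deviation 1, `GAMMA = 0.9`, and a fixed next-step estimate `Q_NEXT_EST` (hypothetical; its value shifts the TD target but does not change its variance).

```python
import random
import statistics

GAMMA = 0.9        # discount factor (assumed value)
HORIZON = 20       # episode length of the toy chain (hypothetical)
Q_NEXT_EST = 0.0   # fixed estimate of Q(s_{t+1}, a_{t+1}) (hypothetical)

rng = random.Random(1)
mc_targets, td_targets = [], []

for _ in range(5000):
    # One sampled episode: a noisy reward at every step.
    rewards = [rng.gauss(1.0, 1.0) for _ in range(HORIZON)]
    # MC target: every sampled reward along the path contributes noise.
    mc_targets.append(sum(GAMMA ** k * r for k, r in enumerate(rewards)))
    # TD target: only the first reward is random; the rest is a fixed estimate.
    td_targets.append(rewards[0] + GAMMA * Q_NEXT_EST)

mc_var = statistics.variance(mc_targets)
td_var = statistics.variance(td_targets)

print(f"variance of MC targets = {mc_var:.3f}")
print(f"variance of TD targets = {td_var:.3f}")
```

In this toy setup the MC variance should come out several times larger than the TD variance, since the MC target sums 20 noisy rewards while the TD target contains only one.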