RL Course by David Silver - Lecture 4: Model-Free Prediction

HO SEUNG YOON · April 29, 2024


Monte-Carlo Learning

  • Caveat: Monte-Carlo learning only works for episodic MDPs — every episode must terminate so that the return G_t can actually be computed.
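
As a quick illustration (a minimal sketch, not from the lecture slides), first-visit Monte-Carlo prediction can be written as follows; the episode format `[(S_t, R_{t+1}), ...]` is an assumption made for this example.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """First-visit Monte-Carlo policy evaluation: V(s) = mean of first-visit returns."""
    returns_sum = defaultdict(float)   # sum of returns observed for each state
    returns_count = defaultdict(int)   # number of first visits to each state
    V = defaultdict(float)

    for episode in episodes:           # episode = [(S_t, R_{t+1}), ...], one terminated episode
        G = 0.0
        # Walk backwards so the return G_t = R_{t+1} + gamma * G_{t+1} accumulates incrementally.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            # First-visit MC: only use the return from the first time `state` appears in the episode.
            if state not in (s for s, _ in episode[:t]):
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```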

Temporal-Difference Learning

  • TD target: $R_{t+1} + \gamma V(S_{t+1})$ — an estimated return used in place of the actual return $G_t$.
  • Driving example: you almost crash into another car but don't. Monte-Carlo does not update the value of that situation, because no crash actually occurred; TD updates immediately toward the (alarming) estimate of the next state. A small TD(0) sketch follows this list.
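
A minimal tabular TD(0) prediction sketch, assuming a hypothetical `env` with `reset()`/`step()` and a `policy(state)` function (these interfaces are assumptions for the example, not from the lecture):

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation: update V(S_t) toward R_{t+1} + gamma * V(S_{t+1})."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # TD target bootstraps from the current estimate of the next state's value.
            td_target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (td_target - V[state])   # immediate, online update
            state = next_state
    return V
```

Note how the update happens inside the episode loop: TD learns online after every step, whereas the Monte-Carlo sketch above has to wait for a complete episode.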

  • With Monte-Carlo learning you update toward the actual outcome, so you must wait until the episode ends before updating.
  • With TD learning you update immediately, toward the estimated return after one step.
  • There are no actions to choose here; we are only doing prediction (policy evaluation), estimating $v_\pi$ for a fixed policy.

  • During TD learning all of your guesses progressively become better, and that information backs up through earlier states, so that eventually you arrive at the correct value function.

  • The TD target has much lower variance than the return, because it only contains the noise from a single step (one reward and one transition).
  • The next state you observe is noisy, but the value function $V(S_{t+1})$ you bootstrap from summarizes everything that can happen from there, so the noise of the rest of the trajectory is averaged out.

  • Monte-Carlo does not bootstrap from another estimate; TD does.
  • What is bootstrapping? Updating a value estimate toward a target that itself contains an estimate — in TD(0) the target $R_{t+1} + \gamma V(S_{t+1})$ uses the current guess $V(S_{t+1})$ instead of the real return.
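
Writing the two tabular updates side by side makes the difference concrete (these are the standard update rules from the lecture):

$V(S_t) \leftarrow V(S_t) + \alpha\,(G_t - V(S_t))$ — Monte-Carlo: update toward the actual return.

$V(S_t) \leftarrow V(S_t) + \alpha\,(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$ — TD(0): update toward the bootstrapped TD target.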

  • With appropriately chosen step sizes, TD continues to do better than Monte-Carlo.
  • Effect of bootstrapping: as in the blog post linked above, if you choose the step size badly — for example too large — the estimate oscillates around the true value function and is not guaranteed to converge all the way to it.
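
As a side note not spelled out in the bullets above (a standard stochastic-approximation result): a decaying step-size schedule $\alpha_t$ is sufficient for convergence if it satisfies the Robbins-Monro conditions

$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$

e.g. $\alpha_t = 1/t$. A constant step size does not satisfy these conditions, which is one reason the estimate can keep oscillating around the true value.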

TD($\lambda$)
