how can we guarantee that we find the best possible policy
we have to balance two different factors: exploration and exploitation
how can we actually gain efficiency by bootstrapping
"내일 날씨가 흐릴것 같으니, 모레는 아마 비가오겠네"
always use the most recent value function to pick your action -> increase the frequency of our policy improvement so that we improve our policy at every single time step
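concretely, the standard one-step Sarsa update applies this at every single time step (textbook form):

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)

then act (e.g. ε-greedily) with respect to the updated Q when choosing the next action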
go all the way until the end of the episode and never bootstrap from the value function -> Monte Carlo (n = ∞)
target: the n-step return
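the n-step return (standard definition), which is the TD target for n = 1 and the full MC return for n = ∞:

q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n})

n-step Sarsa updates towards it: Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{(n)} - Q(S_t, A_t) \right)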
Q is the expected total reward you get if you start in a state, take an action, and then follow the policy for all subsequent actions
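in symbols (standard definition of the action-value function):

q_\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right], \quad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots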
weighted average (of the n-step returns)
this builds a spectrum between MC and TD, giving a variant of Sarsa (Sarsa(λ)) that can look all the way out into the future
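the standard way to form that weighted average is the λ-return, which forward-view Sarsa(λ) uses as its target:

q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{\lambda} - Q(S_t, A_t) \right)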
behavior policy
Big issue in reinforcement learning: off-policy learning
I want an exploratory policy that wanders around as much as I like, while at the same time learning the optimal policy!
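formally (standard off-policy setup): evaluate the target policy \pi(a \mid s) to compute v_\pi(s) or q_\pi(s, a), while the trajectory \{ S_1, A_1, R_2, \dots, S_T \} is generated by following the behavior policy \mu(a \mid s)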
Q-Learning
specific to Sarsa(0)
we're going to make use of Q-values (action values) to help us do off-policy learning in an efficient way that doesn't require importance sampling
the next action we actually take is sampled from the behavior policy
the alternative successor action used in the update target is sampled from the target policy
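putting those two together, the standard Q-learning target and update are:

R_{t+1} + \gamma Q(S_{t+1}, A'), \quad A' \sim \pi(\cdot \mid S_{t+1})

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t) \right)

when the target policy \pi is greedy with respect to Q, the bootstrap term becomes the familiar \max_{a'} Q(S_{t+1}, a')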
consider the real behavior you actually took: the update is still applied to the (state, action) pair actually visited
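a minimal sketch of tabular Q-learning (ε-greedy behavior policy, greedy target policy); it assumes a Gymnasium-style environment with discrete state and action spaces, and the function name and hyperparameters are illustrative, not from the lecture:

import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # minimal sketch: tabular Q-learning, assuming a Gymnasium-style discrete env
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # behavior policy: epsilon-greedy with respect to the current Q
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # target policy: greedy, so bootstrap from max_a' Q(s_next, a')
            bootstrap = 0.0 if terminated else gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (r + bootstrap - Q[s, a])
            s = s_next
    return Q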