What if the action space is vast?
Disadvantage of value-based methods:
- continuous action spaces: greedy maximization over actions becomes hard, which motivates policy-based methods
In a one-step MDP: we take an action from a given state, compute the gradient of the log policy, and multiply it by the immediate reward that we actually experienced.
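As a sketch, the score-function form of that one-step gradient (standard policy-gradient notation assumed):

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\, r\big]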
How about a multi-step MDP?
Replace the immediate reward with the action-value function (the long-term reward).
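This gives the policy gradient theorem (sketch, with Q^{\pi_\theta} standing in for the long-term reward):

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\big]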
This holds whatever the policy objective function is.
If this were supervised learning, there would be no value function; you would just adjust the policy in the direction of what the teacher tells you.
Value-based RL is usually more jagged/unstable, whereas policy-based RL is smoother,
but naive (Monte-Carlo) policy gradient is slow and has very high variance.
The rest of the lecture is about learning more efficient ways to do this.
Start off with an arbitrary policy, evaluate that policy using the critic, and then, instead of greedy policy improvement, take a gradient step in a direction that gives a better policy.
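A minimal sketch of that actor-critic loop (a QAC-style actor-critic with a softmax actor and a TD(0)/SARSA critic; the 2-state toy MDP, dynamics, and step sizes below are made up purely for illustration):

```python
import numpy as np

# Hypothetical 2-state, 2-action toy MDP, only for illustration.
N_STATES, N_ACTIONS = 2, 2

def step(s, a):
    # Made-up dynamics: action 0 leads to state 0 and pays reward 1 from state 0.
    r = 1.0 if (s == 0 and a == 0) else 0.0
    s_next = 0 if a == 0 else 1
    return r, s_next

theta = np.zeros((N_STATES, N_ACTIONS))  # actor: softmax policy parameters
Q = np.zeros((N_STATES, N_ACTIONS))      # critic: action-value estimates

def policy(s):
    prefs = theta[s] - theta[s].max()    # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

alpha_actor, alpha_critic, gamma = 0.05, 0.1, 0.9
rng = np.random.default_rng(0)

s = 0
a = rng.choice(N_ACTIONS, p=policy(s))
for _ in range(5000):
    r, s_next = step(s, a)
    a_next = rng.choice(N_ACTIONS, p=policy(s_next))

    # Critic: evaluate the current policy with a TD(0) (SARSA-style) update.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha_critic * (td_target - Q[s, a])

    # Actor: gradient step using grad log pi(a|s) = one_hot(a) - pi(.|s)
    # for a softmax policy, scaled by the critic's estimate Q(s, a).
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * grad_log_pi * Q[s, a]

    s, a = s_next, a_next

print("policy in state 0:", policy(0))  # should put most probability on action 0
```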
Parameterize the value function → value-based RL.
Parameterize the policy directly → policy-based RL.
Update the parameters so that the policy selects actions that obtain a lot of reward.
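A sketch of that update rule (REINFORCE-style, with v_t the sampled return standing in for the long-term reward and α an assumed step size):

\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t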