David Silver YouTube lectures
- Markov Reward Process (MRP)




- Markov Decision Process (MDP)
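A minimal sketch of an MDP's ingredients in code, just to fix a representation for the sketches further down; the state/action names and numbers are made up, not from the lecture.

```python
# Toy MDP: states S, actions A, transition probabilities P, rewards R, discount gamma.
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]
GAMMA = 0.9

# P[(s, a)] -> list of (next_state, probability)
P = {
    ("s0", "left"):  [("s0", 1.0)],
    ("s0", "right"): [("s0", 0.2), ("s1", 0.8)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("s1", 1.0)],
}

# R[(s, a)] -> expected immediate reward for taking action a in state s
R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 0.0, ("s1", "right"): 2.0,
}

# A stochastic policy: pi[s][a] = probability of taking a in s
pi = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.5, "right": 0.5},
}
```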



- the state s fully characterizes your future rewards
- so we don't care about past rewards, because they have already been consumed
- what we want is to maximize the reward from now on
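A tiny sketch of the quantity being maximized: the discounted return from time t onward, G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... Past rewards do not appear in it at all.

```python
def discounted_return(future_rewards, gamma=0.9):
    """Sum of future rewards, each discounted by how far ahead it arrives."""
    return sum((gamma ** k) * r for k, r in enumerate(future_rewards))

# Example: rewards of 1, 0, 2 over the next three steps.
print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```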


- how good is it to be in state s if I follow policy π?
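"How good" is made precise as vπ(s) = Eπ[G_t | S_t = s], the expected return starting from s and then following π. A rough sketch of estimating that expectation by averaging sampled returns; sample_return is a hypothetical helper, not something from the lecture.

```python
def estimate_v(s, pi, sample_return, n_episodes=1000):
    """Average the discounted returns of episodes started from s under policy pi.
    sample_return(s, pi) is a hypothetical helper that rolls out one episode
    from s under pi and returns its discounted return."""
    returns = [sample_return(s, pi) for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```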


- the probability of going from a white node (state) to a black node (action) is defined by the policy
- qπ is the q-value (action-value)

- vπ(s′) is the state-value function of the successor state; qπ is the action-value function
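In code, the white-to-black step weighted by the policy looks roughly like this: the state value is the policy-weighted average of the action values, vπ(s) = Σ_a π(a|s) qπ(s,a). This assumes the dict-style pi and a q table keyed by (state, action) as in the toy sketch above.

```python
def v_from_q(s, pi, q):
    """v_pi(s) = sum over a of pi(a|s) * q_pi(s, a): policy-weighted average of action values."""
    return sum(prob * q[(s, a)] for a, prob in pi[s].items())
```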

- put the two together
- s is expressed relative to s′
- a recursive relationship (the Bellman expectation equation)

- you can do it for action values as well: a is relative to a′

- the part beneath the black dot shows the environment's transition process
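Roughly, the two one-step lookaheads combine like this (a sketch over the toy representation above): beneath the black dot the environment picks s′ with probability P(s′|s,a), giving qπ(s,a) = R(s,a) + γ Σ_s′ P(s′|s,a) vπ(s′); substituting that into the policy-weighted average gives vπ(s) in terms of vπ(s′), the recursive relationship.

```python
def q_from_v(s, a, P, R, v, gamma=0.9):
    """q_pi(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * v_pi(s') -- the step beneath the black dot."""
    return R[(s, a)] + gamma * sum(p * v[s_next] for s_next, p in P[(s, a)])

def bellman_expectation_backup(s, pi, P, R, v, gamma=0.9):
    """v_pi(s) = sum_a pi(a|s) * q_pi(s,a): state value written in terms of successor state values."""
    return sum(prob * q_from_v(s, a, P, R, v, gamma) for a, prob in pi[s].items())
```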






- once we have q∗ we can solve the MDP, but how do we arrive at q∗? figuring out q∗ is the question
- Bellman optimality equation
- look at the value of each action you can take and pick the max of them
- v∗(s) = max_a q∗(s,a)


- look at the optimal value we end up with, and back these values all the way up to v∗(s)
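As a sketch (same toy representation as above), one optimality backup "backs up" successor values into v∗(s) by taking the max over actions of the one-step lookahead, i.e. v∗(s) = max_a [ R(s,a) + γ Σ_s′ P(s′|s,a) v∗(s′) ].

```python
def bellman_optimality_backup(s, actions, P, R, v, gamma=0.9):
    """v*(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * v*(s') ]."""
    return max(
        R[(s, a)] + gamma * sum(p * v[s_next] for s_next, p in P[(s, a)])
        for a in actions
    )
```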



- dynamic programming methods solve these recursive equations
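A minimal dynamic-programming sketch (value iteration, one way to solve the recursive optimality equation; the toy dict representation above is assumed): keep applying the optimality backup to every state until the values stop changing.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Solve v*(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) * v*(s')] by repeated backups."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Example with the toy MDP sketched near the top:
# v_star = value_iteration(STATES, ACTIONS, P, R, GAMMA)
```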