David Silver's YouTube lectures
- Markov Reward Process (MRP)
- Markov Decision Process (MDP)
- state s fully characterizes your future rewards
- so we don't care about past rewards because they're already consumed
- what we want is to maximize the reward from now on (the return)
- how good is it to be in state s if I follow policy π
- the white-to-black transition probability (state to action in the diagram) is defined by the policy
qπ is the q-value (the action value)
- vπ(s′) is the state value function
qπ(s,a) is the action value function (definitions sketched below)
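As a quick reference, a sketch of the standard definitions behind these bullets, in my own notation (G_t is the return from time t, γ the discount factor, E_π the expectation under policy π):

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + …
vπ(s) = E_π[ G_t | S_t = s ]
qπ(s,a) = E_π[ G_t | S_t = s, A_t = a ]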
- put the two together
- s is defined relative to s′
- recursive relationship (the Bellman expectation equation; see the sketch below)
- you can do it for action values as well: a is relative to a′
- beneath the black dot (the action), the diagram shows the environment's transition process
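Putting the two together gives the Bellman expectation equations, roughly as I understand them (P(s′|s,a) is the transition probability and R(s,a) the expected immediate reward; these names are my own shorthand):

vπ(s) = Σ_a π(a|s) · qπ(s,a)
qπ(s,a) = R(s,a) + γ · Σ_s′ P(s′|s,a) · vπ(s′)

Substituting one into the other gives vπ(s) in terms of vπ(s′), and likewise qπ(s,a) in terms of qπ(s′,a′), which is the recursive relationship above.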
- once we have q we can solve the MDP, but how do we arrive at q, i.e. figure out q∗?
- Bellman optimality equation
- look at the value of each action you can take and pick the max of them.
- v∗(s) = max_a q∗(s,a)
- look at the optimal value we end up with and back these values all the way up to v∗(s) (see below)
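To complete the pair (my own note, same notation as above): the optimal action value backs up through the optimal state values of the successor states,

q∗(s,a) = R(s,a) + γ · Σ_s′ P(s′|s,a) · v∗(s′)

and substituting this into v∗(s) = max_a q∗(s,a) gives the Bellman optimality equation in v∗ alone.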
- dynamic programming methods will solve these recursive equations (a value iteration sketch follows)
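A minimal value-iteration sketch of how dynamic programming can solve these equations, assuming a tiny tabular MDP with known transitions; the two states, the action names, and all numbers below are made up for illustration, not taken from the lecture:

# Minimal value iteration on a tiny, made-up tabular MDP.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.9   # discount factor
theta = 1e-8  # convergence threshold

V = {s: 0.0 for s in P}  # initial guess for v*(s)
while True:
    delta = 0.0
    for s in P:
        # q*(s, a) = sum over s' of p * (r + gamma * v*(s')) for each action a
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]}
        new_v = max(q.values())  # Bellman optimality backup: v*(s) = max_a q*(s, a)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:  # values stopped changing
        break

# read the greedy policy off the converged values
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(V, policy)

Each sweep applies the backup v(s) ← max_a Σ_s′ p·(r + γ·v(s′)) to every state; once the values converge, the greedy policy is read off the resulting v∗.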