With the 11 states given, it is enough to solve a system of 11 linear equations in 11 unknowns (Bellman's equations for a fixed policy are linear in the values $V^\pi(s)$).
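A minimal sketch of that linear solve, assuming numpy arrays `P` of shape (S, A, S) for the transition probabilities $P_{sa}(s')$, `R` of shape (S,) for the rewards, a discount `gamma`, and a fixed policy `pi` (these names and conventions are my own, not from the lecture):

```python
import numpy as np

def evaluate_policy_exact(P, R, gamma, pi):
    """Solve Bellman's equations for a fixed policy:
    V = R + gamma * P_pi V  <=>  (I - gamma * P_pi) V = R."""
    S = R.shape[0]
    # Transition matrix under the fixed policy: P_pi[s, s'] = P[s, pi[s], s']
    P_pi = P[np.arange(S), pi]
    # S linear equations in S unknowns (11 of each in the gridworld example).
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R)
```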
$V^*$ .. "the optimal value function"
$$V^*(s) = \max_\pi V^\pi(s).$$
$\pi^*$ .. "optimal policy"
$$\pi^*(s) = \arg\max_a \sum_{s'} P_{sa}(s')\, V^*(s').$$
Practice with confusing notation.
$$V^*(s) = V^{\pi^*}(s) \ge V^\pi(s).$$
Strategy:
1) Find $V^*$.
2) Use the argmax equation to find $\pi^*$ (a one-line sketch follows below).
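A sketch of step 2, under the same assumed `P` / `V` array conventions as above:

```python
import numpy as np

def greedy_policy(P, V):
    """pi*(s) = argmax_a sum_{s'} P_sa(s') V*(s')."""
    # P @ V has shape (S, A): expected next-state value for each (s, a) pair.
    return np.argmax(P @ V, axis=1)
```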
Value iteration:
Initialize V(s):=0 for every s.
For every s, update:
$$V(s) := R(s) + \max_a \gamma \sum_{s'} P_{sa}(s')\, V(s').$$
$V(s)$: new estimate
$V(s')$: old estimate
E.g.,
$$\begin{bmatrix} V((1,1)) \\ V((1,2)) \\ \vdots \\ V((4,3)) \end{bmatrix} \in \mathbb{R}^{11}.$$
Andrew Ng comment: Value iteration works fine with either synchronous or asynchronous updates, but most people use the synchronous update because it vectorizes better and you can use more efficient matrix operations. (The algorithm will work in either case.)
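A sketch of value iteration with the synchronous, vectorized update, under the same assumed array conventions as above (`n_iters` and `tol` are arbitrary choices of mine, not from the lecture):

```python
import numpy as np

def value_iteration(P, R, gamma, n_iters=10_000, tol=1e-8):
    S, A, _ = P.shape
    V = np.zeros(S)                          # V(s) := 0 for every s
    for _ in range(n_iters):
        # Q[s, a] = R(s) + gamma * sum_{s'} P_sa(s') V_old(s')
        Q = R[:, None] + gamma * (P @ V)
        V_new = Q.max(axis=1)                # V(s) := max over actions
        if np.max(np.abs(V_new - V)) < tol:  # stop once the sup-norm change is tiny
            V = V_new
            break
        V = V_new
    return V
```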
Q. How do you represent the absorbing state, or the sink state?
A. In this framework, one way to code that up would be to set the transition probabilities from that state to every other state to 0. Another way: instead of an 11-state MDP, create a 12th state that always transitions back to itself with no further rewards.
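For concreteness, the 12-state encoding from the answer might look like this (a sketch; the index `sink = 11` and the array names are assumptions of mine):

```python
import numpy as np

S, A = 12, 4                  # 11 gridworld states plus one extra sink state
P = np.zeros((S, A, S))
R = np.zeros(S)

sink = 11
for a in range(A):
    P[sink, a, sink] = 1.0    # the sink always goes back to itself...
R[sink] = 0.0                 # ...with no further reward
# The +1 / -1 terminal squares would then transition into `sink` for every action.
```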
MyQ. A state where the agent never moves vs. a state the agent leaves and then comes back to .. how can they be the same mathematically?
Bellman backup operator
$V := B(V)$
Exercise: Show that value iteration causes $V$ to converge to $V^*$.
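One standard way to do this exercise (a sketch, not necessarily the argument given in lecture): show that $B$ is a $\gamma$-contraction in the sup norm, so repeated application converges to the unique fixed point, which is $V^*$.
$$(BV)(s) = R(s) + \gamma \max_a \sum_{s'} P_{sa}(s')\, V(s'),$$
$$\|BV_1 - BV_2\|_\infty \le \gamma \max_{s,a} \sum_{s'} P_{sa}(s')\, |V_1(s') - V_2(s')| \le \gamma \|V_1 - V_2\|_\infty.$$
Since $V^*$ satisfies Bellman's equation, $BV^* = V^*$, so $\|B^k V - V^*\|_\infty \le \gamma^k \|V - V^*\|_\infty \to 0$.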
A. Sure, yep. So in what we’ve discussed so far, yes. But what we’ll see on Wednesday is how to generalize this framework.
Policy iteration:
Initialize π randomly.
Repeat until convergence:
Set $V := V^\pi$ (i.e., solve Bellman's equations to get $V^\pi$).
Set $\pi(s) := \arg\max_a \sum_{s'} P_{sa}(s')\, V(s')$.
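Putting both steps into code, as a sketch under the same assumed array conventions (exact policy evaluation via a linear solve, then a greedy improvement step):

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iters=1_000):
    S, A, _ = P.shape
    pi = np.random.randint(A, size=S)            # initialize pi randomly
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = R, i.e. V := V^pi.
        P_pi = P[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R)
        # Policy improvement: pi(s) := argmax_a sum_{s'} P_sa(s') V(s').
        new_pi = np.argmax(P @ V, axis=1)
        if np.array_equal(new_pi, pi):           # converged: the policy is stable
            break
        pi = new_pi
    return pi, V
```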
Q. What if we don’t know Psa?
$$P_{sa}(s') = \frac{\#\text{ times took action } a \text{ in state } s \text{ and got to } s'}{\#\text{ times took action } a \text{ in state } s} \quad \left(\text{or } \tfrac{1}{|S|} \text{ if the above is } \tfrac{0}{0}\right).$$
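A sketch of that maximum-likelihood estimate, assuming a `counts` array where `counts[s, a, s_next]` records how often action `a` in state `s` led to `s_next` (names are mine):

```python
import numpy as np

def estimate_transitions(counts, n_states):
    totals = counts.sum(axis=2, keepdims=True)        # times (s, a) was tried
    return np.where(totals > 0,
                    counts / np.maximum(totals, 1),   # observed frequencies
                    1.0 / n_states)                   # 0/0 case: uniform 1/|S|
```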
Putting it together:
Repeat: {
Take actions w.r.t. $\pi$ to get experience in the MDP.
Update the estimates of $P_{sa}$ (and possibly $R$).
Solve Bellman's eqn. using value iteration to get $V$.
Update $\pi(s) := \arg\max_a \sum_{s'} P_{sa}(s')\, V(s')$.
}
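A sketch of the whole loop, reusing the helpers above and assuming a hypothetical environment object with `reset() -> s` and `step(a) -> (s_next, r, done)` (that interface is my assumption, not something from the lecture):

```python
import numpy as np

def model_based_rl(env, S, A, gamma, n_episodes=100, max_steps=200):
    counts = np.zeros((S, A, S))
    R_sum, R_cnt = np.zeros(S), np.zeros(S)
    pi = np.random.randint(A, size=S)                  # initial (random) policy
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(max_steps):
            a = pi[s]                                  # take actions w.r.t. pi
            s_next, r, done = env.step(a)
            counts[s, a, s_next] += 1                  # update estimate of P_sa
            R_sum[s] += r; R_cnt[s] += 1               # ...and of R, if unknown
            s = s_next
            if done:
                break
        P_hat = estimate_transitions(counts, S)
        R_hat = R_sum / np.maximum(R_cnt, 1)
        V = value_iteration(P_hat, R_hat, gamma)       # solve Bellman's eqn.
        pi = greedy_policy(P_hat, V)                   # update pi
    return pi
```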
Andrew Ng comment: Usually the reward function is given, but you sometimes see an unknown reward function.
E.g., if you're building a stock-trading application, the reward might be the return on a given day; it may not be a deterministic function of the state and may be a little bit random.
Andrew Ng comment: This algorithm will work okay for some problems, but there's one other issue it cannot solve: the exploration problem.
Exploration vs. Exploitation trade-off
Q. How aggressively, or how greedily, should you take actions just to maximize your rewards?
The algorithm we described is relatively "greedy": it always acts according to your current best estimates of the state transition probabilities and rewards.
Q. Should you keep ϵ constant, or should you use a dynamic ϵ?
A. There are many heuristics for how to explore. One reasonable way is to start with a large value of ϵ and slowly shrink it. Another option is Boltzmann exploration.
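A sketch of ϵ-greedy action selection with a decaying schedule, plus Boltzmann exploration over action-value estimates; the schedule constants are arbitrary placeholders of mine:

```python
import numpy as np

def epsilon_greedy_action(pi, s, n_actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit pi."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return pi[s]

def decayed_epsilon(episode, eps_start=1.0, eps_end=0.05, decay=0.995):
    """Start with a large epsilon and slowly shrink it."""
    return max(eps_end, eps_start * decay ** episode)

def boltzmann_action(Q_s, temperature=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    probs = np.exp((Q_s - Q_s.max()) / temperature)    # subtract max for stability
    probs /= probs.sum()
    return np.random.choice(len(Q_s), p=probs)
```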
Q. Can you get a reward for reaching states you’ve never seen before?
A. Yes, there is a fascinating line of research on this called "intrinsic reinforcement learning." It really started with search indexing. You can google "intrinsic motivation."
Q. How many actions should you take with respect to π before updating π?
A. There's no harm in doing it as frequently as possible.