Lecture 17. MDPs & Value/Policy Iteration

CS229: Machine Learning

Lecture video link: https://youtu.be/d5gaWTo6kDM

Small talk

  • NASA’s InSight Mars Lander to arrive at Mars
  • About 20 light-minutes away from Earth → you can't actually control it in real time.

Outline

  • MDPs (Markov Decision Processes) (recap)
  • Value function
  • Value iteration / Policy iteration
  • Learning P_{sa} / putting it together

MDP: (S,A,\{P_{sa}\},\gamma,R)

S: set of states

A: set of actions, e.g., East, West, South, and North

\{P_{sa}\}: state transition probabilities

\gamma: discount factor, usually slightly less than 1.

R: reward function that helps us specify where we want the robot to end up.

E.g.,

s_0

a_0=\pi(s_0)

s_1\sim P_{s_0a_0}

a_1=\pi(s_1)

s_2\sim P_{s_1a_1}

Total payoff: R(s_0)+\gamma R(s_1)+\gamma^2R(s_2)+\cdots
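
As a concrete illustration, here is a minimal sketch of sampling such a trajectory and accumulating the discounted total payoff, assuming a made-up tabular MDP (the arrays P, R, the policy pi, and all sizes below are hypothetical placeholders, not the lecture's gridworld):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 4 states, 2 actions (placeholders for illustration only).
n_states, n_actions = 4, 2
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)            # normalize so each P[s, a] is a distribution over s'
R = rng.random(n_states)                     # reward R(s) for each state
gamma = 0.99                                 # discount factor, slightly less than 1
pi = rng.integers(n_actions, size=n_states)  # some fixed policy pi(s)

s = 0                                        # s_0
total_payoff, discount = 0.0, 1.0
for t in range(100):                         # truncate the infinite sum after 100 steps
    total_payoff += discount * R[s]          # adds gamma^t * R(s_t)
    a = pi[s]                                # a_t = pi(s_t)
    s = rng.choice(n_states, p=P[s, a])      # s_{t+1} ~ P_{s_t a_t}
    discount *= gamma

print(total_payoff)                          # one sample of R(s_0) + gamma R(s_1) + ...
```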

Policy \pi: S\longmapsto A

.. The goal is to find the policy that maximizes the expected value of the total payoff.

E.g.,

(Source: https://youtu.be/d5gaWTo6kDM?t=295)

\pi((3,1))=(\leftarrow)\ (\text{west})

Q. How do you compute the optimal policy?

Define: V^\pi, V^*, \pi^*.

For a policy \pi, V^\pi:S\longmapsto\mathbb R is s.t. V^\pi(s) is the expected total payoff for starting in state s and executing \pi.

V^\pi(s)=\mathbb E[R(s_0)+\gamma R(s_1)+\gamma^2R(s_2)+\cdots|\pi,s_0=s].

V^\pi .. “Value function for policy \pi”

Bellman’s equation

V^\pi(s)=R(s)+\gamma\sum_{s'}P_{s\pi(s)}(s')V^\pi(s').

Meaning: your expected total payoff from a given state = the immediate reward you receive + the discount factor times the expected future payoff.

V^\pi(s)=\mathbb E[R(s_0)+\gamma\underbrace{\left(R(s_1)+\gamma R(s_2)+\cdots\right)}_{\sim V^\pi(s_1)}|\pi,s_0=s].

R(s_0): immediate reward

\gamma R(s_1)+\gamma^2 R(s_2)+\cdots: expected future reward

.. s maps to s_0 and s' maps to s_1.

V^\pi(s)=\mathbb E[R(s)+\gamma V^\pi(s')].

In state s, you will take action a=\pi(s):

s'\sim P_{s\pi(s)}.
V^\pi(s)=R(s)+\gamma\sum_{s'}P_{s\pi(s)}(s')V^\pi(s').

Given \pi, get a linear system of equations in terms of V^\pi(s).

E.g.,

V^\pi((3,1))=R((3,1))+\gamma\left[0.8V^\pi((3,2))+0.1V^\pi((2,1))+0.1V^\pi((4,1))\right].

With 11 states, this gives a system of 11 linear equations in 11 unknowns, which you can solve directly.
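
Since Bellman's equation is linear in the unknowns V^\pi(s), it can be solved with one linear solve. A minimal sketch along those lines, again with made-up placeholder arrays rather than the lecture's gridworld:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 11-state, 4-action MDP (placeholders, not the lecture's gridworld).
n_states, n_actions = 11, 4
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random(n_states)
gamma = 0.99
pi = rng.integers(n_actions, size=n_states)       # the policy being evaluated

# Bellman's equation in matrix form: V = R + gamma * P_pi V,
# where P_pi[s, s'] = P_{s, pi(s)}(s').
P_pi = P[np.arange(n_states), pi]                 # shape (11, 11)
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

print(V_pi)                                       # 11 unknowns V^pi(s), one per state
```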

V^* .. “the optimal value function”

V^*(s)=\max_\pi V^\pi(s).

\pi^* .. “optimal policy”

\pi^*(s)=\argmax_a\sum_{s'}P_{sa}(s')V^*(s').

Practice with confusing notation.

V^*(s)=V^{\pi^*}(s)\geq V^\pi(s).

Strategy:

1) Find V^*.

2) Use the argmax equation to find \pi^*.

Value iteration:

Initialize V(s):=0 for every s.

For every s, update:

V(s):=R(s)+\max\limits_a \gamma\sum\limits_{s'}P_{sa}(s')V(s').

V(s): new estimate

V(s'): old estimate

E.g.,

\begin{bmatrix}V((1,1))\\V((1,2))\\\vdots\\V((4,3))\end{bmatrix}\in\mathbb R^{11}.

Andrew Ng comment: Value iteration works fine with either synchronous or asynchronous updates, but most people use the synchronous update because it vectorizes better and you can use more efficient matrix operations. (The algorithm will work in either case.)
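
A minimal sketch of the synchronous, vectorized form of value iteration on a made-up placeholder MDP (P, R, gamma below are hypothetical); the last line also extracts \pi^* via the argmax equation from the strategy above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP (placeholders for the lecture's 11-state gridworld).
n_states, n_actions = 11, 4
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random(n_states)
gamma = 0.99

V = np.zeros(n_states)                       # initialize V(s) := 0 for every s
for _ in range(10_000):
    Q = P @ V                                # Q[s, a] = sum_{s'} P_{sa}(s') V(s'), all (s, a) at once
    V_new = R + gamma * Q.max(axis=1)        # synchronous update of every V(s)
    if np.max(np.abs(V_new - V)) < 1e-10:    # stop once the Bellman backup no longer changes V
        V = V_new
        break
    V = V_new

pi_star = (P @ V).argmax(axis=1)             # pi*(s) = argmax_a sum_{s'} P_{sa}(s') V*(s')
print(V, pi_star)
```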

Q. How do you represent the absorbing state, or the sink state?

A. In this framework, one way to code that up is to set the transition probabilities from that state to every other state to 0 (so it only transitions to itself). Another way: take the 11-state MDP, create a 12th state, and have that state always transition back to itself with no further reward.

MyQ. A state where no moving action is taken vs. a state where the actor moves somewhere and then comes back .. how can these be the same mathematically?

  • Bellman backup operator: V:=B(V)

Exercise: Show that value iteration causes V to converge to V^*.

(Source: https://youtu.be/d5gaWTo6kDM?t=2803)

Action W (\leftarrow):

\sum_{s'}P_{sa}(s')V^*(s')=0.8\times0.75+0.1\times0.69+0.1\times0.71=0.740.

Action N (\uparrow):

\sum_{s'}P_{sa}(s')V^*(s')=0.8\times0.69+0.1\times0.75+0.1\times0.49=0.676.

So going west gives the higher expected value, which is why the optimal policy picks \leftarrow in this state.

Q. Is the number of states always finite?

A. Sure, yep. In what we've discussed so far, yes. But what we'll see on Wednesday is how to generalize this framework.

Policy iteration:

Initialize \pi randomly.

Repeat until convergence:

Set V:=V^\pi (i.e., solve Bellman's equations to get V^\pi).

Set \pi(s):=\argmax\limits_a\sum\limits_{s'}P_{sa}(s')V(s').
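
A minimal sketch of policy iteration in this form, reusing the exact linear solve for V^\pi (all arrays below are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP (placeholders).
n_states, n_actions = 11, 4
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random(n_states)
gamma = 0.99

pi = rng.integers(n_actions, size=n_states)       # initialize pi randomly
while True:
    # Policy evaluation: set V := V^pi by solving Bellman's equations exactly.
    P_pi = P[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

    # Policy improvement: pi(s) := argmax_a sum_{s'} P_{sa}(s') V(s').
    pi_new = (P @ V).argmax(axis=1)
    if np.array_equal(pi_new, pi):                # converged: the greedy policy stops changing
        break
    pi = pi_new

print(pi, V)
```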

Q. What if we don't know P_{sa}?

A. Estimate it from experience:

\begin{aligned}P_{sa}(s')&=\frac{\text{number of times took action } a \text{ in state }s\text{ and got to }s'}{\text{number of times took action }a\text{ in state }s}\\&\left(\text{or }\frac{1}{|S|}\text{ if the above is }\frac{0}{0}.\right)\end{aligned}
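
A minimal sketch of this count-based estimate, assuming a hypothetical log of observed (s, a, s') transitions:

```python
import numpy as np

n_states, n_actions = 11, 4

# Hypothetical experience: a list of observed (s, a, s') transitions.
transitions = [(0, 1, 2), (0, 1, 2), (0, 1, 3), (5, 0, 5)]

counts = np.zeros((n_states, n_actions, n_states))
for s, a, s_next in transitions:
    counts[s, a, s_next] += 1                # times action a in state s led to s'

totals = counts.sum(axis=2, keepdims=True)   # times action a was taken in state s
P_hat = np.where(totals > 0,
                 counts / np.maximum(totals, 1),   # empirical fraction
                 1.0 / n_states)                   # 0/0 case: fall back to uniform 1/|S|

print(P_hat[0, 1])                           # e.g., estimated P_{s=0, a=1}(s')
```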

Putting it together:

Repeat: {

Take actions w.r.t. \pi to get experience in the MDP.

Update estimates of P_{sa} (and possibly R).

Solve Bellman's eqn. using value iteration to get V.

Update \pi(s):=\argmax\limits_a\sum\limits_{s'}P_{sa}(s')V(s').

}
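
A minimal sketch of this loop, using a hidden "true" transition model as a stand-in for the real environment (everything below is a made-up placeholder, and the reward is assumed known):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden "true" MDP standing in for the real environment (placeholder values).
n_states, n_actions = 11, 4
P_true = rng.random((n_states, n_actions, n_states))
P_true /= P_true.sum(axis=2, keepdims=True)
R = rng.random(n_states)                         # reward assumed known here
gamma = 0.99

counts = np.zeros((n_states, n_actions, n_states))
pi = rng.integers(n_actions, size=n_states)      # initial policy

for episode in range(50):
    # 1) Take actions w.r.t. pi to get experience in the MDP.
    s = rng.integers(n_states)
    for _ in range(100):
        a = pi[s]
        s_next = rng.choice(n_states, p=P_true[s, a])
        counts[s, a, s_next] += 1
        s = s_next

    # 2) Update the estimate of P_sa from the counts (uniform where unseen).
    totals = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

    # 3) Solve Bellman's equation with value iteration to get V.
    V = np.zeros(n_states)
    for _ in range(2000):
        V = R + gamma * (P_hat @ V).max(axis=1)

    # 4) Update pi(s) := argmax_a sum_{s'} P_hat_{sa}(s') V(s').
    pi = (P_hat @ V).argmax(axis=1)
```

Note that this loop follows \pi purely greedily, which is exactly the setting where the exploration problem discussed below arises.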

Andrew Ng comment: Usually the reward function is given, but you sometimes see an unknown reward function.

E.g., if you're building a stock trading application, the reward might be the return on a certain day; it may not be a deterministic function of the state and may be a little bit random.

Andrew Ng comment: This algorithm will work okay for some problems, but there's one other issue it cannot solve, which is the exploration problem.

Exploration vs. Exploitation trade-off

Q. How aggressively or how greedy should you be at just taking actions to maximize your rewards?

The algorithm we described is relatively “greedy,” meaning that it always acts according to your current best estimate of the state transition probabilities and rewards.

  • \epsilon-greedy → e.g., with probability 0.9 act w.r.t. \pi and with probability 0.1 act randomly (see the sketch below).
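
A minimal sketch of \epsilon-greedy action selection, with a hypothetical current policy pi and action count:

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4
epsilon = 0.1                                    # exploration probability
pi = rng.integers(n_actions, size=11)            # hypothetical current policy over 11 states

def epsilon_greedy_action(s):
    # With probability 1 - epsilon (e.g., 0.9) follow pi; otherwise pick a random action.
    if rng.random() < epsilon:
        return rng.integers(n_actions)
    return pi[s]

print(epsilon_greedy_action(3))
```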

Q. Should you always keep \epsilon constant, or should you use a dynamic \epsilon?

A. Yes, there are many heuristics for how to explore. One reasonable way: start with a large value of \epsilon and slowly shrink it. Or you can try Boltzmann exploration.

Q. Can you get a reward for reaching states you’ve never seen before?

A. Yes, there is a fascinating line of research called “intrinsic reinforcement learning.” It really started by search indexing. You can google “intrinsic motivation.”

Q. How many actions should you take with respect to π\pi before updating π\pi?

A. There's no harm in doing it as frequently as possible.
