Lecture 17. MDPs & Value/Policy Iteration

CS229: Machine Learning

Lecture video link: https://youtu.be/d5gaWTo6kDM

Small talk

  • NASA’s InSight Mars Lander to arrive at Mars
  • About 20 light-minutes away from Earth → you can't actually control it in real time.

Outline

  • MDPs (Markov Decision Processes) (recap)
  • Value function
  • Value iteration / Policy iteration
  • Learning P_{sa} / putting it together

MDP: (S,A,\{P_{sa}\},\gamma,R)

S: set of states

A: set of actions, e.g., East, West, South, and North

\{P_{sa}\}: state transition probabilities

\gamma: discount factor, usually slightly less than 1.

R: reward function that helps us specify where we want the robot to end up.

E.g.,

s_0

a_0=\pi(s_0)

s_1\sim P_{s_0a_0}

a_1=\pi(s_1)

s_2\sim P_{s_1a_1}

Total payoff: R(s_0)+\gamma R(s_1)+\gamma^2R(s_2)+\cdots
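
As a concrete illustration, here is a minimal sketch of sampling such a trajectory and accumulating the discounted total payoff, assuming a made-up tabular MDP (the arrays P, R, the policy pi, and all sizes below are hypothetical placeholders, not the lecture's gridworld):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 4 states, 2 actions (placeholders for illustration only).
n_states, n_actions = 4, 2
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)            # normalize so each P[s, a] is a distribution over s'
R = rng.random(n_states)                     # reward R(s) for each state
gamma = 0.99                                 # discount factor, slightly less than 1
pi = rng.integers(n_actions, size=n_states)  # some fixed policy pi(s)

s = 0                                        # s_0
total_payoff, discount = 0.0, 1.0
for t in range(100):                         # truncate the infinite sum after 100 steps
    total_payoff += discount * R[s]          # adds gamma^t * R(s_t)
    a = pi[s]                                # a_t = pi(s_t)
    s = rng.choice(n_states, p=P[s, a])      # s_{t+1} ~ P_{s_t a_t}
    discount *= gamma

print(total_payoff)                          # one sample of R(s_0) + gamma R(s_1) + ...
```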

Policy \pi: S\longmapsto A

.. The goal is to find the policy that maximizes the expected value of the total payoff.

E.g.,

(Source: https://youtu.be/d5gaWTo6kDM?t=295)

\pi((3,1))=(\leftarrow)\ (\text{west})

Q. How do you compute the optimal policy?

Define: V^\pi, V^*, \pi^*.

For a policy \pi, V^\pi:S\longmapsto\mathbb R is s.t. V^\pi(s) is the expected total payoff for starting in state s and executing \pi.

V^\pi(s)=\mathbb E[R(s_0)+\gamma R(s_1)+\gamma^2R(s_2)+\cdots|\pi,s_0=s].

V^\pi .. “Value function for policy \pi”

Bellman’s equation

V^\pi(s)=R(s)+\gamma\sum_{s'}P_{s\pi(s)}(s')V^\pi(s').

Meaning: your expected total payoff from a given state = the immediate reward you receive + the discount factor times the expected future payoff.

V^\pi(s)=\mathbb E[R(s_0)+\gamma\underbrace{\left(R(s_1)+\gamma R(s_2)+\cdots\right)}_{\sim V^\pi(s_1)}|\pi,s_0=s].

R(s_0): immediate reward

\gamma R(s_1)+\gamma^2 R(s_2)+\cdots: expected future reward

.. s maps to s_0 and s' maps to s_1.

V^\pi(s)=\mathbb E[R(s)+\gamma V^\pi(s')].

In state s, you will take action a=\pi(s):

s'\sim P_{s\pi(s)}.
V^\pi(s)=R(s)+\gamma\sum_{s'}P_{s\pi(s)}(s')V^\pi(s').

Given \pi, get a linear system of equations in terms of V^\pi(s).

E.g.,

V^\pi((3,1))=R((3,1))+\gamma\left[0.8V^\pi((3,2))+0.1V^\pi((2,1))+0.1V^\pi((4,1))\right].

With 11 states, this gives a system of 11 linear equations in 11 unknowns, which you can solve directly.
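
Since Bellman's equation is linear in the unknowns V^\pi(s), it can be solved with one linear solve. A minimal sketch along those lines, again with made-up placeholder arrays rather than the lecture's gridworld:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 11-state, 4-action MDP (placeholders, not the lecture's gridworld).
n_states, n_actions = 11, 4
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random(n_states)
gamma = 0.99
pi = rng.integers(n_actions, size=n_states)       # the policy being evaluated

# Bellman's equation in matrix form: V = R + gamma * P_pi V,
# where P_pi[s, s'] = P_{s, pi(s)}(s').
P_pi = P[np.arange(n_states), pi]                 # shape (11, 11)
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

print(V_pi)                                       # 11 unknowns V^pi(s), one per state
```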

V^* .. “the optimal value function”

V^*(s)=\max_\pi V^\pi(s).

\pi^* .. “optimal policy”

\pi^*(s)=\argmax_a\sum_{s'}P_{sa}(s')V^*(s').

Practice with confusing notation.

V^*(s)=V^{\pi^*}(s)\geq V^\pi(s).

Strategy:

1) Find V^*.

2) Use the argmax equation to find \pi^*.

Value iteration:

Initialize V(s):=0 for every s.

For every s, update:

V(s):=R(s)+\max\limits_a \gamma\sum\limits_{s'}P_{sa}(s')V(s').

V(s): new estimate

V(s'): old estimate

E.g.,

\begin{bmatrix}V((1,1))\\V((1,2))\\\vdots\\V((4,3))\end{bmatrix}\in\mathbb R^{11}.

Andrew Ng comment: Value iteration works fine with either synchronous or asynchronous updates, but most people use the synchronous update because it vectorizes better and you can use more efficient matrix operations. (The algorithm will work in either case.)
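
A minimal sketch of the synchronous, vectorized form of value iteration on a made-up placeholder MDP (P, R, gamma below are hypothetical); the last line also extracts \pi^* via the argmax equation from the strategy above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP (placeholders for the lecture's 11-state gridworld).
n_states, n_actions = 11, 4
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random(n_states)
gamma = 0.99

V = np.zeros(n_states)                       # initialize V(s) := 0 for every s
for _ in range(10_000):
    Q = P @ V                                # Q[s, a] = sum_{s'} P_{sa}(s') V(s'), all (s, a) at once
    V_new = R + gamma * Q.max(axis=1)        # synchronous update of every V(s)
    if np.max(np.abs(V_new - V)) < 1e-10:    # stop once the Bellman backup no longer changes V
        V = V_new
        break
    V = V_new

pi_star = (P @ V).argmax(axis=1)             # pi*(s) = argmax_a sum_{s'} P_{sa}(s') V*(s')
print(V, pi_star)
```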

Q. How do you represent the absorbing state, or the sink state?

A. In this framework, one way to code that up is to set the transition probabilities from that state to every other state to 0 (so it only transitions to itself). Another way: take the 11-state MDP, create a 12th state, and have that state always transition back to itself with no further reward.

MyQ. A state where no moving action is taken vs. a state where the actor moves somewhere and then comes back .. how can these be the same mathematically?

  • Bellman backup operator: V:=B(V)

Exercise: Show that value iteration causes V to converge to V^*.

(Source: https://youtu.be/d5gaWTo6kDM?t=2803)

Action W (\leftarrow):

\sum_{s'}P_{sa}(s')V^*(s')=0.8\times0.75+0.1\times0.69+0.1\times0.71=0.740.

Action N (\uparrow):

\sum_{s'}P_{sa}(s')V^*(s')=0.8\times0.69+0.1\times0.75+0.1\times0.49=0.676.

So going west gives the higher expected value, which is why the optimal policy picks \leftarrow in this state.

Q. Is the number of states always finite?

A. Sure, yep. In what we've discussed so far, yes. But what we'll see on Wednesday is how to generalize this framework.

Policy iteration:

Initialize \pi randomly.

Repeat until convergence:

Set V:=V^\pi (i.e., solve Bellman's equations to get V^\pi).

Set \pi(s):=\argmax\limits_a\sum\limits_{s'}P_{sa}(s')V(s').
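
A minimal sketch of policy iteration in this form, reusing the exact linear solve for V^\pi (all arrays below are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP (placeholders).
n_states, n_actions = 11, 4
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random(n_states)
gamma = 0.99

pi = rng.integers(n_actions, size=n_states)       # initialize pi randomly
while True:
    # Policy evaluation: set V := V^pi by solving Bellman's equations exactly.
    P_pi = P[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

    # Policy improvement: pi(s) := argmax_a sum_{s'} P_{sa}(s') V(s').
    pi_new = (P @ V).argmax(axis=1)
    if np.array_equal(pi_new, pi):                # converged: the greedy policy stops changing
        break
    pi = pi_new

print(pi, V)
```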

Q. What if we don't know P_{sa}?

A. Estimate it from experience:

\begin{aligned}P_{sa}(s')&=\frac{\text{number of times took action } a \text{ in state }s\text{ and got to }s'}{\text{number of times took action }a\text{ in state }s}\\&\left(\text{or }\frac{1}{|S|}\text{ if the above is }\frac{0}{0}.\right)\end{aligned}
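
A minimal sketch of this count-based estimate, assuming a hypothetical log of observed (s, a, s') transitions:

```python
import numpy as np

n_states, n_actions = 11, 4

# Hypothetical experience: a list of observed (s, a, s') transitions.
transitions = [(0, 1, 2), (0, 1, 2), (0, 1, 3), (5, 0, 5)]

counts = np.zeros((n_states, n_actions, n_states))
for s, a, s_next in transitions:
    counts[s, a, s_next] += 1                # times action a in state s led to s'

totals = counts.sum(axis=2, keepdims=True)   # times action a was taken in state s
P_hat = np.where(totals > 0,
                 counts / np.maximum(totals, 1),   # empirical fraction
                 1.0 / n_states)                   # 0/0 case: fall back to uniform 1/|S|

print(P_hat[0, 1])                           # e.g., estimated P_{s=0, a=1}(s')
```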

Putting it together:

Repeat: {

Take actions w.r.t. \pi to get experience in the MDP.

Update estimates of P_{sa} (and possibly R).

Solve Bellman's eqn. using value iteration to get V.

Update \pi(s):=\argmax\limits_a\sum\limits_{s'}P_{sa}(s')V(s').

}
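
A minimal sketch of this loop, using a hidden "true" transition model as a stand-in for the real environment (everything below is a made-up placeholder, and the reward is assumed known):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden "true" MDP standing in for the real environment (placeholder values).
n_states, n_actions = 11, 4
P_true = rng.random((n_states, n_actions, n_states))
P_true /= P_true.sum(axis=2, keepdims=True)
R = rng.random(n_states)                         # reward assumed known here
gamma = 0.99

counts = np.zeros((n_states, n_actions, n_states))
pi = rng.integers(n_actions, size=n_states)      # initial policy

for episode in range(50):
    # 1) Take actions w.r.t. pi to get experience in the MDP.
    s = rng.integers(n_states)
    for _ in range(100):
        a = pi[s]
        s_next = rng.choice(n_states, p=P_true[s, a])
        counts[s, a, s_next] += 1
        s = s_next

    # 2) Update the estimate of P_sa from the counts (uniform where unseen).
    totals = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

    # 3) Solve Bellman's equation with value iteration to get V.
    V = np.zeros(n_states)
    for _ in range(2000):
        V = R + gamma * (P_hat @ V).max(axis=1)

    # 4) Update pi(s) := argmax_a sum_{s'} P_hat_{sa}(s') V(s').
    pi = (P_hat @ V).argmax(axis=1)
```

Note that this loop follows \pi purely greedily, which is exactly the setting where the exploration problem discussed below arises.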

Andrew Ng comment: Usually the reward function is given, but you sometimes see an unknown reward function.

E.g., if you're building a stock trading application, the reward might be the return on a certain day; it may not be a deterministic function of the state and may be a little bit random.

Andrew Ng comment: This algorithm will work okay for some problems, but there's one other issue it cannot solve, which is the exploration problem.

Exploration vs. Exploitation trade-off

Q. How aggressively or how greedy should you be at just taking actions to maximize your rewards?

The algorithm we described is relatively “greedy,” meaning that it always acts according to your current best estimate of the state transition probabilities and rewards.

  • \epsilon-greedy → e.g., with probability 0.9 act w.r.t. \pi and with probability 0.1 act randomly (see the sketch below).
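
A minimal sketch of \epsilon-greedy action selection, with a hypothetical current policy pi and action count:

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4
epsilon = 0.1                                    # exploration probability
pi = rng.integers(n_actions, size=11)            # hypothetical current policy over 11 states

def epsilon_greedy_action(s):
    # With probability 1 - epsilon (e.g., 0.9) follow pi; otherwise pick a random action.
    if rng.random() < epsilon:
        return rng.integers(n_actions)
    return pi[s]

print(epsilon_greedy_action(3))
```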

Q. Should you always keep \epsilon constant, or should you use a dynamic \epsilon?

A. Yes, there are many heuristics for how to explore. One reasonable way: start with a large value of \epsilon and slowly shrink it. Or you can try Boltzmann exploration.

Q. Can you get a reward for reaching states you’ve never seen before?

A. Yes, there is a fascinating line of research called “intrinsic reinforcement learning.” It really started by search indexing. You can google “intrinsic motivation.”

Q. How many actions should you take with respect to π\pi before updating π\pi?

A. There's no harm in doing it as frequently as possible.
