Lecture 18. Continuous State MDP & Model Simulation

cryptnomy · November 27, 2022

CS229: Machine Learning


Lecture video link: https://youtu.be/QFu5nuc-S0s

Outline

  • Discretization
  • Models / Simulation
  • Fitted value iteration

Recap

MDP: $(S, A, \{P_{sa}\}, \gamma, R)$.

$$V^\pi(s)=\mathbb E\Big[R(s_0)+\gamma\underbrace{\left(R(s_1)+\gamma R(s_2)+\cdots\right)}_{\sim V^\pi(s_1)}\,\Big|\,\pi,\,s_0=s\Big].$$

$$V^*(s)=\max_\pi V^\pi(s).$$

$$\pi^*(s)=\arg\max_a\underbrace{\sum_{s'}P_{sa}(s')V^*(s')}_{\mathbb E_{s'\sim P_{sa}}[V^*(s')]}.$$

Value iteration:

$$V(s):=R(s)+\max_a\gamma\sum_{s'}P_{sa}(s')V(s').$$
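As a quick sketch of what this update looks like in code, here is a minimal (illustrative) value iteration loop for a small discrete MDP; the random transition tensor and reward vector below are made up for the example, not taken from the lecture:

```python
import numpy as np

# Illustrative value iteration on a small, randomly generated discrete MDP.
# P[a, s, s'] = P_{sa}(s'), R[s] = R(s), gamma = discount factor.
n_states, n_actions, gamma = 5, 2, 0.95
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)  # normalize each row into a distribution over s'
R = rng.random(n_states)

V = np.zeros(n_states)
for _ in range(1000):
    # V(s) := R(s) + max_a gamma * sum_{s'} P_{sa}(s') V(s')
    V_new = R + gamma * np.max(P @ V, axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Greedy policy: pi*(s) = argmax_a sum_{s'} P_{sa}(s') V(s')
pi = np.argmax(P @ V, axis=0)
print(V, pi)
```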

Q. How do you model the state of a car?

A. A common way: you need (1) a position $(x, y)$, (2) an orientation $\theta$, (3) a velocity $(\dot x, \dot y)$, and (4) an angular velocity $\dot\theta$. (A 6-dimensional state representation.)

Last example: The inverted pendulum.

$x,\theta,\dot x,\dot\theta$.

In this lecture, we focus on problems where the state space is $S=\mathbb R^n$.

Discretization

Discretization is the most straightforward way to work with a continuous state space.

Problems:

(1) It gives only a naive, piecewise-constant representation of $V^*$ (and $\pi^*$).

Analogy: it is like fitting a staircase (piecewise-constant) function to data instead of a smooth curve, which is usually a poor fit.

(Source: https://youtu.be/QFu5nuc-S0s?t=859)

(2) Curse of dimensionality. (The term was coined by Richard Bellman.)

If $S=\mathbb R^n$ and we discretize each dimension into $k$ values, we get $k^n$ discrete states.

(For high-dimensional state spaces, it is not a good representation.)

Suppose you have 100 machines in a giant factory and each machine can be in $k$ different states; the total state space then has $k^{100}$ states.
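A back-of-the-envelope sketch of this blow-up (the choice of $k=10$ bins per dimension is arbitrary):

```python
# Number of discrete states when each of n continuous dimensions is split into k bins.
k = 10
for n in (2, 4, 6, 10, 100):
    print(f"n = {n:>3}: k^n = {float(k) ** n:.3g} states")
# Already at n = 10 a tabular representation of V* would need ~1e10 entries.
```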

Approximate $V^*$ directly without resorting to discretization.

In supervised learning we would fit $y\simeq\theta^Tx$, or better, $y\simeq\theta^T\phi(x)$, where $\phi(x)$ is a feature vector of $x$. Analogously, approximate

$$V^*(s)\simeq\theta^T\phi(s).$$

Model (simulator) of MDP

A model (simulator) is a black box that takes a state $s_t$ and an action $a_t$ as input and outputs a next state $s_{t+1}$ sampled from $P_{s_ta_t}$.

(Source: https://youtu.be/QFu5nuc-S0s?t=1435)

  • Assume the action space is discrete.
  • The action space is usually much lower-dimensional than the state space. E.g., for a car, $s$ is 6-dimensional and $a$ is 2-dimensional (steering and acceleration/braking). For a helicopter, $s$ is 12-dimensional and $a$ is 4-dimensional (two control sticks). For an inverted pendulum, $s$ is 4-dimensional and $a$ is 1-dimensional.

Q. How to get model?

A. Physics simulator.

Learn model from data

$$\begin{aligned}&s_0^{(1)}\xrightarrow{a_0^{(1)}}s_1^{(1)}\xrightarrow{a_1^{(1)}}s_2^{(1)}\xrightarrow{a_2^{(1)}}\cdots\longrightarrow s_T^{(1)}\\&\vdots\\&s_0^{(m)}\longrightarrow\cdots\end{aligned}$$

where the superscript $(i)$ denotes the $i$-th trajectory.

Apply supervised learning to estimate $s_{t+1}$ as a function of $s_t$ and $a_t$.

E.g., linear regression version:

$$s_{t+1}=As_t+Ba_t,$$
$$\min_{A,B}\sum_{i=1}^m\sum_{t=0}^{T-1}\left\|s_{t+1}^{(i)}-\left(As_t^{(i)}+Ba_t^{(i)}\right)\right\|^2.$$
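A minimal sketch of this least-squares fit, assuming trajectories are stored as (states, actions) array pairs; stacking $[s_t; a_t]$ into a single regressor is an implementation choice, not something specified in the lecture:

```python
import numpy as np

def fit_linear_model(trajectories):
    """Fit s_{t+1} ≈ A s_t + B a_t by least squares over all observed transitions.

    trajectories: list of (states, actions) pairs, where states is a (T+1, n)
    array and actions is a (T, d) array from one rollout.
    """
    Z, Y = [], []
    for states, actions in trajectories:
        Z.append(np.hstack([states[:-1], actions]))  # regressors z_t = [s_t; a_t], shape (T, n+d)
        Y.append(states[1:])                         # targets s_{t+1}, shape (T, n)
    Z, Y = np.vstack(Z), np.vstack(Y)
    W, *_ = np.linalg.lstsq(Z, Y, rcond=None)        # minimizes ||Z W - Y||^2, W is (n+d, n)
    n = Y.shape[1]
    A, B = W[:n].T, W[n:].T                          # so that s_{t+1} ≈ A s_t + B a_t
    return A, B
```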

Model:

$$s_{t+1}=As_t+Ba_t\qquad\text{(deterministic)}$$

or

$$s_{t+1}=As_t+Ba_t+\epsilon_t\qquad\text{(stochastic)}$$

where $\epsilon_t\sim\mathcal N(0,\sigma^2I)$.

— Model-based RL.

Andrew Ng comment: I think model-based RL has been taking off faster. A lot of the most promising approaches are model-based RL because if you have a physical robot, you just can’t afford to have a reinforcement learning algorithm bash your robot around for too long. Or how many helicopters do you want to crash before your learning algorithm figures it out?

Model-free RL works fine if you want to play video games, or if you're trying to get a computer to play chess or Othello or Go. You have perfect simulators for those games, which are the games themselves, so your RL algorithm can blow up hundreds of millions of times in a video game.

Although, again, the field is evolving quickly so there’s very interesting work at the intersection of model-based and model-free that gets more complicated.

Q. How to model the distribution of the noise $\epsilon_t$?

A. One thing you could do is estimate it from data. But as a practical matter, a lot of reinforcement learning algorithms will learn a very brittle model that works in your simulator but doesn’t really work when you put it into your real robot.

If you have a deterministic simulator, it's not that hard to use these methods to generate a cool-looking video of your reinforcement learning algorithm supposedly controlling a five-legged robot or something. But it turns out that deterministic methods are more likely to fail on real robots than in simulation.

It is very important to add some noise to your simulator if you want the policy learned in simulation to actually work on a physical robot.

The exact distribution of the noise matters less than the fact that you add some noise at all.
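As a sketch of that point, here is a simulator step that uses the learned $A, B$ and optionally adds Gaussian noise; the function name and interface are my own, purely illustrative:

```python
import numpy as np

def simulate_step(s, a, A, B, sigma=0.0, rng=None):
    """One step of the learned model: s' = A s + B a (+ eps, eps ~ N(0, sigma^2 I)).

    sigma = 0 gives the deterministic simulator; sigma > 0 adds the noise that
    helps keep the learned policy from being brittle on a real robot.
    """
    rng = np.random.default_rng() if rng is None else rng
    s_next = A @ s + B @ a
    if sigma > 0:
        s_next = s_next + sigma * rng.standard_normal(s_next.shape)
    return s_next
```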

Fitted value iteration

Choose a feature map $\phi(s)$ of the state $s$.

$$V(s)=\theta^T\phi(s).$$

E.g., for an inverted pendulum, define

$$\phi(s)=\begin{bmatrix}x\\\dot x\\x^2\\x\dot x\\x\theta\\\vdots\end{bmatrix}.$$
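A sketch of such a hand-designed feature map for the pendulum state $(x,\theta,\dot x,\dot\theta)$; the particular monomials are an illustrative choice, not the exact set from the lecture:

```python
import numpy as np

def phi(s):
    """Hand-designed features of the pendulum state s = (x, theta, xdot, thetadot).
    The specific monomials are an illustrative choice."""
    x, theta, xdot, thetadot = s
    return np.array([1.0, x, theta, xdot, thetadot,
                     x ** 2, x * xdot, x * theta, theta * thetadot])
```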

Value iteration:

$$\begin{aligned}V(s):&=R(s)+\gamma\max_a\sum_{s'}P_{sa}(s')V(s')\\&=R(s)+\gamma\max_a\mathbb E_{s'\sim P_{sa}}[V(s')]\\&=\max_a\mathbb E_{s'\sim P_{sa}}[R(s)+\gamma V(s')]\\&=\max_a q(a).\end{aligned}$$

Fitted value iteration

Sample $\{s^{(1)},s^{(2)},\cdots,s^{(m)}\}\subseteq S$ randomly.

Initialize $\theta:=0$.

Repeat {

    For $i=1,\cdots,m$ {

        For each action $a\in A$ {

            Sample $s'_1,s'_2,\cdots,s'_k\sim P_{s^{(i)}a}$ (← using the model).

            Set $q(a)=\frac{1}{k}\sum\limits_{j=1}^k\left[R(s^{(i)})+\gamma V(s'_j)\right]$.

        }

        Set $y^{(i)}=\max\limits_a q(a)$.

    }

    $\theta:=\arg\min\limits_\theta\frac{1}{2}\sum\limits_{i=1}^m\left(\theta^T\phi(s^{(i)})-y^{(i)}\right)^2$.

}
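Putting the loop above into code, here is a minimal sketch of fitted value iteration. It assumes a model sampler `sample_next_states(s, a, k)` returning $k$ draws $s'\sim P_{sa}$, a reward function `R(s)`, and a feature map `phi(s)`; all of these names are placeholders, not from the lecture:

```python
import numpy as np

def fitted_value_iteration(states, actions, sample_next_states, R, phi,
                           gamma=0.99, k=10, n_iters=50):
    """states: sampled states s^(1), ..., s^(m); actions: the finite action set.
    sample_next_states(s, a, k): k samples s' ~ P_{sa} drawn from the model."""
    theta = np.zeros(len(phi(states[0])))
    V = lambda s: theta @ phi(s)                  # V(s) = theta^T phi(s)
    Phi = np.array([phi(s) for s in states])      # (m, n_features) design matrix
    for _ in range(n_iters):
        y = np.empty(len(states))
        for i, s in enumerate(states):
            # q(a) = (1/k) sum_j [ R(s^(i)) + gamma * V(s'_j) ]
            q = [np.mean([R(s) + gamma * V(sp)
                          for sp in sample_next_states(s, a, k)])
                 for a in actions]
            y[i] = max(q)                         # y^(i) = max_a q(a)
        # theta := argmin_theta (1/2) sum_i (theta^T phi(s^(i)) - y^(i))^2
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```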

Original VI (value iteration):

    $V(s^{(i)}):=y^{(i)}$.

Fitted VI:

    Want $V(s^{(i)})\simeq y^{(i)}$.

    I.e., $\theta^T\phi(s^{(i)})\simeq y^{(i)}$.

Q. How do you choose $m$ and how do you test for overfitting?

A. Usually, you might as well set $m$ to be as big as you feel like, subject to the program not taking too long to run.

Fitted VI gives an approximation to $V^*$.

Implicitly defines $\pi^*$:

$$\pi^*(s)=\arg\max_a\mathbb E_{s'\sim P_{sa}}[V^*(s')].$$

We used samples $s'_1,\cdots,s'_k\sim P_{sa}$ to approximate the expectation.

Say model is

$$s_{t+1}=f(s_t,a_t)+\epsilon_t$$

(e.g., $s_{t+1}=As_t+Ba_t+\epsilon_t$).

At deployment (run-time),

set $\epsilon_t=0$ and $k=1$.

When in state $s$,

Pick action

$$\arg\max_a V(f(s,a))$$

where $f$ is the simulation without noise.

That is, $f(s,a)$ plays the role of a sample $s'\sim P_{sa}$, but comes from the deterministic simulator, so $\mathbb E_{s'\sim P_{sa}}[V(s')]$ is approximated by $V(f(s,a))$.
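A sketch of this deployment-time rule, assuming a deterministic simulator `f_det(s, a)` (e.g., $As+Ba$) and the fitted value function `V(s)` $=\theta^T\phi(s)$; the names are placeholders:

```python
import numpy as np

def greedy_action(s, actions, f_det, V):
    """pi(s) = argmax_a V(f(s, a)), with f the deterministic simulator
    (epsilon_t = 0, k = 1), e.g. f(s, a) = A s + B a."""
    values = [V(f_det(s, a)) for a in actions]
    return actions[int(np.argmax(values))]
```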
