Lecture 18. Continuous State MDP & Model Simulation

cryptnomy · November 27, 2022

CS229: Machine Learning


Lecture video link: https://youtu.be/QFu5nuc-S0s

Outline

  • Discretization
  • Models / Simulation
  • Fitted value iteration

Recap

MDP: $(S, A, \{P_{sa}\}, \gamma, R)$.

$$V^\pi(s)=\mathbb E\Big[R(s_0)+\gamma\underbrace{\left(R(s_1)+\gamma R(s_2)+\cdots\right)}_{\sim V^\pi(s_1)}\,\Big|\,\pi,\,s_0=s\Big].$$

$$V^*(s)=\max_\pi V^\pi(s).$$

$$\pi^*(s)=\arg\max_a\underbrace{\sum_{s'}P_{sa}(s')V^*(s')}_{\mathbb E_{s'\sim P_{sa}}[V^*(s')]}.$$

Value iteration:

$$V(s):=R(s)+\max_a\gamma\sum_{s'}P_{sa}(s')V(s').$$
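As a quick sketch of what this update looks like in code, here is a minimal (illustrative) value iteration loop for a small discrete MDP; the random transition tensor and reward vector below are made up for the example, not taken from the lecture:

```python
import numpy as np

# Illustrative value iteration on a small, randomly generated discrete MDP.
# P[a, s, s'] = P_{sa}(s'), R[s] = R(s), gamma = discount factor.
n_states, n_actions, gamma = 5, 2, 0.95
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)  # normalize each row into a distribution over s'
R = rng.random(n_states)

V = np.zeros(n_states)
for _ in range(1000):
    # V(s) := R(s) + max_a gamma * sum_{s'} P_{sa}(s') V(s')
    V_new = R + gamma * np.max(P @ V, axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Greedy policy: pi*(s) = argmax_a sum_{s'} P_{sa}(s') V(s')
pi = np.argmax(P @ V, axis=0)
print(V, pi)
```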

Q. How do you model the state of a car?

A. A common way: you need (1) a position $(x, y)$, (2) an orientation $\theta$, (3) a velocity $(\dot x, \dot y)$, and (4) an angular velocity $\dot\theta$. (A 6-dimensional state representation.)

Last example: The inverted pendulum.

$x,\theta,\dot x,\dot\theta$.

In this lecture, we focus on problems where the state space is $S=\mathbb R^n$.

Discretization

Discretization is the most straightforward way to work with a continuous state space.

Problems:

(1) It gives only a naive, piecewise-constant representation of $V^*$ (and $\pi^*$).

Analogy: it is like fitting a staircase (piecewise-constant) function to data instead of a smooth curve, which is usually a poor fit.

(Source: https://youtu.be/QFu5nuc-S0s?t=859)

(2) Curse of dimensionality. (The term was coined by Richard Bellman.)

If $S=\mathbb R^n$ and we discretize each dimension into $k$ values, we get $k^n$ discrete states.

(For high-dimensional state spaces, it is not a good representation.)

Suppose you have 100 machines in a giant factory and each machine can be in $k$ different states; the total state space then has $k^{100}$ states.
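A back-of-the-envelope sketch of this blow-up (the choice of $k=10$ bins per dimension is arbitrary):

```python
# Number of discrete states when each of n continuous dimensions is split into k bins.
k = 10
for n in (2, 4, 6, 10, 100):
    print(f"n = {n:>3}: k^n = {float(k) ** n:.3g} states")
# Already at n = 10 a tabular representation of V* would need ~1e10 entries.
```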

Approximate $V^*$ directly without resorting to discretization.

In supervised learning we would fit $y\simeq\theta^Tx$, or better, $y\simeq\theta^T\phi(x)$, where $\phi(x)$ is a feature vector of $x$. Analogously, approximate

$$V^*(s)\simeq\theta^T\phi(s).$$

Model (simulator) of MDP

A model (simulator) is a black box that takes a state $s_t$ and an action $a_t$ as input and outputs a next state $s_{t+1}$ sampled from $P_{s_ta_t}$.

(Source: https://youtu.be/QFu5nuc-S0s?t=1435)

  • Assume the action space is discrete.
  • The action space is usually much lower-dimensional than the state space. E.g., for a car, $s$ is 6-dimensional and $a$ is 2-dimensional (steering and acceleration/braking). For a helicopter, $s$ is 12-dimensional and $a$ is 4-dimensional (two control sticks). For an inverted pendulum, $s$ is 4-dimensional and $a$ is 1-dimensional.

Q. How to get model?

A. Physics simulator.

Learn model from data

$$\begin{aligned}&s_0^{(1)}\xrightarrow{a_0^{(1)}}s_1^{(1)}\xrightarrow{a_1^{(1)}}s_2^{(1)}\xrightarrow{a_2^{(1)}}\cdots\longrightarrow s_T^{(1)}\\&\vdots\\&s_0^{(m)}\longrightarrow\cdots\end{aligned}$$

where the superscript $(i)$ denotes the $i$-th trajectory.

Apply supervised learning to estimate $s_{t+1}$ as a function of $s_t$ and $a_t$.

E.g., linear regression version:

$$s_{t+1}=As_t+Ba_t,$$
$$\min_{A,B}\sum_{i=1}^m\sum_{t=0}^{T-1}\left\|s_{t+1}^{(i)}-\left(As_t^{(i)}+Ba_t^{(i)}\right)\right\|^2.$$
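A minimal sketch of this least-squares fit, assuming trajectories are stored as (states, actions) array pairs; stacking $[s_t; a_t]$ into a single regressor is an implementation choice, not something specified in the lecture:

```python
import numpy as np

def fit_linear_model(trajectories):
    """Fit s_{t+1} ≈ A s_t + B a_t by least squares over all observed transitions.

    trajectories: list of (states, actions) pairs, where states is a (T+1, n)
    array and actions is a (T, d) array from one rollout.
    """
    Z, Y = [], []
    for states, actions in trajectories:
        Z.append(np.hstack([states[:-1], actions]))  # regressors z_t = [s_t; a_t], shape (T, n+d)
        Y.append(states[1:])                         # targets s_{t+1}, shape (T, n)
    Z, Y = np.vstack(Z), np.vstack(Y)
    W, *_ = np.linalg.lstsq(Z, Y, rcond=None)        # minimizes ||Z W - Y||^2, W is (n+d, n)
    n = Y.shape[1]
    A, B = W[:n].T, W[n:].T                          # so that s_{t+1} ≈ A s_t + B a_t
    return A, B
```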

Model:

$$s_{t+1}=As_t+Ba_t\qquad\text{(deterministic)}$$

or

$$s_{t+1}=As_t+Ba_t+\epsilon_t\qquad\text{(stochastic)}$$

where $\epsilon_t\sim\mathcal N(0,\sigma^2I)$.

— Model-based RL.

Andrew Ng comment: I think model-based RL has been taking off faster. A lot of the most promising approaches are model-based RL because if you have a physical robot, you just can’t afford to have a reinforcement learning algorithm bash your robot around for too long. Or how many helicopters do you want to crash before your learning algorithm figures it out?

Model-free RL works fine if you want to play video games, or if you're trying to get a computer to play chess or Othello or Go. You have perfect simulators for those games, which are the games themselves, so your RL algorithm can blow up hundreds of millions of times in a video game.

Although, again, the field is evolving quickly so there’s very interesting work at the intersection of model-based and model-free that gets more complicated.

Q. How to model the distribution of the noise $\epsilon_t$?

A. One thing you could do is estimate it from data. But as a practical matter, a lot of reinforcement learning algorithms will learn a very brittle model that works in your simulator but doesn’t really work when you put it into your real robot.

If you have a deterministic simulator, it's not that hard to use these methods to generate a cool-looking video of your reinforcement learning algorithm supposedly controlling a five-legged robot or something. But it turns out that deterministic methods are more likely to fail on real robots than in simulation.

It is very important to add some noise to your simulator if you want the policy learned in simulation to actually work on a physical robot.

The exact distribution of the noise matters less than the fact that you add some noise at all.
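As a sketch of that point, here is a simulator step that uses the learned $A, B$ and optionally adds Gaussian noise; the function name and interface are my own, purely illustrative:

```python
import numpy as np

def simulate_step(s, a, A, B, sigma=0.0, rng=None):
    """One step of the learned model: s' = A s + B a (+ eps, eps ~ N(0, sigma^2 I)).

    sigma = 0 gives the deterministic simulator; sigma > 0 adds the noise that
    helps keep the learned policy from being brittle on a real robot.
    """
    rng = np.random.default_rng() if rng is None else rng
    s_next = A @ s + B @ a
    if sigma > 0:
        s_next = s_next + sigma * rng.standard_normal(s_next.shape)
    return s_next
```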

Fitted value iteration

Choose a feature map $\phi(s)$ of the state $s$.

$$V(s)=\theta^T\phi(s).$$

E.g., for an inverted pendulum, define

$$\phi(s)=\begin{bmatrix}x\\\dot x\\x^2\\x\dot x\\x\theta\\\vdots\end{bmatrix}.$$
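A sketch of such a hand-designed feature map for the pendulum state $(x,\theta,\dot x,\dot\theta)$; the particular monomials are an illustrative choice, not the exact set from the lecture:

```python
import numpy as np

def phi(s):
    """Hand-designed features of the pendulum state s = (x, theta, xdot, thetadot).
    The specific monomials are an illustrative choice."""
    x, theta, xdot, thetadot = s
    return np.array([1.0, x, theta, xdot, thetadot,
                     x ** 2, x * xdot, x * theta, theta * thetadot])
```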

Value iteration:

$$\begin{aligned}V(s):&=R(s)+\gamma\max_a\sum_{s'}P_{sa}(s')V(s')\\&=R(s)+\gamma\max_a\mathbb E_{s'\sim P_{sa}}[V(s')]\\&=\max_a\mathbb E_{s'\sim P_{sa}}[R(s)+\gamma V(s')]\\&=\max_a q(a).\end{aligned}$$

Fitted value iteration

Sample $\{s^{(1)},s^{(2)},\cdots,s^{(m)}\}\subseteq S$ randomly.

Initialize $\theta:=0$.

Repeat {

    For $i=1,\cdots,m$ {

        For each action $a\in A$ {

            Sample $s'_1,s'_2,\cdots,s'_k\sim P_{s^{(i)}a}$ (← using the model).

            Set $q(a)=\frac{1}{k}\sum\limits_{j=1}^k\left[R(s^{(i)})+\gamma V(s'_j)\right]$.

        }

        Set $y^{(i)}=\max\limits_a q(a)$.

    }

    $\theta:=\arg\min\limits_\theta\frac{1}{2}\sum\limits_{i=1}^m\left(\theta^T\phi(s^{(i)})-y^{(i)}\right)^2$.

}
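Putting the loop above into code, here is a minimal sketch of fitted value iteration. It assumes a model sampler `sample_next_states(s, a, k)` returning $k$ draws $s'\sim P_{sa}$, a reward function `R(s)`, and a feature map `phi(s)`; all of these names are placeholders, not from the lecture:

```python
import numpy as np

def fitted_value_iteration(states, actions, sample_next_states, R, phi,
                           gamma=0.99, k=10, n_iters=50):
    """states: sampled states s^(1), ..., s^(m); actions: the finite action set.
    sample_next_states(s, a, k): k samples s' ~ P_{sa} drawn from the model."""
    theta = np.zeros(len(phi(states[0])))
    V = lambda s: theta @ phi(s)                  # V(s) = theta^T phi(s)
    Phi = np.array([phi(s) for s in states])      # (m, n_features) design matrix
    for _ in range(n_iters):
        y = np.empty(len(states))
        for i, s in enumerate(states):
            # q(a) = (1/k) sum_j [ R(s^(i)) + gamma * V(s'_j) ]
            q = [np.mean([R(s) + gamma * V(sp)
                          for sp in sample_next_states(s, a, k)])
                 for a in actions]
            y[i] = max(q)                         # y^(i) = max_a q(a)
        # theta := argmin_theta (1/2) sum_i (theta^T phi(s^(i)) - y^(i))^2
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```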

Original VI (value iteration):

    $V(s^{(i)}):=y^{(i)}$.

Fitted VI:

    Want $V(s^{(i)})\simeq y^{(i)}$.

    I.e., $\theta^T\phi(s^{(i)})\simeq y^{(i)}$.

Q. How do you choose $m$ and how do you test for overfitting?

A. Usually, you might as well set $m$ to be as big as you feel like, subject to the program not taking too long to run.

Fitted VI gives an approximation to $V^*$.

Implicitly defines $\pi^*$:

$$\pi^*(s)=\arg\max_a\mathbb E_{s'\sim P_{sa}}[V^*(s')].$$

We used samples $s'_1,\cdots,s'_k\sim P_{sa}$ to approximate the expectation.

Say model is

$$s_{t+1}=f(s_t,a_t)+\epsilon_t$$

(e.g., $s_{t+1}=As_t+Ba_t+\epsilon_t$).

At deployment (run-time),

set $\epsilon_t=0$ and $k=1$.

When in state $s$,

Pick action

$$\arg\max_a V(f(s,a))$$

where $f$ is the simulation without noise.

That is, $f(s,a)$ plays the role of a sample $s'\sim P_{sa}$, but comes from the deterministic simulator, so $\mathbb E_{s'\sim P_{sa}}[V(s')]$ is approximated by $V(f(s,a))$.
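A sketch of this deployment-time rule, assuming a deterministic simulator `f_det(s, a)` (e.g., $As+Ba$) and the fitted value function `V(s)` $=\theta^T\phi(s)$; the names are placeholders:

```python
import numpy as np

def greedy_action(s, actions, f_det, V):
    """pi(s) = argmax_a V(f(s, a)), with f the deterministic simulator
    (epsilon_t = 0, k = 1), e.g. f(s, a) = A s + B a."""
    values = [V(f_det(s, a)) for a in actions]
    return actions[int(np.argmax(values))]
```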
