These notes summarize Lecture 2: Markov Decision Process (Youtube) of Professor David Silver's Introduction to Reinforcement Learning (Website).
Markov Process
Introduction
- A Markov decision process (MDP) describes an environment for RL
- Environment is fully observable => MDP
c.f. partially observable => POMDP
- Most RL problems can be formalised as MDPs (even POMDPs can be converted into MDPs!)
Markov property
A state $S_t$ is Markov if and only if:
$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t]$
- Knowing the current state is enough to predict the future! (the history may be thrown away)
State Transition Matrix
- The state transition probability: $P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
- State transition matrix P defines transition probabilities from all states s to all successor states s'.
- $P = \begin{bmatrix} P_{11} & \dots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \dots & P_{nn} \end{bmatrix}$
- For now, assume that $P$ is known.
Markov Process
- A Markov process is a sequence of random states $S_1, S_2, \dots$ with the Markov property.
A Markov Process (a.k.a. Markov Chain) is a tuple ⟨S,P⟩
- S is a finite set of states
- P is a state transition probability matrix,
$P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
- Note: a Markov process only describes transitions between states (a small sampling sketch follows below).
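A minimal sketch of a Markov chain ⟨S,P⟩ in code, assuming we represent $P$ as a row-stochastic numpy array; the three states and the numbers are made up for illustration (this is not the Student example from the lecture):

```python
import numpy as np

# Hypothetical 3-state Markov chain <S, P>; the numbers are made up for illustration.
states = ["A", "B", "C"]
P = np.array([
    [0.5, 0.5, 0.0],   # transitions out of A
    [0.2, 0.0, 0.8],   # transitions out of B
    [0.0, 0.0, 1.0],   # C is absorbing (a terminal state)
])

def sample_chain(P, start=0, steps=10, seed=0):
    """Sample a trajectory S_1, S_2, ... using only the current state (Markov property)."""
    rng = np.random.default_rng(seed)
    s, path = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])   # the next state depends only on the current state
        path.append(s)
    return path

print([states[s] for s in sample_chain(P)])
```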
Markov Reward Process (MRP)
MRP
- A Markov reward process is a Markov process with values.
A Markov Reward Process is a tuple ⟨S,P,R,γ⟩
- S is a finite set of states
- P is a state transition probability matrix,
$P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
- R is a reward function, $R_s = E[R_{t+1} \mid S_t = s]$ ... the expected immediate reward at the current state
- γ is a discount factor, $\gamma \in [0, 1]$
- If the environment is fully observable, then $R$ is known!
Return
The return $G_t$ is the total discounted reward from time-step $t$:
$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- Note: each $R_t$ is a random variable, so $G_t$ is also a random variable.
- Why discount?
- It is mathematically convenient.
- It avoids infinite returns in Markov processes with cycles.
- If the Markov process is guaranteed to terminate, an undiscounted MRP ($\gamma = 1$) may be used.
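A quick numerical sketch of the return defined above; the reward sequence and the discount factor are made up for illustration:

```python
# Discounted return G_t = R_{t+1} + gamma * R_{t+2} + ... for a made-up reward sequence.
rewards = [-2, -2, -2, 10]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
gamma = 0.9

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)   # -2 - 1.8 - 1.62 + 7.29 = 1.87
```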
Value Function
The state-value function $v(s)$ of an MRP is the expected return starting from state $s$:
$v(s) = E[G_t \mid S_t = s]$
- Assuming the transition matrix $P$ is known, and since full observability means the reward at each state is also known, $v(s)$ can be computed!
Example: Student MRP
Student MRP with $\gamma = 0$ (image from: here)
- Since $\gamma = 0$, $v(s)$ is just the expected immediate reward at $s$!
- In this example the reward is received on leaving a state, regardless of the destination. (That is in fact why it is also independent of $P$.)
Student MRP with $\gamma = 0.9$ (image from: here)
- Let $s'$ be the second state from the right.
- $v(s') = -2 + 0.9 \times (0.6 \times 10 + 0.4 \times 1.9) = -2 + 6.084 = 4.084 \approx 4.1$
- By the Bellman equation, $v(s')$ can be computed as above.
Bellman Equation for MRPs
- $v(s) = E[G_t \mid S_t = s]$
  $= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$
  $= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \mid S_t = s]$
  $= E[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$
  $= E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$
- $v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} v(s')$
  Recall, $R_s = E[R_{t+1} \mid S_t = s]$
- Writing this equation for every state $s$ gives a linear system that can be solved directly for $v(s)$. (This requires the assumption that the reward function does not change over time.)
- Matrix form: $v = R + \gamma P v$, where $v$ is a column vector with one entry per state (image from: here)
- $v = (I - \gamma P)^{-1} R$
  The computational complexity is $O(n^3)$ ... not a practical solution for large $n$!
  => iterative methods! (e.g. Dynamic programming, Monte-Carlo evaluation, Temporal-Difference learning; a small example follows below)
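A minimal sketch of both approaches for an MRP, assuming numpy arrays for $P$ and $R$; the 3-state MRP below is made up for illustration (it is not the Student MRP from the slides):

```python
import numpy as np

# Hypothetical 3-state MRP <S, P, R, gamma>; the numbers are made up for illustration.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.0, 0.8],
    [0.0, 0.0, 1.0],   # terminal state loops onto itself with reward 0
])
R = np.array([-1.0, -2.0, 0.0])   # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Direct solution v = (I - gamma P)^{-1} R -- O(n^3), only feasible for small n.
v_direct = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Iterative alternative: repeatedly apply the Bellman backup v <- R + gamma P v.
v_iter = np.zeros(len(R))
for _ in range(1000):
    v_iter = R + gamma * P @ v_iter

print(v_direct, v_iter)   # the two estimates agree up to numerical tolerance
```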
Markov Decision Process (MDP)
MDP
- A Markov decision process is a Markov reward process with decisions.
- It is an environment in which all states are Markov.
A Markov Decision Process is a tuple ⟨S,A,P,R,γ⟩
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix,
$P_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
- R is a reward function, $R_s^a = E[R_{t+1} \mid S_t = s, A_t = a]$
- γ is a discount factor, $\gamma \in [0, 1]$
- In an MRP, the transitions and rewards were determined by the state alone (no choice for the agent).
- In an MDP, the transitions and rewards are determined by the agent's action in each state (a small array encoding is sketched below).
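One common way to encode the MDP tuple in code is with an array P[a, s, s'] for the transition probabilities and R[s, a] for the expected rewards; a minimal sketch with made-up numbers (not the Student MDP):

```python
import numpy as np

# Hypothetical MDP <S, A, P, R, gamma> with 2 states and 2 actions; numbers are made up.
# P[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]; each row P[a, s, :] sums to 1.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])

# R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([
    [1.0, -1.0],   # state 0
    [0.0,  2.0],   # state 1
])

gamma = 0.9
assert np.allclose(P.sum(axis=2), 1.0)   # sanity check: valid transition probabilities
```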
Example: Student MDP
Image from: here
- When the agent is at the middle-left node, that node is the state, and the available actions are Facebook and Study.
- In this MDP the transitions and rewards for these actions are deterministic: choosing Facebook gives a reward of -1 and moves to the upper node, while choosing Study gives a reward of -2 and moves to the right. (How the agent chooses is the policy!)
- The point at the bottom shows that choosing Pub leads to a random transition according to the given probabilities.
Policies
- A policy fully defines the behaviour of an agent.
A policy π is a distribution over actions given states,
$\pi(a \mid s) = P[A_t = a \mid S_t = s]$
- The probability of taking action $a$ in state $s$ (the action is defined stochastically).
- MDP policies depend on the current state (not the history!) => policies are stationary (time-independent).
- For a fixed policy $\pi$, $P^\pi$ and $R^\pi$ are determined,
where
- $P_{ss'}^\pi = \sum_{a \in A} \pi(a \mid s) P_{ss'}^a$
  => the probability of transitioning from $s$ to $s'$ when following policy $\pi$. ($P_{ss'}^a$ is fixed by the environment!)
- $R_s^\pi = \sum_{a \in A} \pi(a \mid s) R_s^a$
  => the expected reward obtained at $s$ (on leaving $s$) when following policy $\pi$. (Likewise, $R_s^a$ is fixed by the environment!)
- $S_1, S_2, \dots$ is a Markov process $\langle S, P^\pi \rangle$
  $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle S, P^\pi, R^\pi, \gamma \rangle$
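A minimal sketch of collapsing an MDP under a fixed policy into the induced MRP $\langle S, P^\pi, R^\pi, \gamma \rangle$, assuming the array layout P[a, s, s'], R[s, a] from the earlier sketch; the MDP and policy numbers are made up:

```python
import numpy as np

# Made-up MDP: P[a, s, s'] and R[s, a], same layout as the earlier sketch.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, -1.0], [0.0, 2.0]])

# A fixed stochastic policy, pi[s, a] = pi(a | s).
pi = np.array([[0.5, 0.5], [0.9, 0.1]])

P_pi = np.einsum("sa,asp->sp", pi, P)   # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
R_pi = (pi * R).sum(axis=1)             # R^pi_s    = sum_a pi(a|s) R^a_s

print(P_pi)   # rows sum to 1: a valid MRP transition matrix
print(R_pi)
```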
Value Function
The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$:
$v_\pi(s) = E_\pi[G_t \mid S_t = s]$
The action-value function $q_\pi(s, a)$ of an MDP is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:
$q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$
- Note: $E_{a \sim \pi(\cdot \mid s)}[q_\pi(s, a) \mid S_t = s] = \sum_{a \in A} \pi(a \mid s) q_\pi(s, a) = v_\pi(s)$
- At each state, compare the action values and take the best action!
Bellman Expectation Equation
- Note: π is given!
- $v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
  $= \sum_{a \in A} \pi(a \mid s) q_\pi(s, a)$
  $= \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s') \right)$ (see below!)
- $q_\pi(s, a) = E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
  $= E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$
  $= R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s')$
  $= R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') q_\pi(s', a')$
- Matrix form: $v_\pi = R^\pi + \gamma P^\pi v_\pi$
  where $R^\pi = [R_1^\pi, \dots, R_n^\pi]^T$ and $(P^\pi)_{ij} = P_{ij}^\pi$
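A minimal sketch of policy evaluation with the Bellman expectation equation, both via the matrix form $v_\pi = (I - \gamma P^\pi)^{-1} R^\pi$ and by iterating the backup; the MDP and policy are the same made-up arrays as in the previous sketches:

```python
import numpy as np

# Made-up MDP and policy, same array layout as the earlier sketches.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[a, s, s']
R = np.array([[1.0, -1.0], [0.0, 2.0]])    # R[s, a]
pi = np.array([[0.5, 0.5], [0.9, 0.1]])    # pi[s, a]
gamma = 0.9

# Induced MRP under pi.
P_pi = np.einsum("sa,asp->sp", pi, P)
R_pi = (pi * R).sum(axis=1)

# 1) Matrix form: v_pi = (I - gamma P^pi)^{-1} R^pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# 2) Iterative policy evaluation: repeatedly apply v <- R^pi + gamma P^pi v
v = np.zeros(2)
for _ in range(1000):
    v = R_pi + gamma * P_pi @ v

# q_pi(s,a) = R^a_s + gamma sum_{s'} P^a_{ss'} v_pi(s'), and v_pi(s) = sum_a pi(a|s) q_pi(s,a)
q_pi = R + gamma * np.einsum("asp,p->sa", P, v_pi)
assert np.allclose((pi * q_pi).sum(axis=1), v_pi)
print(v_pi, v)
```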
Example: State-Value Function for Student MDP
Image from: here
- Let $s$ be the state at the middle-left node.
- $v_\pi(s) = E_\pi[G_t \mid S_t = s] = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] = R_s^\pi + \gamma \sum_{s' \in S} P_{ss'}^\pi v_\pi(s')$
- $R_s^\pi = 0.5 \times (-1) + 0.5 \times (-2) = -1.5$
- $\gamma \sum_{s' \in S} P_{ss'}^\pi v_\pi(s') = 1.0 \times (0.5 \times (-2.3) + 0.5 \times 2.7) = 1.0 \times 0.2 = 0.2$
- $v_\pi(s) = -1.5 + 0.2 = -1.3$
- It can also be computed in another way:
- $v_\pi(s) = \sum_{a \in A} \pi(a \mid s) q_\pi(s, a) = \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s') \right) = 0.5 \times (-1 + 1.0 \times (-2.3)) + 0.5 \times (-2 + 1.0 \times 2.7) = -1.3$
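The same arithmetic, checked in code; the successor values -2.3 and 2.7 and the uniform random policy $\pi(a \mid s) = 0.5$ are read off the figure, with $\gamma = 1$ as in the example:

```python
# Values of the two successor states, read off the Student MDP figure.
v_facebook, v_study = -2.3, 2.7
gamma = 1.0

# Via R^pi and P^pi (uniform random policy, pi(a|s) = 0.5):
R_pi = 0.5 * (-1) + 0.5 * (-2)                             # -1.5
v_s  = R_pi + gamma * (0.5 * v_facebook + 0.5 * v_study)   # -1.3

# Via action values: v_pi(s) = sum_a pi(a|s) q_pi(s, a)
q_facebook = -1 + gamma * v_facebook                       # -3.3
q_study    = -2 + gamma * v_study                          #  0.7
v_s_alt    = 0.5 * q_facebook + 0.5 * q_study              # -1.3

print(v_s, v_s_alt)
```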
Optimal Value Function
The optimal state-value function $v_*(s)$ is the maximum value function over all policies:
$v_*(s) = \max_\pi v_\pi(s)$
The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies:
$q_*(s, a) = \max_\pi q_\pi(s, a)$
- Finding $q_*$ means we have found the optimal policy $\pi$! (because it tells us which action is best in each state)
Optimal Policy
- Def: $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s), \forall s$
- For any MDP,
- There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \geq \pi, \forall \pi$
- All optimal policies achieve the optimal state-value function and action-value function,
i.e., $v_{\pi_*}(s) = v_*(s)$ and $q_{\pi_*}(s, a) = q_*(s, a)$
- Finding the optimal policy:
- A deterministic optimal policy exists for any MDP:
  $\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \operatorname{argmax}_{a \in A} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$
- Recall: knowing $q_*$ is the same as knowing $\pi_*$! (a small extraction sketch follows below)
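A minimal sketch of reading the deterministic optimal policy off a $q_*$ table; the table itself is made up for illustration:

```python
import numpy as np

# Made-up q_*(s, a) table: rows = states, columns = actions.
q_star = np.array([
    [1.0, 3.0],
    [2.5, 0.5],
    [0.0, 0.0],
])

# pi_*(a|s) = 1 if a = argmax_a q_*(s, a), else 0 (ties broken arbitrarily here).
best_actions = q_star.argmax(axis=1)
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(q_star)), best_actions] = 1.0
print(pi_star)
```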
Example: Optimal Policy for Student MDP
Image from: here
- The red arrows indicate the (deterministic) optimal policy!
Bellman Optimality Equation for v∗
- $v_*(s) = \max_a q_*(s, a)$
- $\because$ $v_*(s) = \sum_{a \in A} \pi_*(a \mid s) q_*(s, a)$
  and $\pi_*(a_* \mid s) = 1$ where $a_* = \operatorname{argmax}_{a \in A} q_*(s, a)$
- Note: this is a necessary and sufficient condition for an optimal policy, i.e.,
  $v_\pi(s) = \max_{a \in A} q_\pi(s, a), \forall s \in S \iff \pi$ is an optimal policy
- $q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s')$
- Therefore,
- $v_*(s) = \max_a \left[ R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s') \right]$
- $q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} q_*(s', a')$
- The Bellman optimality equation is non-linear (because of the max) and has no closed-form solution in general.
- Solve with iterative methods (e.g. Value iteration, Policy iteration, Q-learning, Sarsa, ...); a value iteration sketch follows below.
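A minimal sketch of value iteration, i.e. repeatedly applying the Bellman optimality backup $v(s) \leftarrow \max_a [R_s^a + \gamma \sum_{s'} P_{ss'}^a v(s')]$; the MDP is the same made-up 2-state, 2-action example used in the sketches above:

```python
import numpy as np

# Made-up MDP, same layout as before: P[a, s, s'] and R[s, a].
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, -1.0], [0.0, 2.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # q(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v(s')
    q = R + gamma * np.einsum("asp,p->sa", P, v)
    v_new = q.max(axis=1)                      # Bellman optimality backup
    if np.max(np.abs(v_new - v)) < 1e-10:      # stop once v is (numerically) a fixed point
        v = v_new
        break
    v = v_new

q_star = R + gamma * np.einsum("asp,p->sa", P, v)
pi_star = q_star.argmax(axis=1)                # greedy (deterministic) policy w.r.t. q_*
print(v, pi_star)
```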
Example: Bellman Optimality Equation in Student MDP
Image from: here
Extensions to MDPs
POMDPs
A Partially Observable Markov Decision Process is an MDP with hidden states. It is a hidden Markov model with actions.
A POMDP is a tuple ⟨S,A,O,P,R,Z,γ⟩
- S is a finite set of states
- A is a finite set of actions
- O is a finite set of observations
- P is a state transition probability matrix,
$P_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
- R is a reward function, $R_s^a = E[R_{t+1} \mid S_t = s, A_t = a]$
- Z is an observation function,
  $Z_{s'o}^a = P[O_{t+1} = o \mid S_{t+1} = s', A_t = a]$
- γ is a discount factor, $\gamma \in [0, 1]$
Belief States
A history $H_t$ is a sequence of actions, observations and rewards,
$H_t = A_0, O_1, R_1, \dots, A_{t-1}, O_t, R_t$
A belief state $b(h)$ is a probability distribution over states, conditioned on the history $h$:
$b(h) = (P[S_t = s^1 \mid H_t = h], \dots, P[S_t = s^n \mid H_t = h])$
Reductions of POMDPs
- The history $H_t$ satisfies the Markov property
  => A POMDP can be reduced to an (infinite) history tree
- The belief state $b(H_t)$ satisfies the Markov property
  => A POMDP can be reduced to an (infinite) belief state tree
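The lecture does not spell out how the belief is maintained, but as a supplementary sketch (an assumption on my part, not from the slides), the belief state can be updated recursively with a standard Bayes filter using the $P$ and $Z$ defined above; the 2-state POMDP numbers below are made up:

```python
import numpy as np

def belief_update(b, a, o, P, Z):
    """One step of the standard Bayes-filter belief update for a POMDP.

    b: current belief over states, shape (n_states,)
    a: action taken, o: observation received afterwards
    P[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
    Z[a, s', o] = P[O_{t+1} = o | S_{t+1} = s', A_t = a]
    """
    predicted = b @ P[a]                     # predict: push the belief through the transition model
    unnormalised = Z[a, :, o] * predicted    # correct: weight by the observation likelihood
    return unnormalised / unnormalised.sum()

# Made-up 2-state, 1-action, 2-observation POMDP for illustration.
P = np.array([[[0.7, 0.3], [0.2, 0.8]]])   # P[a, s, s']
Z = np.array([[[0.9, 0.1], [0.4, 0.6]]])   # Z[a, s', o]
b = np.array([0.5, 0.5])                    # initial belief
print(belief_update(b, a=0, o=1, P=P, Z=Z))
```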
If you spot any typos or mistakes, please let me know in the comments!