Silver RL (2) Markov Decision Process

Sanghyeok Choi · December 22, 2021

This post summarizes Lecture 2: Markov Decision Process (Youtube) of Professor David Silver's Introduction to Reinforcement Learning (Website).

Markov Process

Introduction

  • A Markov decision process (MDP) describes an environment for RL
  • Environment is fully observable => MDP
    c.f. partially observable => POMDP
  • Most RL problems can be formalised as MDPs (even a POMDP can be converted into an MDP!)

Markov property

$S_t$ is a Markov state if and only if:
$$\mathrm{P}(S_{t+1} \mid S_t)=\mathrm{P}(S_{t+1} \mid S_1, \dots, S_t)$$

  • To predict the future, knowing the current state is sufficient! (the history may be thrown away)

State Transition Matrix

  • The state transition probability: $\mathcal{P}_{ss'}=\mathbb{P}[S_{t+1}=s' \mid S_t=s]$
  • The state transition matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$:
    $$\mathcal{P}=\begin{bmatrix} \mathcal{P}_{11} & \dots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \dots & \mathcal{P}_{nn} \end{bmatrix}$$
  • For now, assume $\mathcal{P}$ is known.

Markov Process

  • A Markov process is a sequence of random states $S_1, S_2, \dots$ with the Markov property.

    A Markov Process (a.k.a. Markov Chain) is a tuple $\langle \mathcal{S},\mathcal{P} \rangle$

    • $\mathcal{S}$ is a finite set of states
    • $\mathcal{P}$ is a state transition probability matrix,
      $\mathcal{P}_{ss'}=\mathbb{P}[S_{t+1}=s' \mid S_t=s]$
  • Note: a Markov process only describes how states transition (no rewards, no actions). (A small sampling sketch follows.)
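To make this concrete, here is a minimal sketch (not from the lecture) of sampling a state sequence from a Markov chain given a state set and a transition matrix; the state names and probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical 3-state Markov chain; names and probabilities are made up.
states = ["Class", "Facebook", "Sleep"]           # S: finite set of states
P = np.array([[0.5, 0.3, 0.2],                    # P[i, j] = P[S_{t+1}=j | S_t=i]
              [0.4, 0.5, 0.1],
              [0.0, 0.0, 1.0]])                   # "Sleep" is absorbing (terminal)

def sample_chain(start=0, max_steps=20, seed=0):
    """Sample S_1, S_2, ... where each next state depends only on the current one."""
    rng = np.random.default_rng(seed)
    s, path = start, [start]
    for _ in range(max_steps):
        s = rng.choice(len(states), p=P[s])       # Markov property: only the current state matters
        path.append(s)
        if states[s] == "Sleep":                  # stop at the terminal state
            break
    return [states[i] for i in path]

print(sample_chain())
```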

Markov Reward Process (MRP)

MRP

  • A Markov reward process is a Markov process with values.

    A Markov Reward Process is a tuple $\langle \mathcal{S},\mathcal{P},\mathcal{R},\gamma \rangle$

    • $\mathcal{S}$ is a finite set of states
    • $\mathcal{P}$ is a state transition probability matrix,
      $\mathcal{P}_{ss'}=\mathbb{P}[S_{t+1}=s' \mid S_t=s]$
    • $\mathcal{R}$ is a reward function, $\mathcal{R}_s=\mathbb{E}[R_{t+1} \mid S_t=s]$ ... the expected immediate reward in the current state
    • $\gamma$ is a discount factor, $\gamma \in [0,1]$
  • If the environment is fully observable, $\mathcal{R}$ is known!

Return

The return $G_t$ is the total discounted reward from time-step $t$:
$$G_t = R_{t+1}+\gamma R_{t+2}+\dots = \sum\limits^{\infty}_{k=0}{\gamma^k R_{t+k+1}}$$

  • Note: each $R_t$ is a random variable, so $G_t$ is also a random variable.
  • Why discount?
    • It is mathematically convenient.
    • It avoids infinite returns when there are cycles.
    • If the Markov process is guaranteed to terminate, an undiscounted ($\gamma=1$) MRP may be used.
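As a quick illustration of the definition, a minimal sketch that computes the discounted return of a finite sampled reward sequence (the rewards below are made up):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite sampled trajectory."""
    g = 0.0
    for r in reversed(rewards):        # accumulate backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Made-up rewards (R_{t+1}, R_{t+2}, ...):
print(discounted_return([-2, -2, -2, 10], gamma=0.9))   # -2 - 1.8 - 1.62 + 7.29 = 1.87
```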

Value Function

The state-value function $v(s)$ of an MRP is the expected return starting from state $s$:
$$v(s)=\mathbb{E}[G_t \mid S_t=s]$$

  • Assuming the transition matrix $\mathcal{P}$ is known, and since full observability means the reward at each state is also known, $v(s)$ can be computed!

Example: Student MRP

Student MRP with $\gamma=0$
Image from: here

  • Since $\gamma=0$, $v(s)$ = the expected immediate reward at $s$!
  • In this example the reward is received upon leaving a state, regardless of the destination. (That is in fact why it is also independent of $\mathcal{P}$.)


Student MRP with $\gamma=0.9$
Image from: here

  • Let $s'$ be the second state from the right.
    • $v(s')=-2+0.9\times(0.6\times 10+0.4\times 1.9)=-2+6.084=4.084\approx 4.1$
    • By the Bellman equation (see below), $v(s')$ can be computed as above.

Bellman Equation for MRPs

  • $$\begin{aligned} v(s)&=\mathbb{E}[G_t \mid S_t=s]\\ &=\mathbb{E}[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots \mid S_t=s]\\ &=\mathbb{E}[R_{t+1}+\gamma(R_{t+2}+\gamma R_{t+3}+\dots) \mid S_t=s]\\ &=\mathbb{E}[R_{t+1}+\gamma G_{t+1} \mid S_t=s]\\ &=\mathbb{E}[R_{t+1}+\gamma v(S_{t+1}) \mid S_t=s] \end{aligned}$$

  • $v(s)=\mathcal{R}_s+\gamma \sum\limits_{s'\in \mathcal{S}}{\mathcal{P}_{ss'}v(s')}$
    Recall, $\mathcal{R}_s=\mathbb{E}[R_{t+1} \mid S_t=s]$

  • Writing this equation for every state $s$ gives a system of linear equations, which can be solved simultaneously for $v(s)$. (This requires the assumption that rewards do not change over time.)

  • Matrix form: $\mathrm{v}=\mathcal{R}+\gamma\mathcal{P}\mathrm{v}$
    where $\mathrm{v}$ is a column vector with one entry per state

    Image from: here

  • $\mathrm{v}=(I-\gamma \mathcal{P})^{-1}\mathcal{R}$ (a small solver sketch follows this list)
    The computational complexity is $O(n^3)$ ... not a practical solution for large $n$!
    => iterative methods! (e.g. dynamic programming, Monte-Carlo evaluation, temporal-difference learning)
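A minimal NumPy sketch of the direct solve $\mathrm{v}=(I-\gamma\mathcal{P})^{-1}\mathcal{R}$; the 3-state MRP below is made up for illustration (it is not the Student MRP from the slides).

```python
import numpy as np

# Made-up 3-state MRP.
P = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.5, 0.1],
              [0.0, 0.0, 1.0]])        # transition matrix, each row sums to 1
R = np.array([-2.0, -1.0, 0.0])        # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Solve (I - gamma*P) v = R directly instead of forming the inverse explicitly.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```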

Markov Decision Process (MDP)

MDP

  • A Markov decision process is a Markov reward process with decisions.
  • It is an environment in which all states are Markov.

    A Markov Decision Process is a tuple $\langle \mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma \rangle$

    • $\mathcal{S}$ is a finite set of states
    • $\mathcal{A}$ is a finite set of actions
    • $\mathcal{P}$ is a state transition probability matrix,
      $\mathcal{P}_{ss'}^a=\mathbb{P}[S_{t+1}=s' \mid S_t=s, A_t=a]$
    • $\mathcal{R}$ is a reward function, $\mathcal{R}_s^a=\mathbb{E}[R_{t+1} \mid S_t=s,A_t=a]$
    • $\gamma$ is a discount factor, $\gamma \in [0,1]$
  • In an MRP, transitions and rewards were determined by the state alone (no choice by the agent).
  • In an MDP, transitions and rewards are determined by the action the agent takes in each state.

Example: Student MDP

Image from: here

  • When the agent is at the center-left node, that node is the state, and the available actions are Facebook and Study.
  • In this MDP the transitions and rewards are deterministic: choosing Facebook gives a reward of -1 and moves to the upper node, while choosing Study gives a reward of -2 and moves to the right node. (How the agent chooses is the policy!)
  • The point at the bottom shows that when Pub is chosen, the next state is sampled randomly according to the given probabilities.

Policies

  • A policy fully defines the behaviour of an agent.

    A policy $\pi$ is a distribution over actions given states,
    $\pi(a \mid s)=\mathbb{P}[A_t=a \mid S_t=s]$

  • The probability of taking action $a$ in state $s$ (actions are defined stochastically).
  • MDP policies depend on the current state (not the history! => policies are time-independent)
  • For a fixed $\pi$, $\mathcal{P}^\pi$ and $\mathcal{R}^\pi$ are determined (see the sketch after this list),
    where
    • $\mathcal{P}^\pi_{ss'}=\sum\limits_{a\in\mathcal{A}}\pi(a \mid s)\mathcal{P}^a_{ss'}$
      => the probability of transitioning from $s$ to $s'$ when following policy $\pi$. ($\mathcal{P}^a_{ss'}$ is determined by the environment!)
    • $\mathcal{R}^\pi_{s}=\sum\limits_{a\in\mathcal{A}}\pi(a \mid s)\mathcal{R}^a_s$
      => the expected reward obtained in $s$ (when leaving $s$) while following policy $\pi$. (Likewise, $\mathcal{R}^a_s$ is determined by the environment!)
  • $S_1, S_2, \dots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^\pi \rangle$
    $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle \mathcal{S}, \mathcal{P}^\pi,\mathcal{R}^\pi,\gamma \rangle$
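A minimal sketch of these two sums, assuming the MDP is stored as made-up NumPy arrays `P[a, s, s']` and `R[a, s]` (the names, shapes, and numbers are illustrative, not from the lecture):

```python
import numpy as np

def induced_mrp(P, R, pi):
    """P: (A, S, S) with P[a, s, s'], R: (A, S) with R[a, s], pi: (S, A) with pi[s, a] = pi(a|s).
    Returns the induced MRP quantities P_pi (S, S) and R_pi (S,)."""
    P_pi = np.einsum("sa,asx->sx", pi, P)   # P^pi_{ss'} = sum_a pi(a|s) * P^a_{ss'}
    R_pi = np.einsum("sa,as->s", pi, R)     # R^pi_s     = sum_a pi(a|s) * R^a_s
    return P_pi, R_pi

# Made-up 2-action, 3-state MDP and a uniform random policy.
P = np.array([[[0.8, 0.2, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]],
              [[0.1, 0.9, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]]])
R = np.array([[-1.0, -2.0, 0.0],
              [-2.0,  1.0, 0.0]])
pi = np.full((3, 2), 0.5)
P_pi, R_pi = induced_mrp(P, R, pi)
print(P_pi, R_pi)
```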

Value Function

The state-value function $v_{\pi}(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$:
$$v_{\pi}(s)=\mathbb{E}_{\pi}[G_t \mid S_t=s]$$

The action-value function $q_{\pi}(s, a)$ of an MDP is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:
$$q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_t \mid S_t=s, A_t=a]$$

  • Note: $\mathbb{E}_{a\sim\pi(\cdot \mid s)}[q_{\pi}(s, a) \mid S_t=s]=\sum\limits_{a\in\mathcal{A}}{\pi(a \mid s)q_{\pi}(s,a)}=v_{\pi}(s)$
  • In each state, compare the action values and take the best action!

Bellman Expectation Equation

  • Note: $\pi$ is given!
  • $$\begin{aligned} v_{\pi}(s)&=\mathbb{E}_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_t=s]\\ &=\sum\limits_{a\in\mathcal{A}}{\pi(a \mid s)q_{\pi}(s,a)}\\ &=\sum\limits_{a\in\mathcal{A}}{\pi(a \mid s)\Big(\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^{a}_{ss'}v_{\pi}(s')\Big)} \end{aligned}$$ (see below!)
  • $$\begin{aligned} q_{\pi}(s,a)&=\mathbb{E}_{\pi}[R_{t+1}+\gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t=s,A_t=a]\\ &=\mathbb{E}_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_t=s,A_t=a]\\ &=\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^{a}_{ss'}v_{\pi}(s')\\ &=\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^{a}_{ss'}\sum\limits_{a'\in\mathcal{A}}\pi(a' \mid s')q_{\pi}(s',a') \end{aligned}$$
  • Matrix form: $\mathrm{v}_{\pi}=\mathcal{R}^{\pi}+\gamma\mathcal{P}^{\pi}\mathrm{v}_{\pi}$
    where $\mathcal{R}^{\pi}=[\mathcal{R}^{\pi}_1,\dots,\mathcal{R}^{\pi}_n]^T$ and $(\mathcal{P}^{\pi})_{ij}=\mathcal{P}^{\pi}_{ij}$ ... so $v_\pi$ can be evaluated with the same linear solve as for an MRP (a sketch follows).
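A sketch of that exact policy evaluation, assuming the same made-up `P[a, s, s']` / `R[a, s]` layout as in the previous snippet:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Exact policy evaluation: solve v_pi = R_pi + gamma * P_pi v_pi, then recover q_pi."""
    P_pi = np.einsum("sa,asx->sx", pi, P)                 # P^pi
    R_pi = np.einsum("sa,as->s", pi, R)                   # R^pi
    v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
    q_pi = R + gamma * np.einsum("asx,x->as", P, v_pi)    # q_pi[a, s] = R^a_s + gamma*sum_s' P^a_{ss'} v_pi(s')
    return v_pi, q_pi

# Made-up MDP (same layout as before) evaluated under a uniform random policy.
P = np.array([[[0.8, 0.2, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]],
              [[0.1, 0.9, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]]])
R = np.array([[-1.0, -2.0, 0.0],
              [-2.0,  1.0, 0.0]])
v_pi, q_pi = evaluate_policy(P, R, np.full((3, 2), 0.5), gamma=0.9)
print(v_pi)
```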

Example: State-Value Function for Student MDP

Image from: here

  • Let $s$ be the state at the center-left node.
    • $$\begin{aligned} v_{\pi}(s)&=\mathbb{E}_{\pi}[G_t \mid S_t=s]\\ &=\mathbb{E}_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_t=s]\\ &=\mathcal{R}_{s}^{\pi}+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^{\pi}_{ss'}v_{\pi}(s') \end{aligned}$$
    • $\mathcal{R}_{s}^{\pi}=0.5\times(-1)+0.5\times(-2)=-1.5$
    • $\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^{\pi}_{ss'}v_{\pi}(s')=1.0\times(0.5\times(-2.3)+0.5\times(2.7))=1.0\times0.2=0.2$
    • $v_{\pi}(s)=-1.5+0.2=-1.3$
  • It can also be computed another way (a quick numerical check of both follows):
    • $$\begin{aligned} v_{\pi}(s)&=\sum\limits_{a\in\mathcal{A}}{\pi(a \mid s)q_{\pi}(s,a)}\\ &=\sum\limits_{a\in\mathcal{A}}{\pi(a \mid s)\Big(\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^{a}_{ss'}v_{\pi}(s')\Big)}\\ &=0.5\times(-1+1.0\times(-2.3))+0.5\times(-2+1.0\times2.7)=-1.3 \end{aligned}$$
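A quick numerical check of the two computations above, using only the successor values already shown in the figure (and $\gamma=1$, as in the example):

```python
# Successor values from the figure: v_pi = -2.3 (upper node) and 2.7 (right node); gamma = 1.
gamma, v_up, v_right = 1.0, -2.3, 2.7

r_pi = 0.5 * (-1) + 0.5 * (-2)                                       # R^pi_s = -1.5
v_s = r_pi + gamma * (0.5 * v_up + 0.5 * v_right)                    # MRP-form Bellman expectation
v_s_alt = 0.5 * (-1 + gamma * v_up) + 0.5 * (-2 + gamma * v_right)   # via q_pi(s, a)
print(v_s, v_s_alt)   # both ≈ -1.3
```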

Optimal Value Function

The optimal state-value function $v_{*}(s)$ is the maximum value function over all policies:
$$v_{*}(s)=\max\limits_{\pi}{v_{\pi}(s)}$$

The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies:
$$q_*(s,a)=\max\limits_{\pi}{q_{\pi}(s,a)}$$

  • Finding $q_*$ amounts to finding the optimal $\pi$! (because we then know which action is best in each state)

Optimal Policy

  • Def: $\pi \geq \pi'$ if $v_{\pi}(s) \geq v_{\pi'}(s), \forall s$
  • For any MDP,
    • There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \geq \pi, \forall\pi$
    • All optimal policies achieve the optimal state-value function and action-value function,
      i.e., $v_{\pi_*}(s)=v_*(s)$ and $q_{\pi_*}(s,a)=q_*(s,a)$
  • Finding an optimal policy:
    • A deterministic optimal policy exists for any MDP:
      $$\pi_*(a \mid s)=\begin{cases} 1 & \text{if } a=\argmax\limits_{a\in\mathcal{A}} q_*(s,a)\\ 0 & \text{otherwise} \end{cases}$$
    • Recall, knowing $q_*$ is the same as knowing $\pi_*$! (see the sketch after this list)
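Extracting this deterministic optimal policy from $q_*$ is just a per-state argmax. A minimal sketch with a made-up $q_*$ table `q_star[s, a]`:

```python
import numpy as np

# Made-up q_* table: 3 states, 2 actions, q_star[s, a] = q_*(s, a).
q_star = np.array([[ 1.0, 2.5],
                   [-0.5, 0.0],
                   [ 3.0, 3.0]])

greedy = q_star.argmax(axis=1)                     # a_*(s) = argmax_a q_*(s, a)
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(q_star)), greedy] = 1.0      # pi_*(a|s) = 1 for the greedy action, 0 otherwise
print(pi_star)
```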

Example: Optimal Policy for Student MDP

Image from: here

  • The red arrows indicate the (deterministic) optimal policy!

Bellman Optimality Equation for vv_*

  • $v_*(s)=\max\limits_{a}{q_*(s,a)}$
    • $\because v_*(s)=\sum\limits_{a\in\mathcal{A}}\pi_*(a \mid s)q_*(s,a)$
      and $\pi_*(a_* \mid s)=1$ where $a_*=\argmax\limits_{a\in\mathcal{A}}q_*(s,a)$
    • Note: this is a necessary and sufficient condition for an optimal policy, i.e.,
      $v_\pi(s)=\max\limits_{a\in\mathcal{A}}q_\pi(s,a), \forall s\in\mathcal{S} \iff \pi$ is an optimal policy
  • $q_*(s,a)=\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}{\mathcal{P}^a_{ss'}v_*(s')}$
  • Therefore,
    • $v_*(s)=\max\limits_{a}{\Big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}{\mathcal{P}^a_{ss'}v_*(s')}\Big]}$
    • $q_*(s,a)=\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}{\mathcal{P}^a_{ss'}\max\limits_{a'}{q_*(s',a')}}$
  • The Bellman optimality equation is non-linear and has no closed-form solution
    • Solve with iterative methods (e.g. value iteration, policy iteration, Q-learning, Sarsa...; a value-iteration sketch follows this list)
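One of those iterative methods, value iteration, simply turns the Bellman optimality equation into a repeated update. A minimal sketch, assuming the same made-up `P[a, s, s']` / `R[a, s]` layout used earlier:

```python
import numpy as np

def value_iteration(P, R, gamma, n_iters=1000, tol=1e-8):
    """Repeatedly apply v(s) <- max_a [ R^a_s + gamma * sum_s' P^a_{ss'} v(s') ]."""
    v = np.zeros(P.shape[1])
    for _ in range(n_iters):
        q = R + gamma * np.einsum("asx,x->as", P, v)   # q[a, s] under the current v
        v_new = q.max(axis=0)                          # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:            # stop once the values have converged
            break
        v = v_new
    return v_new, q.argmax(axis=0)                     # v_* and a greedy (optimal) policy

# Made-up 2-action, 3-state MDP (same layout as the earlier sketches).
P = np.array([[[0.8, 0.2, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]],
              [[0.1, 0.9, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]]])
R = np.array([[-1.0, -2.0, 0.0],
              [-2.0,  1.0, 0.0]])
v_star, a_star = value_iteration(P, R, gamma=0.9)
print(v_star, a_star)
```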

Example: Bellman Optimality Equation in Student MDP

Image from: here

Extensions to MDPs

POMDPs

A Partially Observable Markov Decision Process is an MDP with hidden states. It is a hidden Markov model with actions.
A POMDP is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma\rangle$

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{A}$ is a finite set of actions
  • $\mathcal{O}$ is a finite set of observations
  • $\mathcal{P}$ is a state transition probability matrix,
    $\mathcal{P}_{ss'}^a=\mathbb{P}[S_{t+1}=s' \mid S_t=s, A_t=a]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s^a=\mathbb{E}[R_{t+1} \mid S_t=s,A_t=a]$
  • $\mathcal{Z}$ is an observation function,
    $\mathcal{Z}^a_{s'o}=\mathbb{P}[O_{t+1}=o \mid S_{t+1}=s',A_t=a]$
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$

Belief States

A history $H_t$ is a sequence of actions, observations and rewards,
$$H_t=A_0,O_1,R_1,\dots,A_{t-1},O_t,R_t$$

A belief state $b(h)$ is a probability distribution over states, conditioned on the history $h$:
$$b(h) = (\mathbb{P}[S_t=s^1 \mid H_t=h], \dots, \mathbb{P}[S_t=s^n \mid H_t=h])$$
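Because the belief state summarizes the history, a POMDP agent can maintain it recursively with a Bayes update after every action–observation pair instead of storing the whole history. A minimal sketch under the tuple above, with made-up array shapes (`P[a, s, s']` for $\mathcal{P}$, `Z[a, s', o]` for $\mathcal{Z}$):

```python
import numpy as np

def belief_update(b, a, o, P, Z):
    """Bayes update of the belief after taking action a and receiving observation o.
    b: (S,) current belief, P[a, s, s'] transition model, Z[a, s', o] observation model."""
    predicted = b @ P[a]                 # sum_s b(s) * P[s' | s, a], shape (S,)
    new_b = Z[a, :, o] * predicted       # weight each s' by P[o | s', a]
    return new_b / new_b.sum()           # normalise (assumes the observation has nonzero probability)

# Tiny made-up POMDP: 2 states, 1 action, 2 observations.
P = np.array([[[0.9, 0.1], [0.2, 0.8]]])     # P[a, s, s']
Z = np.array([[[0.7, 0.3], [0.1, 0.9]]])     # Z[a, s', o]
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, P=P, Z=Z))
```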

Reductions of POMDPs

  • The history $H_t$ satisfies the Markov property
    => a POMDP can be reduced to an (infinite) history tree
  • The belief state $b(H_t)$ satisfies the Markov property
    => a POMDP can be reduced to an (infinite) belief state tree

If you spot any typos or mistakes, I would appreciate it if you let me know in the comments!
