These notes summarize Lecture 2: Markov Decision Process (Youtube) of Professor David Silver's Introduction to Reinforcement Learning (Website).
Markov Process
Introduction
- A Markov decision process (MDP) describes an environment for RL
- Environment is fully observable => MDP
c.f. partially observable => POMDP
- Most RL problems can be formalised as MDPs (even POMDPs can be converted into MDPs!)
Markov property
A state $S_t$ is Markov if and only if:
$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t]$
- Knowing the current state is enough to predict the future! (the history may be thrown away)
State Transition Matrix
- The state transition probability: $P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
- State transition matrix P defines transition probabilities from all states s to all successor states s'.
- $P = \begin{bmatrix} P_{11} & \dots & P_{1n} \\ \vdots & \ddots & \vdots \\ P_{n1} & \dots & P_{nn} \end{bmatrix}$
- For now, assume that $P$ is known.
Markov Process
- A Markov process is a sequence of random states $S_1, S_2, \dots$ with the Markov property.
A Markov Process (a.k.a. Markov Chain) is a tuple ⟨S,P⟩
- S is a finite set of states
- P is a state transition probability matrix,
$P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
- Note: a Markov process only describes transitions between states (a small sampling sketch follows below).
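A minimal sketch of a Markov chain ⟨S,P⟩ in code, assuming we represent $P$ as a row-stochastic numpy array; the three states and the numbers are made up for illustration (this is not the Student example from the lecture):

```python
import numpy as np

# Hypothetical 3-state Markov chain <S, P>; the numbers are made up for illustration.
states = ["A", "B", "C"]
P = np.array([
    [0.5, 0.5, 0.0],   # transitions out of A
    [0.2, 0.0, 0.8],   # transitions out of B
    [0.0, 0.0, 1.0],   # C is absorbing (a terminal state)
])

def sample_chain(P, start=0, steps=10, seed=0):
    """Sample a trajectory S_1, S_2, ... using only the current state (Markov property)."""
    rng = np.random.default_rng(seed)
    s, path = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])   # the next state depends only on the current state
        path.append(s)
    return path

print([states[s] for s in sample_chain(P)])
```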
Markov Reward Process (MRP)
MRP
- A Markov reward process is a Markov process with values.
A Markov Reward Process is a tuple ⟨S,P,R,γ⟩
- S is a finite set of states
- P is a state transition probability matrix,
$P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$
- R is a reward function, $R_s = E[R_{t+1} \mid S_t = s]$ ... the expected immediate reward at the current state
- γ is a discount factor, $\gamma \in [0, 1]$
- If the environment is fully observable, then $R$ is known!
Return
The return $G_t$ is the total discounted reward from time-step $t$:
$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- Note: each $R_t$ is a random variable, so $G_t$ is also a random variable.
- Why discount?
- It is mathematically convenient.
- It avoids infinite returns in Markov processes with cycles.
- If the Markov process is guaranteed to terminate, an undiscounted MRP ($\gamma = 1$) may be used.
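A quick numerical sketch of the return defined above; the reward sequence and the discount factor are made up for illustration:

```python
# Discounted return G_t = R_{t+1} + gamma * R_{t+2} + ... for a made-up reward sequence.
rewards = [-2, -2, -2, 10]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
gamma = 0.9

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)   # -2 - 1.8 - 1.62 + 7.29 = 1.87
```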
Value Function
The state-value function $v(s)$ of an MRP is the expected return starting from state $s$:
$v(s) = E[G_t \mid S_t = s]$
- Assuming the transition matrix $P$ is known, and since full observability means the reward at each state is also known, $v(s)$ can be computed!
Example: Student MRP
Student MRP with $\gamma = 0$ (image from: here)
- Since $\gamma = 0$, $v(s)$ is just the expected immediate reward at $s$!
- In this example the reward is received on leaving a state, regardless of the destination. (That is in fact why it is also independent of $P$.)
Student MRP with $\gamma = 0.9$ (image from: here)
- Let $s'$ be the second state from the right.
- $v(s') = -2 + 0.9 \times (0.6 \times 10 + 0.4 \times 1.9) = -2 + 6.084 = 4.084 \approx 4.1$
- By the Bellman equation, $v(s')$ can be computed as above.
Bellman Equation for MRPs
- $v(s) = E[G_t \mid S_t = s]$
  $= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$
  $= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \mid S_t = s]$
  $= E[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$
  $= E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$
- $v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} v(s')$
  Recall, $R_s = E[R_{t+1} \mid S_t = s]$
- Writing this equation for every state $s$ gives a linear system that can be solved directly for $v(s)$. (This requires the assumption that the reward function does not change over time.)
- Matrix form: $v = R + \gamma P v$, where $v$ is a column vector with one entry per state (image from: here)
- $v = (I - \gamma P)^{-1} R$
  The computational complexity is $O(n^3)$ ... not a practical solution for large $n$!
  => iterative methods! (e.g. Dynamic programming, Monte-Carlo evaluation, Temporal-Difference learning; a small example follows below)
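A minimal sketch of both approaches for an MRP, assuming numpy arrays for $P$ and $R$; the 3-state MRP below is made up for illustration (it is not the Student MRP from the slides):

```python
import numpy as np

# Hypothetical 3-state MRP <S, P, R, gamma>; the numbers are made up for illustration.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.0, 0.8],
    [0.0, 0.0, 1.0],   # terminal state loops onto itself with reward 0
])
R = np.array([-1.0, -2.0, 0.0])   # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Direct solution v = (I - gamma P)^{-1} R -- O(n^3), only feasible for small n.
v_direct = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Iterative alternative: repeatedly apply the Bellman backup v <- R + gamma P v.
v_iter = np.zeros(len(R))
for _ in range(1000):
    v_iter = R + gamma * P @ v_iter

print(v_direct, v_iter)   # the two estimates agree up to numerical tolerance
```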
Markov Decision Process (MDP)
MDP
- A Markov decision process is a Markov reward process with decisions.
- It is an environment in which all states are Markov.
A Markov Decision Process is a tuple ⟨S,A,P,R,γ⟩
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix,
$P_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
- R is a reward function, $R_s^a = E[R_{t+1} \mid S_t = s, A_t = a]$
- γ is a discount factor, $\gamma \in [0, 1]$
- In an MRP, the transitions and rewards were determined by the state alone (no choice for the agent).
- In an MDP, the transitions and rewards are determined by the agent's action in each state (a small array encoding is sketched below).
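One common way to encode the MDP tuple in code is with an array P[a, s, s'] for the transition probabilities and R[s, a] for the expected rewards; a minimal sketch with made-up numbers (not the Student MDP):

```python
import numpy as np

# Hypothetical MDP <S, A, P, R, gamma> with 2 states and 2 actions; numbers are made up.
# P[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]; each row P[a, s, :] sums to 1.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])

# R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([
    [1.0, -1.0],   # state 0
    [0.0,  2.0],   # state 1
])

gamma = 0.9
assert np.allclose(P.sum(axis=2), 1.0)   # sanity check: valid transition probabilities
```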
Example: Student MDP
Image from: here
- When the agent is at the middle-left node, that node is the state, and the available actions are Facebook and Study.
- In this MDP the transitions and rewards for these actions are deterministic: choosing Facebook gives a reward of -1 and moves to the upper node, while choosing Study gives a reward of -2 and moves to the right. (How the agent chooses is the policy!)
- The point at the bottom shows that choosing Pub leads to a random transition according to the given probabilities.
Policies
- A policy fully defines the behaviour of an agent.
A policy π is a distribution over actions given states,
$\pi(a \mid s) = P[A_t = a \mid S_t = s]$
- The probability of taking action $a$ in state $s$ (the action is defined stochastically).
- MDP policies depend on the current state (not the history!) => policies are stationary (time-independent).
- For a fixed policy $\pi$, $P^\pi$ and $R^\pi$ are determined,
where
- $P_{ss'}^\pi = \sum_{a \in A} \pi(a \mid s) P_{ss'}^a$
  => the probability of transitioning from $s$ to $s'$ when following policy $\pi$. ($P_{ss'}^a$ is fixed by the environment!)
- $R_s^\pi = \sum_{a \in A} \pi(a \mid s) R_s^a$
  => the expected reward obtained at $s$ (on leaving $s$) when following policy $\pi$. (Likewise, $R_s^a$ is fixed by the environment!)
- $S_1, S_2, \dots$ is a Markov process $\langle S, P^\pi \rangle$
  $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle S, P^\pi, R^\pi, \gamma \rangle$
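A minimal sketch of collapsing an MDP under a fixed policy into the induced MRP $\langle S, P^\pi, R^\pi, \gamma \rangle$, assuming the array layout P[a, s, s'], R[s, a] from the earlier sketch; the MDP and policy numbers are made up:

```python
import numpy as np

# Made-up MDP: P[a, s, s'] and R[s, a], same layout as the earlier sketch.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, -1.0], [0.0, 2.0]])

# A fixed stochastic policy, pi[s, a] = pi(a | s).
pi = np.array([[0.5, 0.5], [0.9, 0.1]])

P_pi = np.einsum("sa,asp->sp", pi, P)   # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
R_pi = (pi * R).sum(axis=1)             # R^pi_s    = sum_a pi(a|s) R^a_s

print(P_pi)   # rows sum to 1: a valid MRP transition matrix
print(R_pi)
```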
Value Function
The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$:
$v_\pi(s) = E_\pi[G_t \mid S_t = s]$
The action-value function $q_\pi(s, a)$ of an MDP is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:
$q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$
- Note: $E_{a \sim \pi(\cdot \mid s)}[q_\pi(s, a) \mid S_t = s] = \sum_{a \in A} \pi(a \mid s) q_\pi(s, a) = v_\pi(s)$
- At each state, compare the action values and take the best action!
Bellman Expectation Equation
- Note: π is given!
- $v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
  $= \sum_{a \in A} \pi(a \mid s) q_\pi(s, a)$
  $= \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s') \right)$ (see below!)
- $q_\pi(s, a) = E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
  $= E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$
  $= R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s')$
  $= R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') q_\pi(s', a')$
- Matrix form: $v_\pi = R^\pi + \gamma P^\pi v_\pi$
  where $R^\pi = [R_1^\pi, \dots, R_n^\pi]^T$ and $(P^\pi)_{ij} = P_{ij}^\pi$
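A minimal sketch of policy evaluation with the Bellman expectation equation, both via the matrix form $v_\pi = (I - \gamma P^\pi)^{-1} R^\pi$ and by iterating the backup; the MDP and policy are the same made-up arrays as in the previous sketches:

```python
import numpy as np

# Made-up MDP and policy, same array layout as the earlier sketches.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[a, s, s']
R = np.array([[1.0, -1.0], [0.0, 2.0]])    # R[s, a]
pi = np.array([[0.5, 0.5], [0.9, 0.1]])    # pi[s, a]
gamma = 0.9

# Induced MRP under pi.
P_pi = np.einsum("sa,asp->sp", pi, P)
R_pi = (pi * R).sum(axis=1)

# 1) Matrix form: v_pi = (I - gamma P^pi)^{-1} R^pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# 2) Iterative policy evaluation: repeatedly apply v <- R^pi + gamma P^pi v
v = np.zeros(2)
for _ in range(1000):
    v = R_pi + gamma * P_pi @ v

# q_pi(s,a) = R^a_s + gamma sum_{s'} P^a_{ss'} v_pi(s'), and v_pi(s) = sum_a pi(a|s) q_pi(s,a)
q_pi = R + gamma * np.einsum("asp,p->sa", P, v_pi)
assert np.allclose((pi * q_pi).sum(axis=1), v_pi)
print(v_pi, v)
```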
Example: State-Value Function for Student MDP
Image from: here
- Let $s$ be the state at the middle-left node.
- $v_\pi(s) = E_\pi[G_t \mid S_t = s] = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] = R_s^\pi + \gamma \sum_{s' \in S} P_{ss'}^\pi v_\pi(s')$
- $R_s^\pi = 0.5 \times (-1) + 0.5 \times (-2) = -1.5$
- $\gamma \sum_{s' \in S} P_{ss'}^\pi v_\pi(s') = 1.0 \times (0.5 \times (-2.3) + 0.5 \times 2.7) = 1.0 \times 0.2 = 0.2$
- $v_\pi(s) = -1.5 + 0.2 = -1.3$
- It can also be computed in another way:
- $v_\pi(s) = \sum_{a \in A} \pi(a \mid s) q_\pi(s, a) = \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s') \right) = 0.5 \times (-1 + 1.0 \times (-2.3)) + 0.5 \times (-2 + 1.0 \times 2.7) = -1.3$
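The same arithmetic, checked in code; the successor values -2.3 and 2.7 and the uniform random policy $\pi(a \mid s) = 0.5$ are read off the figure, with $\gamma = 1$ as in the example:

```python
# Values of the two successor states, read off the Student MDP figure.
v_facebook, v_study = -2.3, 2.7
gamma = 1.0

# Via R^pi and P^pi (uniform random policy, pi(a|s) = 0.5):
R_pi = 0.5 * (-1) + 0.5 * (-2)                             # -1.5
v_s  = R_pi + gamma * (0.5 * v_facebook + 0.5 * v_study)   # -1.3

# Via action values: v_pi(s) = sum_a pi(a|s) q_pi(s, a)
q_facebook = -1 + gamma * v_facebook                       # -3.3
q_study    = -2 + gamma * v_study                          #  0.7
v_s_alt    = 0.5 * q_facebook + 0.5 * q_study              # -1.3

print(v_s, v_s_alt)
```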
Optimal Value Function
The optimal state-value function $v_*(s)$ is the maximum value function over all policies:
$v_*(s) = \max_\pi v_\pi(s)$
The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies:
$q_*(s, a) = \max_\pi q_\pi(s, a)$
- Finding $q_*$ means we have found the optimal policy $\pi$! (because it tells us which action is best in each state)
Optimal Policy
- Def: $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s), \forall s$
- For any MDP,
- There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \geq \pi, \forall \pi$
- All optimal policies achieve the optimal state-value function and action-value function,
i.e., $v_{\pi_*}(s) = v_*(s)$ and $q_{\pi_*}(s, a) = q_*(s, a)$
- Finding the optimal policy:
- A deterministic optimal policy exists for any MDP:
  $\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \operatorname{argmax}_{a \in A} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$
- Recall: knowing $q_*$ is the same as knowing $\pi_*$! (a small extraction sketch follows below)
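A minimal sketch of reading the deterministic optimal policy off a $q_*$ table; the table itself is made up for illustration:

```python
import numpy as np

# Made-up q_*(s, a) table: rows = states, columns = actions.
q_star = np.array([
    [1.0, 3.0],
    [2.5, 0.5],
    [0.0, 0.0],
])

# pi_*(a|s) = 1 if a = argmax_a q_*(s, a), else 0 (ties broken arbitrarily here).
best_actions = q_star.argmax(axis=1)
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(q_star)), best_actions] = 1.0
print(pi_star)
```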
Example: Optimal Policy for Student MDP
Image from: here
- The red arrows indicate the (deterministic) optimal policy!
Bellman Optimality Equation for v∗
- $v_*(s) = \max_a q_*(s, a)$
- $\because$ $v_*(s) = \sum_{a \in A} \pi_*(a \mid s) q_*(s, a)$
  and $\pi_*(a_* \mid s) = 1$ where $a_* = \operatorname{argmax}_{a \in A} q_*(s, a)$
- Note: this is a necessary and sufficient condition for an optimal policy, i.e.,
  $v_\pi(s) = \max_{a \in A} q_\pi(s, a), \forall s \in S \iff \pi$ is an optimal policy
- $q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s')$
- Therefore,
- $v_*(s) = \max_a \left[ R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s') \right]$
- $q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} q_*(s', a')$
- The Bellman optimality equation is non-linear (because of the max) and has no closed-form solution in general.
- Solve with iterative methods (e.g. Value iteration, Policy iteration, Q-learning, Sarsa, ...); a value iteration sketch follows below.
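A minimal sketch of value iteration, i.e. repeatedly applying the Bellman optimality backup $v(s) \leftarrow \max_a [R_s^a + \gamma \sum_{s'} P_{ss'}^a v(s')]$; the MDP is the same made-up 2-state, 2-action example used in the sketches above:

```python
import numpy as np

# Made-up MDP, same layout as before: P[a, s, s'] and R[s, a].
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, -1.0], [0.0, 2.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # q(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v(s')
    q = R + gamma * np.einsum("asp,p->sa", P, v)
    v_new = q.max(axis=1)                      # Bellman optimality backup
    if np.max(np.abs(v_new - v)) < 1e-10:      # stop once v is (numerically) a fixed point
        v = v_new
        break
    v = v_new

q_star = R + gamma * np.einsum("asp,p->sa", P, v)
pi_star = q_star.argmax(axis=1)                # greedy (deterministic) policy w.r.t. q_*
print(v, pi_star)
```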
Example: Bellman Optimality Equation in Student MDP
Image from: here
Extensions to MDPs
POMDPs
A Partially Observable Markov Decision Process is an MDP with hidden states. It is a hidden Markov model with actions.
A POMDP is a tuple ⟨S,A,O,P,R,Z,γ⟩
- S is a finite set of states
- A is a finite set of actions
- O is a finite set of observations
- P is a state transition probability matrix,
$P_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
- R is a reward function, $R_s^a = E[R_{t+1} \mid S_t = s, A_t = a]$
- Z is an observation function,
  $Z_{s'o}^a = P[O_{t+1} = o \mid S_{t+1} = s', A_t = a]$
- γ is a discount factor, $\gamma \in [0, 1]$
Belief States
A history $H_t$ is a sequence of actions, observations and rewards,
$H_t = A_0, O_1, R_1, \dots, A_{t-1}, O_t, R_t$
A belief state $b(h)$ is a probability distribution over states, conditioned on the history $h$:
$b(h) = (P[S_t = s^1 \mid H_t = h], \dots, P[S_t = s^n \mid H_t = h])$
Reductions of POMDPs
- The history $H_t$ satisfies the Markov property
  => A POMDP can be reduced to an (infinite) history tree
- The belief state $b(H_t)$ satisfies the Markov property
  => A POMDP can be reduced to an (infinite) belief state tree
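The lecture does not spell out how the belief is maintained, but as a supplementary sketch (an assumption on my part, not from the slides), the belief state can be updated recursively with a standard Bayes filter using the $P$ and $Z$ defined above; the 2-state POMDP numbers below are made up:

```python
import numpy as np

def belief_update(b, a, o, P, Z):
    """One step of the standard Bayes-filter belief update for a POMDP.

    b: current belief over states, shape (n_states,)
    a: action taken, o: observation received afterwards
    P[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
    Z[a, s', o] = P[O_{t+1} = o | S_{t+1} = s', A_t = a]
    """
    predicted = b @ P[a]                     # predict: push the belief through the transition model
    unnormalised = Z[a, :, o] * predicted    # correct: weight by the observation likelihood
    return unnormalised / unnormalised.sum()

# Made-up 2-state, 1-action, 2-observation POMDP for illustration.
P = np.array([[[0.7, 0.3], [0.2, 0.8]]])   # P[a, s, s']
Z = np.array([[[0.9, 0.1], [0.4, 0.6]]])   # Z[a, s', o]
b = np.array([0.5, 0.5])                    # initial belief
print(belief_update(b, a=0, o=1, P=P, Z=Z))
```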
If you spot any typos or mistakes, please let me know in the comments!