These are my notes on Lecture 3: Planning by Dynamic Programming (Youtube) from Professor David Silver's Introduction to Reinforcement Learning course (Website).
Introduction
What is Dynamic Programming (DP)?
- Dynamic: sequential or temporal component to the problem
- Programming: optimizing a "program", i.e. a policy
- Dynamic Programming: optimization for sequential problems by breaking them down into subproblems
- RL ⊂ DP
Requirements for Dynamic Programming
- 1) Optimal substructure
- Principle of optimality applies
- Principle of optimality: optimal solution can be decomposed into subproblems
- 2) Overlapping subproblems
- Subproblems recur many times
- Solutions can be cached and reused
- MDPs satisfy both of these conditions.
- Bellman equation gives recursive decomposition
$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$
- Value function stores and reuses solutions
Planning by Dynamic Programming
- Recall, planning is not a 'full' RL problem.
- Planning: The environment is fully known
- Reinforcement Learning: The environment is initially unknown
- DP assumes full knowledge of the MDP and is used for planning in the MDP
- Knowing the MDP ⟺ knowing the environment
- Prediction = (Policy) Evaluation
- Given a fixed policy, evaluate how good that policy is.
- Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and policy $\pi$,
  or: MRP $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$
- Output: value function $v_\pi$
- Control
- Find the optimal policy.
- Input: MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- Output: optimal value function $v_*$ and optimal policy $\pi_*$
- "Solving the MDP" usually refers to solving the control problem.
Policy Evaluation
Iterative Policy Evaluation
- Evaluate a given policy π by iterative application of Bellman expectation backup
- This is the process of computing the state-value function $v_\pi$.
- Recall the Bellman expectation equation:
  $$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s,a) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \Big)$$
- In matrix form, $v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi$
  => Direct solution: $v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$
  (The matrix inverse is too expensive to compute for large state spaces, so we compute $v_\pi$ with DP instead!)
- $v_1 \to v_2 \to v_3 \to \dots \to v_\pi$
  Update an arbitrary vector $v_1$ iteratively; it converges to $v_\pi$!
  Note: $v_k = [v_k(1), v_k(2), \dots, v_k(n)]^T$
- $v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big)$
- Applying the Bellman expectation equation iteratively like this gives $v_{k+1}$ from $v_k$! (See the sketch below.)
- Keep going and it converges to $v_\pi$. (The proof comes later, in the contraction mapping section.)
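A minimal sketch of iterative policy evaluation for a tabular MDP, assuming (the array names and shapes are my own, not from the lecture) transition probabilities `P[a, s, s_next]`, expected rewards `R[s, a]`, and a stochastic policy `pi[s, a]` as NumPy arrays:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, n_sweeps=1000):
    """Evaluate policy pi by repeated Bellman expectation backups.

    P:  (n_actions, n_states, n_states) transition probabilities P[a, s, s']
    R:  (n_states, n_actions) expected immediate rewards R[s, a]
    pi: (n_states, n_actions) action probabilities pi(a | s)
    """
    n_states = R.shape[0]
    v = np.zeros(n_states)                     # arbitrary v_1
    for _ in range(n_sweeps):
        q = R + gamma * (P @ v).T              # q[s, a] = R_s^a + gamma * sum_s' P_ss'^a v(s')
        v = (pi * q).sum(axis=1)               # v_{k+1}(s) = sum_a pi(a|s) q(s, a)
    return v
```

For a small MDP the result can be checked against the direct matrix solution, e.g. `np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)`, where `P_pi` and `R_pi` are the policy-averaged transition matrix and reward vector built from `P`, `R`, and `pi`.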
Example: Iterative Policy Evaluation in Small Gridworld
- Rules
  - The agent follows a uniform random policy (an action that would move off the grid leaves the agent where it is).
  - Reaching the top-left or bottom-right cell terminates the episode.
  - Reward is -1 every step until the terminal state is reached
    (so it is best to reach a terminal state as quickly as possible).
  - $\gamma = 1.0$
Image from: here
- Let $s$ be the state marked with the red circle. Under the uniform random policy each of the four actions has probability 1/4, and the action that would leave the grid keeps the agent in place, so
  $$v_2(s) = \mathcal{R}^\pi_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^\pi_{ss'} v_1(s') = (-1.0) + (1.0) \times \Big[ \tfrac{1}{4} \times 0 + \tfrac{3}{4} \times (-1.0) \Big] = -1.75 \approx -1.7$$
  (a quick numerical check follows below)
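A quick check of this value, assuming the standard 4x4 gridworld from the lecture slides (terminals in the top-left and bottom-right corners) and taking the circled state to be the cell immediately to the right of the top-left terminal; both assumptions are mine, read off the figure:

```python
import numpy as np

# Assumed layout: 4x4 grid, states 0..15 row-major, terminals at 0 and 15,
# reward -1 per step, gamma = 1, uniform random policy over {up, down, left, right}.
N, TERMINALS, GAMMA = 4, {0, 15}, 1.0

def next_state(s, a):
    # Deterministic move; stepping off the grid leaves the agent in place.
    r, c = divmod(s, N)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    return min(max(r + dr, 0), N - 1) * N + min(max(c + dc, 0), N - 1)

def sweep(v):
    new_v = np.zeros_like(v)
    for s in range(N * N):
        if s not in TERMINALS:
            # uniform random policy: each action with probability 1/4, reward -1
            new_v[s] = sum(0.25 * (-1.0 + GAMMA * v[next_state(s, a)]) for a in range(4))
    return new_v

v = np.zeros(N * N)   # all zeros
v = sweep(v)          # v_1 in these notes: -1 for every non-terminal state
v = sweep(v)          # v_2 in these notes
print(v[1])           # the cell next to the terminal: -1.75 (shown as -1.7 on the slide)
```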
Policy Iteration
Policy Iteration
- Given a policy $\pi$:
  - Evaluate the policy $\pi$ => $v_\pi(s)$
  - Improve the policy by acting greedily w.r.t. $v_\pi$ => $\pi' = \mathrm{greedy}(v_\pi)$
- Repeating this process converges to $\pi_*$.
Image from: here
- 1) Policy evaluation: iterative policy evaluation
  2) Policy improvement: greedy policy improvement
  (see the sketch below)
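A minimal policy iteration sketch under the same assumed `P`, `R` array layout as the earlier evaluation sketch (illustrative names, not the lecture's code); it alternates full policy evaluation with greedy improvement until the policy stops changing:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_sweeps=200, max_iters=1000):
    """P: (A, S, S) transitions, R: (S, A) rewards. Returns (v, deterministic policy)."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from a uniform policy
    for _ in range(max_iters):
        # 1) policy evaluation: iterative Bellman expectation backups
        v = np.zeros(n_states)
        for _ in range(eval_sweeps):
            q = R + gamma * (P @ v).T                      # q[s, a]
            v = (pi * q).sum(axis=1)
        # 2) policy improvement: act greedily w.r.t. v_pi
        q = R + gamma * (P @ v).T
        new_pi = np.eye(n_actions)[q.argmax(axis=1)]       # one-hot greedy policy
        if np.array_equal(new_pi, pi):                     # policy stable => pi is optimal
            break
        pi = new_pi
    return v, pi.argmax(axis=1)
```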
Policy Improvement
- First, consider a deterministic policy, $a = \pi(s)$ (in state $s$, take action $a$).
- By acting greedily,
$$\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, q_\pi(s,a) = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \Big] = a^*$$
- Then,
$$q_\pi(s, \pi'(s)) = q_\pi(s, a^*) = \max_{a \in \mathcal{A}} q_\pi(s,a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$$
Note: the left-hand side is the value of acting according to $\pi'$ for one step in $s$ and following $\pi$ thereafter.
- It therefore improves the value function
$$\begin{aligned} v_\pi(s) &\le q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \\ &\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] \\ &\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s] \\ &\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots \mid S_t = s] = v_{\pi'}(s) \end{aligned}$$
i.e., the greedily updated policy is at least as good as the previous policy!
- If improvements stop, i.e.,
$$q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s,a) = q_\pi(s, \pi(s)) = v_\pi(s), \quad \forall s \in \mathcal{S}$$
i.e., if improvement no longer changes the action-values (e.g. $\pi'(s) = \pi(s)$),
then the Bellman optimality equation has been satisfied, so $\pi$ is an optimal policy, i.e.,
$$v_\pi(s) = \max_{a \in \mathcal{A}} q_\pi(s,a), \ \forall s \in \mathcal{S} \implies v_\pi(s) = v_*(s), \ \forall s \in \mathcal{S}$$
- Recall the necessary and sufficient condition for an optimal policy (from Lecture 2):
  $$v_\pi(s) = \max_{a \in \mathcal{A}} q_\pi(s,a), \ \forall s \in \mathcal{S} \iff \pi \text{ is an optimal policy}$$
Generalized Policy Iteration
- 1) Policy evaluation: any policy evaluation algorithm
  2) Policy improvement: any policy improvement algorithm
- Policy evaluation does not have to be run all the way to convergence to $v_\pi$!
  => stop after k iterations of iterative policy evaluation
  => note that k = 1 gives value iteration (the policy is updated every time $v$ is updated); see the sketch below
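A generalized policy iteration sketch with truncated (k-sweep) evaluation, again using the assumed `P`, `R` layout from the earlier sketches. The value function is carried across iterations, so with `k=1` each evaluation sweep under the previous greedy policy is exactly a Bellman optimality backup, recovering value iteration:

```python
import numpy as np

def generalized_policy_iteration(P, R, gamma=0.9, k=3, n_iters=500):
    """Truncated policy evaluation (k sweeps) alternating with greedy improvement."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    for _ in range(n_iters):
        for _ in range(k):                                 # truncated policy evaluation
            q = R + gamma * (P @ v).T
            v = (pi * q).sum(axis=1)
        q = R + gamma * (P @ v).T                          # greedy policy improvement
        pi = np.eye(n_actions)[q.argmax(axis=1)]
    return v, pi.argmax(axis=1)
```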
Value Iteration
Principle of Optimality Theorem
- An optimal solution can be decomposed into subproblems!
A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, i.e. $v_\pi(s) = v_*(s)$, if and only if:
for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$, i.e. $v_\pi(s') = v_*(s')$
=> A policy that is optimal from every reachable next state is also optimal from the current state (an optimal first action can then be found).
=> Conversely, for $\pi$ to be optimal from $s$, $\pi$ must also be optimal from $s'$, because $v_\pi(s)$ includes the future rewards.
- c.f. Bellman's Principle of Optimality:
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
Deterministic Value Iteration
- This uses the ⟸ direction of the Principle of Optimality theorem.
- Suppose we know the solutions to the subproblems $v_*(s'), \ \forall s' \in \mathcal{S}'$ ($\mathcal{S}'$ is the set of successor states of $s$).
- Then the solution $v_*(s)$ can be found by a one-step lookahead:
  $$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \Big]$$
- Find optimal policy by iterative application of Bellman optimality backup
- Arbitrarily initialize $v_1$
- $v_{k+1}(s) \leftarrow \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big]$
  Applying this to all states simultaneously and iteratively converges to $v_*$.
  In matrix form, $v_{k+1} = \max_{a \in \mathcal{A}} \big[ \mathcal{R}^a + \gamma \mathcal{P}^a v_k \big]$
- $v_1 \to v_2 \to \dots \to v_*$
- Unlike policy iteration, there is no explicit policy
  Because the update takes a max, every step can be seen as implicitly choosing the greedy action.
  (= policy iteration with a 1-step policy evaluation)
- The greedy policy with respect to $v_*$ is an optimal policy (see the sketch below):
  $$\pi_*(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, q_*(s,a) = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \Big]$$
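A minimal value iteration sketch with the same assumed `P`, `R` layout (illustrative, not the lecture's code), including the greedy policy extraction at the end:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_sweeps=1000, tol=1e-8):
    """P: (A, S, S) transitions, R: (S, A) rewards. Returns (v_star, greedy policy)."""
    n_states = R.shape[0]
    v = np.zeros(n_states)                       # arbitrary v_1
    for _ in range(n_sweeps):
        q = R + gamma * (P @ v).T                # one-step lookahead q[s, a]
        new_v = q.max(axis=1)                    # Bellman optimality backup
        if np.max(np.abs(new_v - v)) < tol:      # stop once the backup barely changes v
            v = new_v
            break
        v = new_v
    pi = (R + gamma * (P @ v).T).argmax(axis=1)  # greedy policy w.r.t. v_star
    return v, pi
```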
Policy Iteration vs Value Iteration
- Policy iteration:
- Policy evaluation
    $$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big)$$
    $v_1 \to v_2 \to \dots \to v_\pi$
  - Policy improvement
    $$\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, q_\pi(s,a) = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \Big] = a^*$$
- Value iteration:
  - $v_{k+1}(s) \leftarrow \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big]$
  - $v_1 \to v_2 \to \dots \to v_*$
- Show that policy iteration with 1-step policy evaluation = value iteration:
  - 1-step policy evaluation
    $$v_k(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_{k-1}(s') \Big)$$
  - Policy improvement
    $$\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, q_\pi(s,a) = \underset{a \in \mathcal{A}}{\operatorname{argmax}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big] = a^*$$
  - Again, 1-step policy evaluation:
    $$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi'(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big) = \mathcal{R}_s^{a^*} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a^*} v_k(s') = \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big]$$
    which is exactly the Bellman optimality backup of value iteration.
Example: Shortest Path
Image from: here
- Let $s_r$ and $s_b$ be the states marked with the red and blue circles, respectively. Then
  - $v_3(s_r) = \max_a \big[ \mathcal{R}_{s_r}^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{s_r s'}^a v_2(s') \big] = -1 + 1.0 \times 0 = -1$
  - $v_3(s_b) = \max_a \big[ \mathcal{R}_{s_b}^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{s_b s'}^a v_2(s') \big] = -1 + 1.0 \times (-1) = -2$
Summary of DP Algorithms
Image from: here
- So far everything has been computed with the state-value function ($v_\pi(s)$ or $v_*(s)$).
  - Complexity: $O(mn^2)$ per iteration, for $m$ actions and $n$ states
- The same logic can also be applied to the action-value function ($q_\pi(s,a)$ or $q_*(s,a)$).
  - Complexity: $O(m^2 n^2)$ per iteration
  - We will use this approach later on.
Extensions to Dynamic Programming
Asynchronous Dynamic Programming
- The algorithms above use synchronous backups (all states are backed up in parallel, once per sweep).
- Asynchronous DP backs up states individually, in any order
  - This can significantly reduce computation.
  - Convergence is still guaranteed as long as every state keeps being selected.
- Three simple ideas for asynchronous DP (they differ in how the states to update are chosen):
- In-place dynamic programming
- Prioritised sweeping
- Real-time dynamic programming
In-Place Dynamic Programming
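Synchronous backups keep two copies of the value function (the new sweep is computed entirely from the old one), whereas in-place value iteration keeps only one copy and immediately reuses the freshest values within the same sweep. A minimal sketch, under the same assumed `P`, `R` layout as the earlier sketches:

```python
import numpy as np

def value_iteration_in_place(P, R, gamma=0.9, n_sweeps=100):
    """In-place backups: a single value array, updated state by state, so later
    states in the same sweep already see the newly backed-up values."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s in range(n_states):
            # one-step lookahead for state s only, using the current (partly updated) v
            v[s] = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_actions))
    return v
```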
Prioritised Sweeping
- Introduced to give in-place DP a sensible update order.
- Use the magnitude of the Bellman error to guide state selection!
  $$|v'(s) - v(s)| = \Big| \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s') \Big] - v(s) \Big|$$
- Back up the state with the largest remaining Bellman error (see the sketch below)
- After each backup, the Bellman errors of the backed-up state's predecessor states have to be refreshed, so knowledge of the reverse dynamics (predecessor states) is needed; this can be implemented efficiently with a priority queue.
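A rough sketch of the priority-queue bookkeeping, again under the assumed `P`, `R` layout; predecessor states are read off the transition tensor (illustrative, not the lecture's pseudocode):

```python
import heapq
import numpy as np

def prioritised_sweeping(P, R, gamma=0.9, n_backups=10_000, theta=1e-6):
    """P: (A, S, S), R: (S, A). Always back up the state with the largest Bellman error."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)

    def bellman_error(s):
        backup = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_actions))
        return abs(backup - v[s]), backup

    # predecessors[t] = states that can reach t under some action (reverse dynamics)
    predecessors = [set(np.nonzero(P[:, :, t].sum(axis=0))[0]) for t in range(n_states)]

    heap = [(-bellman_error(s)[0], s) for s in range(n_states)]   # max-heap via negation
    heapq.heapify(heap)

    for _ in range(n_backups):
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:
            break                                  # every remaining error is tiny
        _, v[s] = bellman_error(s)                 # back up the highest-priority state
        for p in predecessors[s]:                  # its predecessors' errors may have changed
            err, _ = bellman_error(p)
            heapq.heappush(heap, (-err, p))        # stale entries are simply re-popped later
    return v
```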
Real-Time Dynamic Programming
- Update only states that the agent actually visited
- After each time-step $S_t, A_t, R_{t+1}$,
  back up the state $S_t$ (a one-function sketch follows below):
  $$v(S_t) \leftarrow \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_{S_t}^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{S_t s'}^a v(s') \Big]$$
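A one-function sketch of that backup, assuming the same NumPy `P`, `R` layout as before (hypothetical names):

```python
def real_time_dp_backup(v, s_t, P, R, gamma=1.0):
    """Apply one Bellman optimality backup at the state the agent just visited."""
    n_actions = R.shape[1]
    v[s_t] = max(R[s_t, a] + gamma * P[a, s_t] @ v for a in range(n_actions))
    return v
```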
Sample backups
Full-width Backups
- DP uses full-width backups
- $v_{k+1}(s) = \max_{a \in \mathcal{A}} \Big[ \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \Big]$
- Every successor state and action has to be considered to compute each backup.
- For large problems, DP suffers Bellman's curse of dimensionality
Sample Backups
- Using sample rewards and sample transitions ⟨S,A,R,S′⟩
(instead of reward function R and transition dynamics P)
- Advantages:
- Model-free: no advance knowledge of MDP required
- Breaks the curse of dimensionality through sampling
- We will cover these starting from the next lecture!
Contraction Mapping
Contraction Mapping Theorem
For any metric space V that is complete (i.e. closed) under an operator T(v), where T is a γ-contraction, T converges to a unique fixed point at a linear convergence rate of γ
Bellman Expectation Backup is a Contraction
- Define the Bellman expectation backup operator $T^\pi$,
  $$T^\pi(v) = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v$$
- This operator is a $\gamma$-contraction:
  $$\begin{aligned} \|T^\pi(u) - T^\pi(v)\|_\infty &= \|(\mathcal{R}^\pi + \gamma \mathcal{P}^\pi u) - (\mathcal{R}^\pi + \gamma \mathcal{P}^\pi v)\|_\infty \\ &= \|\gamma \mathcal{P}^\pi (u - v)\|_\infty \\ &\le \big\|\gamma \mathcal{P}^\pi \|u - v\|_\infty \big\|_\infty \\ &\le \gamma \|u - v\|_\infty \end{aligned}$$
- This means the operator $T^\pi$ keeps pulling any two vectors $u, v$ closer together.
- In other words, when $0 < \gamma < 1$, repeatedly applying the Bellman expectation backup makes any value-function vector converge to a single point, $v_\pi$ (by the Contraction Mapping Theorem).
- Iterative policy evaluation converges on $v_\pi$! (A small numerical check follows below.)
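A tiny numerical illustration of the contraction property: apply $T^\pi$ to two arbitrary vectors and watch their $\infty$-norm distance shrink by a factor of at least $\gamma$ each step (the random MRP here is made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)   # row-stochastic P^pi
R_pi = rng.random(n)                                                  # arbitrary R^pi

def T_pi(v):
    return R_pi + gamma * P_pi @ v        # Bellman expectation backup

u, v = rng.random(n), rng.random(n)
for _ in range(5):
    print(np.max(np.abs(u - v)))          # distance shrinks by a factor of at least gamma
    u, v = T_pi(u), T_pi(v)
```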
Bellman Optimality Backup is a Contraction
- Define the Bellman optimality backup operator $T^*$,
  $$T^*(v) = \max_{a \in \mathcal{A}} \big[ \mathcal{R}^a + \gamma \mathcal{P}^a v \big]$$
- This operator is a $\gamma$-contraction:
  $$\begin{aligned} \|T^*(u) - T^*(v)\|_\infty &= \Big\|\max_{a \in \mathcal{A}} \big[\mathcal{R}^a + \gamma \mathcal{P}^a u\big] - \max_{a \in \mathcal{A}} \big[\mathcal{R}^a + \gamma \mathcal{P}^a v\big]\Big\|_\infty \\ &\le \max_{a \in \mathcal{A}} \big\|(\mathcal{R}^a + \gamma \mathcal{P}^a u) - (\mathcal{R}^a + \gamma \mathcal{P}^a v)\big\|_\infty \\ &= \max_{a \in \mathcal{A}} \|\gamma \mathcal{P}^a (u - v)\|_\infty \\ &\le \gamma \|u - v\|_\infty \end{aligned}$$
  (using $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$ elementwise)
- Value iteration converges on $v_*$!
If you spot any typos or mistakes, please let me know in the comments!