Silver RL (3) Planning by Dynamic Programming

Sanghyeok Choi·2021년 12월 23일

This post summarizes Lecture 3: Planning by Dynamic Programming (Youtube) from Professor David Silver's Introduction to Reinforcement Learning (Website).

Introduction

What is Dynamic Programming (DP)?

  • Dynamic: sequential or temporal component to the problem
  • Programming: optimizing a "program", i.e. a policy
  • Dynamic Programming: optimization for sequential problems, solved by breaking them down into subproblems
  • RL $\subset$ DP

Requirements for Dynamic Programming

  • 1) Optimal substructure
    • Principle of optimality applies
    • Principle of optimality: optimal solution can be decomposed into subproblems
  • 2) Overlapping subproblems
    • Subproblems recur many times
    • Solutions can be cached and reused
  • An MDP satisfies both properties:
    • Bellman equation gives recursive decomposition
      $v(s)=\mathbb{E}[R_{t+1}+\gamma v(S_{t+1})\mid S_t=s]$
    • Value function stores and reuses solutions

Planning by Dynamic Programming

  • Recall, planning is not a 'full' RL problem.
    • Planning: The environment is fully known
    • Reinforcement Learning: The environment is initially unknown
  • DP assumes full knowledge of the MDP and is used for planning in the MDP
    • Knowing the MDP $\iff$ knowing the environment
  • Prediction = (Policy) Evaluation
    • Given a fixed policy, evaluate how good that policy is.
    • Input: MDP $\langle \mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma \rangle$ and policy $\pi$
          or: MRP $\langle \mathcal{S},\mathcal{P}^\pi,\mathcal{R}^\pi,\gamma\rangle$
    • Output: value function $v_\pi$
  • Control
    • Find the optimal policy.
    • Input: MDP $\langle \mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma \rangle$
    • Output: optimal value function $v_*$ and optimal policy $\pi_*$
  • "Solving the MDP" usually means solving the control problem.

Policy Evaluation

Iterative Policy Evaluation

  • Evaluate a given policy $\pi$ by iterative application of the Bellman expectation backup
    • This is the process of computing the state-value function $v_\pi$
    • Recall the Bellman expectation equation:
      $v_{\pi}(s)=\mathbb{E}_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1})\mid S_t=s]=\sum\limits_{a\in\mathcal{A}}{\pi(a|s)q_{\pi}(s,a)}=\sum\limits_{a\in\mathcal{A}}{\pi(a|s)\Big(\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^{a}_{ss'}v_{\pi}(s')\Big)}$
    • In matrix form, $\mathrm{v}_\pi=\mathcal{R}^\pi+\gamma\mathcal{P}^\pi\mathrm{v}_\pi$
      => Direct solution: $\mathrm{v}_\pi=(I-\gamma\mathcal{P}^\pi)^{-1}\mathcal{R}^\pi$
      (The matrix inverse is too expensive to compute for large state spaces, so we compute $\mathrm{v}_\pi$ with DP instead!)
  • $\mathrm{v}_1\to\mathrm{v}_2\to\mathrm{v}_3\to...\to\mathrm{v}_\pi$
    Start from an arbitrary vector $\mathrm{v}_1$ and update it iteratively; it converges to $\mathrm{v}_\pi$!
    Note: $\mathrm{v}_k=[v_k(1), v_k(2), ..., v_k(n)]^T$
  • $v_{k+1}(s)=\sum\limits_{a\in\mathcal{A}}\pi(a|s)\Big(\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}{\mathcal{P}^a_{ss'}v_k(s')}\Big)$
    • Applying the Bellman expectation equation iteratively in this way takes us from $\mathrm{v}_k$ to $\mathrm{v}_{k+1}$!
    • Repeating this converges to $\mathrm{v}_\pi$. (The proof comes later in this post.) A minimal NumPy sketch follows below.
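
The following is a minimal NumPy sketch of policy evaluation, showing both the iterative Bellman expectation backup and, for comparison, the direct linear solve. The array conventions (`P[a, s, s']`, `R[s, a]`, `policy[s, a]`) are hypothetical choices made for this illustration, not anything defined in the lecture.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: v_{k+1} = R^pi + gamma * P^pi v_k.

    Assumed shapes: P[a, s, s'] transition probabilities, R[s, a] expected
    rewards, policy[s, a] action probabilities. The full model is known (planning).
    """
    R_pi = (policy * R).sum(axis=1)             # R^pi(s) = sum_a pi(a|s) R(s,a)
    P_pi = np.einsum("sa,asn->sn", policy, P)   # P^pi(s,s') = sum_a pi(a|s) P(a,s,s')

    v = np.zeros(R_pi.shape[0])                 # arbitrary v_1 (here: all zeros)
    while True:
        v_next = R_pi + gamma * P_pi @ v        # Bellman expectation backup
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next

def policy_evaluation_direct(P, R, policy, gamma=0.9):
    """Direct solution v_pi = (I - gamma P^pi)^{-1} R^pi; only viable for small n."""
    R_pi = (policy * R).sum(axis=1)
    P_pi = np.einsum("sa,asn->sn", policy, P)
    return np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
```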

Example: Iterative Policy Evaluation in Small Gridworld

  • Rules
    • The agent follows a uniform random policy (it cannot move off the grid).
    • The episode terminates when the agent reaches the top-left or bottom-right cell.
    • Reward is -1 every step until the terminal state is reached
      (so reaching a terminal state as quickly as possible is best).
    • $\gamma=1.0$

Image from: here

  • Let s be the state marked with the red circle; then
    $v_2(s)=\mathcal{R}^\pi_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^\pi_{ss'}v_1(s')=(-1.0)+(1.0)\times\big[\tfrac{1}{4}\times0+\tfrac{3}{4}\times(-1.0)\big]=-1.75\approx-1.7$
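
This calculation can be checked with a short script. The sketch below assumes the standard 4x4 layout from the lecture slides (states numbered row-major, terminals in the top-left and bottom-right corners); two synchronous sweeps reproduce $\mathrm{v}_2$, including -1.75 (displayed as -1.7) for the states adjacent to a terminal.

```python
import numpy as np

# Assumed 4x4 gridworld: states 0..15 row-major, terminals at 0 and 15,
# reward -1 per step, gamma = 1, uniform random policy over 4 actions.
N = 4
terminals = {0, N * N - 1}

def step(s, a):
    """Deterministic move; actions that would leave the grid keep the state unchanged."""
    r, c = divmod(s, N)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]  # up, down, left, right
    nr, nc = r + dr, c + dc
    return nr * N + nc if 0 <= nr < N and 0 <= nc < N else s

v = np.zeros(N * N)
for _ in range(2):                                  # two synchronous sweeps: v_0 -> v_1 -> v_2
    v_new = np.zeros_like(v)
    for s in range(N * N):
        if s in terminals:
            continue
        v_new[s] = sum(0.25 * (-1.0 + v[step(s, a)]) for a in range(4))
    v = v_new

print(v.reshape(N, N))                              # states next to a terminal: -1.75 (~ -1.7)
```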

Policy Iteration

Policy Iteration

  • Given a policy $\pi$
    • Evaluate the policy $\pi$ => $\mathrm{v}_\pi(s)$
    • Improve the policy by acting greedily w.r.t. $\mathrm{v}_\pi$ => $\pi'=greedy(\mathrm{v}_\pi)$
  • Repeating this process converges to $\pi^*$.

Image from: here

  • 1) Policy evaluation : Iterative policy evaluation
    2) Policy improvement : Greedy policy improvement
    (A NumPy sketch of this loop follows below.)
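
Here is a compact sketch of the full policy iteration loop. As before, the array conventions `P[a, s, s']` and `R[s, a]` are assumptions made for illustration; the inner loop is iterative policy evaluation and the outer step is greedy improvement.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_tol=1e-8):
    """Policy iteration: iterative policy evaluation + greedy policy improvement.

    Assumed shapes: P[a, s, s'] transition probabilities, R[s, a] expected rewards.
    """
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)                   # arbitrary deterministic starting policy

    while True:
        # 1) Policy evaluation: v <- v_pi for the current (deterministic) policy
        v = np.zeros(n_states)
        while True:
            v_next = R[np.arange(n_states), pi] + gamma * P[pi, np.arange(n_states)] @ v
            converged = np.max(np.abs(v_next - v)) < eval_tol
            v = v_next
            if converged:
                break

        # 2) Policy improvement: act greedily with respect to v_pi
        q = R.T + gamma * np.einsum("asn,n->as", P, v)   # q[a, s] = R(s,a) + gamma * sum_s' P v
        pi_new = q.argmax(axis=0)
        if np.array_equal(pi_new, pi):                   # improvement stopped => pi is optimal
            return v, pi
        pi = pi_new
```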

Policy Improvement

  • First, consider a deterministic policy, $a=\pi(s)$ ... in state s the agent takes action a.
  • By acting greedily,
    $\pi'(s)=\argmax\limits_{a\in\mathcal{A}}{q_\pi(s,a)}=\argmax\limits_{a\in\mathcal{A}}\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_\pi(s')\big]=a_*$
  • Then,
    $q_\pi(s,\pi'(s))=q_\pi(s,a_*)=\max\limits_{a\in\mathcal{A}}{q_\pi(s,a)} \geq q_\pi(s,\pi(s))=v_\pi(s)$
    Note: the left-hand side is the value of taking one action according to $\pi'$ in state s and following $\pi$ thereafter.
  • It therefore improves the value function
    $v_\pi(s) \leq q_\pi(s,\pi'(s))=\mathbb{E}_{\pi'}[R_{t+1}+\gamma v_\pi(S_{t+1})\mid S_t=s]\\
    \leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma q_\pi(S_{t+1}, \pi'(S_{t+1}))\mid S_t=s]\\
    \leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2 q_\pi(S_{t+2},\pi'(S_{t+2}))\mid S_t=s]\\
    \leq \mathbb{E}_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\gamma^3 R_{t+4}+...\mid S_t=s]=v_{\pi'}(s)$
    i.e., the greedily updated policy is at least as good as the previous policy!
  • If improvements stop, i.e.,
    $q_\pi(s,\pi'(s))=\max\limits_{a\in\mathcal{A}}{q_\pi(s,a)}=q_\pi(s,\pi(s))=v_\pi(s), \forall s\in\mathcal{S}$
    i.e., the case where improvement leaves the action-values unchanged (e.g. $\pi'(s)=\pi(s)$),
    then the Bellman optimality equation has been satisfied, so $\pi$ is an optimal policy, i.e.,
    $v_\pi(s)=\max\limits_{a\in\mathcal{A}}{q_\pi(s,a)}, \forall s\in\mathcal{S} \implies v_\pi(s)=v_*(s), \forall s\in\mathcal{S}$
    • Recall the necessary and sufficient condition for an optimal policy (Lecture 2):
      $v_\pi(s)=\max\limits_{a\in\mathcal{A}}q_\pi(s,a), \forall s\in\mathcal{S} \iff \pi$ is an optimal policy

Generalized Policy Iteration

  • 1) Policy evaluation : Any policy evaluation algorithm
    2) Policy improvement : Any policy improvement algorithm

  • Policy evaluation does not need to be run all the way to convergence to $\mathrm{v}_\pi$!
    => stop after k iterations of iterative policy evaluation!
    => Note that k = 1 gives value iteration (the policy is updated after every update of $\mathrm{v}$).

Value Iteration

Principle of Optimality Theorem

  • An optimal solution can be decomposed into subproblems!

A policy $\pi(a|s)$ achieves the optimal value from state s, i.e., $v_\pi(s)=v_*(s)$, if and only if:
for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$, i.e., $v_\pi(s')=v_*(s')$
=> A policy that is optimal from every reachable next state is also optimal from the current state (we can then pick an optimal first action).
=> Conversely, for $\pi$ to be optimal at s, it must also be optimal at every $s'$ (because $v_\pi(s)$ includes the future rewards).

  • c.f. Bellman's Principle of Optimality:
    An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Deterministic Value Iteration

  • Uses the $\impliedby$ direction of the Principle of Optimality Theorem.
    • Suppose we know the solution to the subproblems, $v_*(s')$ for all $s' \in S'$ (where $S'$ is the set of successor states of $s$).
    • Then the solution $v_*(s)$ can be found by one-step lookahead:
      $v_*(s)\gets\max\limits_{a\in\mathcal{A}}{\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}{\mathcal{P}^a_{ss'}v_*(s')}\big]}$
  • Find optimal policy by iterative application of Bellman optimality backup
    • Arbitrarily initialize $\mathrm{v}_1$
    • $v_{k+1}(s) \gets \max\limits_{a\in\mathcal{A}}\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_k(s')\big]$
      Applying this to all states simultaneously and iteratively converges to $\mathrm{v}_*$.
      In matrix form, $\mathrm{v}_{k+1}=\max\limits_{a\in\mathcal{A}}{[\mathcal{R}^a+\gamma\mathcal{P}^a\mathrm{v}_k]}$
    • $\mathrm{v}_1 \to \mathrm{v}_2 \to ... \to \mathrm{v_*}$
    • Unlike policy iteration, there is no explicit policy
      Because each update takes a max, every step can be seen as selecting the greedy action.
      (= policy iteration with 1-step policy evaluation; see the sketch after this list)
    • The greedy policy w.r.t. $\mathrm{v}_*$ is an optimal policy:
      $\pi^*(s)=\argmax\limits_{a\in\mathcal{A}}{q_*(s,a)}=\argmax\limits_{a\in\mathcal{A}}\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_*(s')\big]$
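
A minimal sketch of value iteration under the same assumed `P[a, s, s']` / `R[s, a]` conventions; note that no explicit policy is maintained during the iteration, and a greedy policy is extracted only at the end.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration: v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(a,s,s') v_k(s') ].

    Assumed shapes: P[a, s, s'] transition probabilities, R[s, a] expected rewards.
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                              # arbitrary v_1
    while True:
        q = R.T + gamma * np.einsum("asn,n->as", P, v)  # one-step lookahead, q[a, s]
        v_next = q.max(axis=0)                          # Bellman optimality backup
        if np.max(np.abs(v_next - v)) < tol:
            break
        v = v_next
    pi_star = q.argmax(axis=0)                          # greedy policy w.r.t. v_*
    return v_next, pi_star
```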

Policy Iteration vs Value Iteration

  • Policy iteration:
    • Policy evaluation
      $v_{k+1}(s)=\sum\limits_{a\in\mathcal{A}}\pi(a|s)\Big(\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}{\mathcal{P}^a_{ss'}v_k(s')}\Big)$
      $\mathrm{v}_1 \to \mathrm{v}_2 \to ... \to \mathrm{v_\pi}$
    • Policy improvement
      $\pi'(s)=\argmax\limits_{a\in\mathcal{A}}{q_\pi(s,a)}=\argmax\limits_{a\in\mathcal{A}}\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_\pi(s')\big]=a_*$
  • Value iteration:
    • $v_{k+1}(s) \gets \max\limits_{a\in\mathcal{A}}\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_k(s')\big]$
    • $\mathrm{v}_1 \to \mathrm{v}_2 \to ... \to \mathrm{v_*}$
  • Show policy iteration with 1 step policy evaluation = value iteration:
    • 1 step policy evaluation
      $v_{k}(s)=\sum\limits_{a\in\mathcal{A}}\pi(a|s)\Big(\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}{\mathcal{P}^a_{ss'}v_{k-1}(s')}\Big)$
    • Policy improvement
      $\pi'(s)=\argmax\limits_{a\in\mathcal{A}}{q_\pi(s,a)}=\argmax\limits_{a\in\mathcal{A}}\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_{k}(s')\big]=a_*$
    • Again, 1 step policy evaluation:
      $v_{k+1}(s)=\sum\limits_{a\in\mathcal{A}}\pi'(a|s)\Big(\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}{\mathcal{P}^a_{ss'}v_k(s')}\Big)=\mathcal{R}^{a_*}_s+\gamma\sum\limits_{s'\in\mathcal{S}}{\mathcal{P}^{a_*}_{ss'}v_k(s')}=\max\limits_{a\in\mathcal{A}}\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_k(s')\big]$

Example: Shortest Path

Image from: here

  • Let $s_r$ and $s_b$ denote the states marked with the red and blue circles, respectively; then
    • $v_{3}(s_r)=\max\limits_{a}{\big[\mathcal{R}^a_{s_r}+\gamma\sum\limits_{s'\in\mathcal{S}}{\mathcal{P}^a_{s_rs'}v_2(s')}\big]}=-1+1.0\times0=-1$
    • $v_{3}(s_b)=\max\limits_{a}{\big[\mathcal{R}^a_{s_b}+\gamma\sum\limits_{s'\in\mathcal{S}}{\mathcal{P}^a_{s_bs'}v_2(s')}\big]}=-1+1.0\times(-1)=-2$

Summary of DP Algorithms

Image from: here

  • So far everything has been computed with the state-value function ($v_\pi(s)$ or $v_*(s)$).
    • Complexity: $O(mn^2)$ per iteration, for m actions and n states
  • The same logic can also be applied to the action-value function ($q_\pi(s,a)$ or $q_*(s,a)$).
    • Complexity: $O(m^2n^2)$ per iteration
    • We will use that approach later.

Extensions to Dynamic Programming

Asynchronous Dynamic Programming

  • The methods described so far use synchronous backups (all states are updated at once).
  • Asynchronous DP backs up states individually
    • This can reduce computation.
    • Convergence is still guaranteed as long as every state continues to be selected.
  • Three simple ideas for asynchronous DP (they differ in how the state to update is chosen):
    • In-place dynamic programming
    • Prioritised sweeping
    • Real-time dynamic programming

In-Place Dynamic Programming

  • Synchronous value iteration:

    until $\mathrm{v}$ converges
        for $s$ in $\mathcal{S}$
            $v_{new}(s) \gets \max\limits_{a\in\mathcal{A}}{\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_{old}(s')\big]}$
        $\mathrm{v}_{old} \gets \mathrm{v}_{new}$

  • In-place value iteration

    until $\mathrm{v}$ converges
        for $s$ in $\mathcal{S}$
            $v(s) \gets \max\limits_{a\in\mathcal{A}}{\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v(s')\big]}$

    • Instead of keeping a separate buffer, a single $\mathrm{v}$ is repeatedly updated in place over all states.
    • Depending on the order in which states are updated, some updates may not change anything at all (the sweep order matters).
    • In general, it converges faster! (A runnable sketch follows below.)
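
For concreteness, here is a runnable version of the in-place variant above, using the same assumed `P[a, s, s']` / `R[s, a]` arrays as in the earlier sketches:

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, tol=1e-8):
    """In-place value iteration: one value array is overwritten state by state,
    so later states in the same sweep already see the freshly updated values."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):                   # sweep order matters here
            backup = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_actions))
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup                           # overwrite immediately (no v_old buffer)
        if delta < tol:
            return v
```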

Prioritised Sweeping

  • Introduced to give in-place DP a sensible ordering of updates.
  • Use magnitude of Bellman error to guide state selection!
    $|v'(s)-v(s)|=\Big|\max\limits_{a\in\mathcal{A}}{\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v(s')\big]} - v(s)\Big|$
  • Backup the state with the largest remaining Bellman error
  • Each backup requires keeping track of the predecessor states, so that their Bellman errors can be updated (knowledge of the reverse dynamics). A sketch follows below.
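
Below is a sketch of prioritised sweeping with a priority queue, keeping the same assumed model arrays. The lazy-deletion heap and the `theta` threshold are implementation choices made for this illustration, not prescribed by the lecture.

```python
import heapq
import numpy as np

def prioritised_sweeping(P, R, gamma=0.9, theta=1e-6, max_backups=100_000):
    """Always back up the state with the largest remaining Bellman error,
    then re-prioritise its predecessors (requires reverse dynamics)."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    # predecessors[s'] = states s with some action leading to s'
    predecessors = [set(np.nonzero(P[:, :, sp].sum(axis=0))[0]) for sp in range(n_states)]

    def bellman_error(s):
        backup = max(R[s, a] + gamma * P[a, s] @ v for a in range(n_actions))
        return abs(backup - v[s]), backup

    heap = [(-bellman_error(s)[0], s) for s in range(n_states)]
    heapq.heapify(heap)

    for _ in range(max_backups):
        if not heap:
            break                                   # all remaining Bellman errors are below theta
        _, s = heapq.heappop(heap)
        err, backup = bellman_error(s)              # recompute: queued priorities may be stale
        if err < theta:
            continue
        v[s] = backup                               # back up the highest-error state
        for p in predecessors[s]:                   # its predecessors' errors may have grown
            p_err, _ = bellman_error(p)
            if p_err > theta:
                heapq.heappush(heap, (-p_err, p))
    return v
```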

Real-Time Dynamic Programming

  • Update only states that the agent actually visited
  • After each time-step $S_t, A_t, R_{t+1}$
    backup the state $S_t$:
    $v(S_t) \gets \max\limits_{a\in\mathcal{A}}\big[\mathcal{R}^a_{S_t}+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{S_ts'}v(s')\big]$

Sample backups

Full-width Backups

  • DP uses full-width backups
    • $v_{k+1}(s)=\max\limits_{a\in\mathcal{A}}\big[\mathcal{R}^a_s+\gamma\sum\limits_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_k(s')\big]$
  • Every action and every successor state must be considered to compute the max.
  • For large problems, DP suffers Bellman's curse of dimensionality

Sample Backups

  • Using sample rewards and sample transitions $\langle S, A, R, S'\rangle$
    (instead of reward function $\mathcal{R}$ and transition dynamics $\mathcal{P}$)
  • Advantages:
    • Model-free: no advance knowledge of MDP required
    • Breaks the curse of dimensionality through sampling
  • We will start learning these methods next lecture!

Contraction Mapping

Contraction Mapping Theorem

For any metric space $\mathcal{V}$ that is complete (i.e. closed) under an operator $T(\mathrm{v})$, where $T$ is a $\gamma$-contraction, $T$ converges to a unique fixed point at a linear convergence rate of $\gamma$

Bellman Expectation Backup is a Contraction

  • Define the Bellman expectation backup operator $T^\pi$,
    $T^\pi(\mathrm{v})=\mathcal{R}^\pi+\gamma\mathcal{P}^\pi\mathrm{v}$
  • This operator is a $\gamma$-contraction
    $\| T^\pi(\mathrm{u})-T^\pi(\mathrm{v}) \|_\infty = \| (\mathcal{R}^\pi+\gamma\mathcal{P}^\pi\mathrm{u}) - (\mathcal{R}^\pi+\gamma\mathcal{P}^\pi \mathrm{v})\|_\infty = \| \gamma\mathcal{P}^\pi(\mathrm{u}-\mathrm{v})\|_\infty \leq \gamma\|\mathrm{u}-\mathrm{v}\|_\infty$
    (the last step holds because each row of $\mathcal{P}^\pi$ sums to 1)
  • This means that applying the operator $T^\pi$ brings any two vectors $\mathrm{u}, \mathrm{v}$ closer together.
  • Hence (for $0<\gamma<1$), repeated Bellman expectation backups make any value function vector converge to a single point, $\mathrm{v}_\pi$ ($\because$ Contraction Mapping Theorem).
  • Iterative policy evaluation converges! (to $\mathrm{v}_\pi$) A small numerical check follows below.
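
As a quick numerical sanity check of the contraction property, the snippet below builds a random (hypothetical) MRP induced by some policy and verifies both $\|T^\pi \mathrm{u}-T^\pi \mathrm{v}\|_\infty \leq \gamma\|\mathrm{u}-\mathrm{v}\|_\infty$ and convergence of repeated backups to the fixed point $\mathrm{v}_\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9

# Random row-stochastic P^pi and reward vector R^pi (a made-up MRP for the check)
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)
R_pi = rng.normal(size=n_states)

T = lambda v: R_pi + gamma * P_pi @ v               # Bellman expectation backup operator T^pi

u, v = rng.normal(size=n_states), rng.normal(size=n_states)
lhs = np.max(np.abs(T(u) - T(v)))                   # ||T u - T v||_inf
rhs = gamma * np.max(np.abs(u - v))                 # gamma * ||u - v||_inf
print(lhs <= rhs)                                   # True: T^pi is a gamma-contraction

# Repeated application converges to the unique fixed point v_pi = (I - gamma P^pi)^{-1} R^pi
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
w = np.zeros(n_states)
for _ in range(500):
    w = T(w)
print(np.max(np.abs(w - v_pi)) < 1e-8)              # True
```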

Bellman Optimality Backup is a Contraction

  • Define the Bellman optimality backup operator $T^*$,
    $T^*(\mathrm{v})=\max\limits_{a\in\mathcal{A}}[\mathcal{R}^a+\gamma\mathcal{P}^a\mathrm{v}]$
  • This operator is a $\gamma$-contraction
    $\| T^*(\mathrm{u})-T^*(\mathrm{v}) \|_\infty=\| \max\limits_{a\in\mathcal{A}}[\mathcal{R}^a+\gamma\mathcal{P}^a\mathrm{u}] - \max\limits_{a\in\mathcal{A}}[\mathcal{R}^a+\gamma\mathcal{P}^a\mathrm{v}] \|_\infty \leq \max\limits_{a\in\mathcal{A}}\| (\mathcal{R}^a+\gamma\mathcal{P}^a\mathrm{u}) - (\mathcal{R}^a+\gamma\mathcal{P}^a\mathrm{v}) \|_\infty = \max\limits_{a\in\mathcal{A}}\|\gamma\mathcal{P}^a(\mathrm{u}-\mathrm{v})\|_\infty \leq \gamma\|\mathrm{u}-\mathrm{v}\|_\infty$
    (using $|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a)-g(a)|$)
  • Value iteration converges! (to $\mathrm{v}_*$)

If you spot any typos or mistakes, please let me know in the comments!
