The Bellman Equation

chchch · November 8, 2021

The material below is my summary of the derivations from the book 수학으로 풀어보는 강화학습 원리와 알고리즘 (Reinforcement Learning Principles and Algorithms Explained with Mathematics) by Professor Sung-soo Park of Sejong University, rewritten only with different notation. The book's exposition and explanations are excellent, so I encourage anyone interested in reinforcement learning to buy and read it.


Value Functions

Let $G_t$ denote the discounted sum of rewards received from time $t$ onward.

$$G_t = r(s_t, a_t) + \gamma r(s_{t+1}, a_{t+1}) + \gamma^2 r(s_{t+2}, a_{t+2}) + \cdots + \gamma^{T-t} r(s_T, a_T)$$
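
As a quick sanity check, here is a minimal Python sketch of this return computation; the reward list and discount factor are made-up values for illustration, not from the book.

```python
# Minimal sketch: compute the discounted return G_t for a finite
# sequence of rewards r_t, r_{t+1}, ..., r_T (illustrative values only).
def discounted_return(rewards, gamma):
    g = 0.0
    # Accumulate backwards: G_k = r_k + gamma * G_{k+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0, 1.0]   # hypothetical rewards r_t ... r_T
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9*0 + 0.81*2 + 0.729*1
```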

State-value function

$$\begin{aligned} V^{\pi}(s_t) &= \mathbb{E}_{\tau_{a_t:a_T} \sim p(\tau_{a_t:a_T}|s_t)} \left[ \sum_{k=t}^T \gamma^{k-t} r(s_k, a_k) \,\Big|\, s_t\right] \\ &= \int_{\tau_{a_t:a_T}} \left( \sum_{k=t}^T \gamma^{k-t} r(s_k, a_k) \right) p(\tau_{a_t:a_T} | s_t)\, d\tau_{a_t:a_T} \end{aligned}$$

Here, $\tau_{a_t:a_T}$ is the sequence of states and actions $(a_t, s_{t+1}, a_{t+1}, \dots, s_T, a_T)$ generated by the policy $\pi$. Writing it as $\tau^{\pi}_{a_t:a_T}$ would arguably be more precise.
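
To make the expectation over trajectories concrete, here is a minimal Monte Carlo sketch of estimating $V^{\pi}(s_t)$ by averaging sampled returns. The environment object `env` (with `reset`/`step` methods) and the `policy` function are hypothetical placeholders I am assuming here, not anything from the book.

```python
import numpy as np

def mc_state_value(env, policy, s_t, gamma=0.99, n_episodes=1000, horizon=100):
    """Monte Carlo estimate of V^pi(s_t): average the discounted return
    over trajectories sampled by following the policy from s_t."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset(state=s_t)       # hypothetical: start each rollout at s_t
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)              # sample a_k ~ pi(.|s_k)
            s, r, done = env.step(a)   # sample s_{k+1} ~ p(.|s_k, a_k)
            g += discount * r
            discount *= gamma
            if done:
                break
        returns.append(g)
    return np.mean(returns)
```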

Action-value function

$$\begin{aligned} Q^{\pi}(s_t, a_t) &= \mathbb{E}_{\tau_{s_{t+1}:a_T} \sim p(\tau_{s_{t+1}:a_T}|s_t, a_t)} \left[ \sum_{k=t}^T \gamma^{k-t} r(s_k, a_k) \,\Big|\, s_t, a_t\right] \\ &= \int_{\tau_{s_{t+1}:a_T}} \left( \sum_{k=t}^T \gamma^{k-t} r(s_k, a_k) \right) p(\tau_{s_{t+1}:a_T} | s_t, a_t)\, d\tau_{s_{t+1}:a_T} \end{aligned}$$

Decomposing the trajectory as $\tau_{a_t:a_T} = \tau_{a_t} \cup \tau_{s_{t+1}:a_T}$, the chain rule gives

$$p(\tau_{a_t:a_T} | s_t) = p(a_t, \tau_{s_{t+1}:a_T} | s_t) = p(\tau_{s_{t+1}:a_T}|s_t, a_t)\, \pi(a_t |s_t)$$

Substituting this into the state-value function and splitting the integral,

$$\begin{aligned} V^{\pi}(s_t) &= \int_{\tau_{a_t:a_T}} \left(\sum_{k=t}^T \gamma^{k-t} r(s_k, a_k)\right) p(\tau_{a_t:a_T} | s_t)\, d\tau_{a_t:a_T} \\ &= \int_{a_t} \int_{\tau_{s_{t+1}:a_T}} \left(\sum_{k=t}^T \gamma^{k-t} r(s_k, a_k) \right) p(a_t, \tau_{s_{t+1}:a_T} | s_t)\, d \tau_{s_{t+1}:a_T}\, da_t \\ &= \int_{a_t} \int_{\tau_{s_{t+1}:a_T}} \left(\sum_{k=t}^T \gamma^{k-t} r(s_k, a_k) \right) p(\tau_{s_{t+1}:a_T}| s_t, a_t )\, \pi(a_t | s_t)\, d\tau_{s_{t+1}:a_T}\, d a_t \\ &= \int_{a_t} \underbrace{\left[\int_{\tau_{s_{t+1}:a_T}} \left(\sum_{k=t}^T \gamma^{k-t} r(s_k, a_k) \right) p(\tau_{s_{t+1}:a_T}| s_t, a_t )\, d\tau_{s_{t+1}:a_T} \right]}_{Q^{\pi}(s_t, a_t)} \pi(a_t | s_t)\, d a_t \\ &= \int_{a_t} Q^{\pi}(s_t, a_t)\, \pi(a_t | s_t)\, d a_t = \mathbb{E}_{a_t \sim \pi(a_t |s_t)} \left[ Q^{\pi}(s_t, a_t)\right] \end{aligned}$$

That is, the state-value function can be expressed as the expectation of the action-value function over all actions $a_t$ available in state $s_t$.
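
For a discrete action space this is simply a probability-weighted average. As a made-up numerical example: if in state $s$ the policy picks two actions with probabilities $0.3$ and $0.7$, whose action values are $1$ and $2$, then

$$V^{\pi}(s) = \sum_{a} \pi(a|s)\, Q^{\pi}(s, a) = 0.3 \times 1 + 0.7 \times 2 = 1.7$$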


Bellman Equation

Decomposing the action-value function

Expand the action-value function over the time interval $[t, t+n-1]$:

$$\begin{aligned} Q^{\pi}(s_t, a_t) &= \int_{\tau_{s_{t+1}:a_T}} \left( \sum_{k=t}^T \gamma^{k-t} r(s_k, a_k)\right) p(\tau_{s_{t+1}:a_T} | s_t, a_t)\, d \tau_{s_{t+1}:a_T} \\ &= \int_{\tau_{s_{t+1}:a_T}} \left(r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \sum_{k=t+n}^T \gamma^{k-t} r(s_k, a_k) \right) p(\tau_{s_{t+1}:a_T}| s_t, a_t)\, d\tau_{s_{t+1}:a_T} \\ &= \underbrace{\int_{\tau_{s_{t+1}:a_T}} (r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1})\, p(\tau_{s_{t+1}:a_T} | s_t, a_t)\, d \tau_{s_{t+1}:a_T}}_{Q_1} + \underbrace{\int_{\tau_{s_{t+1}:a_T}} \left( \sum_{k=t+n}^T \gamma^{k-t} r(s_k, a_k) \right) p(\tau_{s_{t+1}:a_T}|s_t, a_t)\, d \tau_{s_{t+1}:a_T}}_{Q_2} \end{aligned}$$

Now let us simplify $Q_1$ and $Q_2$. First, the trajectory can be split as $\tau_{s_{t+1}:a_T} = \tau_{s_{t+1}:s_{t+n}} \cup \tau_{a_{t+n}:a_T}$. As before, the chain rule gives

$$p(\tau_{s_{t+1}:a_T} | s_t, a_t) = p(\tau_{s_{t+1}:s_{t+n}}, \tau_{a_{t+n}:a_T} | s_t, a_t) = p(\tau_{a_{t+n}:a_T} | s_t, a_t, \tau_{s_{t+1}:s_{t+n}})\, p(\tau_{s_{t+1}:s_{t+n}}| s_t, a_t)$$

Substituting this into the expression for $Q_1$,

$$\begin{aligned} Q_1 &= \int_{\tau_{s_{t+1}:a_T}} (r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1})\, p(\tau_{s_{t+1}:a_T} | s_t, a_t)\, d\tau_{s_{t+1}:a_T} \\ &= \int_{\tau_{s_{t+1}:s_{t+n}}} \int_{\tau_{a_{t+n}:a_T}} (r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1})\, p(\tau_{a_{t+n}:a_T} | s_t, a_t, \tau_{s_{t+1}:s_{t+n}})\, d\tau_{a_{t+n}:a_T}\; p(\tau_{s_{t+1}:s_{t+n}}| s_t, a_t)\, d \tau_{s_{t+1}:s_{t+n}} \\ &= \int_{\tau_{s_{t+1}:s_{t+n}}} \Big[ \underbrace{\int_{\tau_{a_{t+n}:a_T}} p(\tau_{a_{t+n}:a_T} | s_t, a_t, \tau_{s_{t+1}:s_{t+n}})\, d\tau_{a_{t+n}:a_T} }_{=1\ (\text{probabilities integrate to } 1)}\Big](r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1})\, p(\tau_{s_{t+1}:s_{t+n}} | s_t, a_t)\, d \tau_{s_{t+1}:s_{t+n}} \\ &= \int_{\tau_{s_{t+1}:s_{t+n}}} (r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1})\, p(\tau_{s_{t+1}:s_{t+n}} | s_t, a_t)\, d \tau_{s_{t+1}:s_{t+n}} \end{aligned}$$

Next, simplify $Q_2$.

$$\begin{aligned} Q_2 &= \int_{\tau_{s_{t+1}:a_T}} \left( \sum_{k=t+n}^T \gamma^{k-t} r(s_k, a_k) \right) p(\tau_{s_{t+1}:a_T}|s_t, a_t)\, d \tau_{s_{t+1}:a_T} \\ &= \int_{\tau_{s_{t+1}:s_{t+n}}}\gamma^n \left[ \int_{\tau_{a_{t+n} : a_T}} \Big( \sum_{k=t+n}^T \gamma^{k-t-n} r(s_k, a_k)\Big) p(\tau_{a_{t+n}:a_T}| s_t, a_t, \tau_{s_{t+1}:s_{t+n}})\, d\tau_{a_{t+n}: a_T} \right] p(\tau_{s_{t+1}:s_{t+n}} | s_t, a_t)\, d \tau_{s_{t+1}:s_{t+n}} \\ &= \int_{\tau_{s_{t+1}:s_{t+n}}}\gamma^n \underbrace{\left[ \int_{\tau_{a_{t+n} : a_T}} \Big( \sum_{k=t+n}^T \gamma^{k-t-n} r(s_k, a_k)\Big) p(\tau_{a_{t+n}:a_T}| s_{t+n})\, d\tau_{a_{t+n}: a_T} \right]}_{V^{\pi}(s_{t+n})} p(\tau_{s_{t+1}:s_{t+n}} | s_t, a_t)\, d \tau_{s_{t+1}:s_{t+n}} \\ &= \int_{\tau_{s_{t+1}:s_{t+n}}}\gamma^n V^{\pi}(s_{t+n})\, p(\tau_{s_{t+1}:s_{t+n}} | s_t, a_t)\, d \tau_{s_{t+1}:s_{t+n}} \end{aligned}$$

Here, $p(\tau_{a_{t+n}:a_T}| s_t, a_t, \tau_{s_{t+1}:s_{t+n}}) = p(\tau_{a_{t+n}:a_T}| s_{t+n})$ holds because the process is assumed to be Markov.

이제 Q1Q_1Q2Q_2를 더하면 다음과 같은 식을 유도할 수 있다.

$$\begin{aligned} Q^{\pi}(s_t, a_t) &= \int_{\tau_{s_{t+1}:s_{t+n}}} \left[(r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1}r_{t+n-1}) + \gamma^n V^{\pi}(s_{t+n})\right] p(\tau_{s_{t+1}:s_{t+n}} | s_t, a_t)\, d \tau_{s_{t+1}:s_{t+n}} \\ &= \mathbb{E}_{\tau_{s_{t+1}:s_{t+n}} \sim p(\tau_{s_{t+1}:s_{t+n}}|s_t, a_t)}\left[r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1}r_{t+n-1} + \gamma^n V^{\pi}(s_{t+n})\right] \end{aligned}$$
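
In particular, for $n = 1$ this reduces to the familiar one-step relation between the action-value and state-value functions:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)}\left[ r(s_t, a_t) + \gamma V^{\pi}(s_{t+1}) \right]$$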

Expanding the state-value function

Using the action-value function above, the state-value function can be rewritten as

$$\begin{aligned} V^{\pi}(s_t) &= \int_{a_t} \pi(a_t | s_t) \Big[ \int_{\tau_{s_{t+1}:s_{t+n}}} [r_t + \cdots + \gamma^n V^{\pi}(s_{t+n})]\, p(\tau_{s_{t+1}:s_{t+n}}|s_t, a_t)\, d\tau_{s_{t+1}:s_{t+n}} \Big]\, d a_t \\ &= \int_{\tau_{a_t: s_{t+n}}} [r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1}r_{t+n-1} + \gamma^n V^{\pi}(s_{t+n})]\, p(\tau_{a_t:s_{t+n}}|s_t)\, d\tau_{a_t:s_{t+n}} \\ &= \mathbb{E}_{\tau_{a_t:s_{t+n}} \sim p(\tau_{a_t:s_{t+n}}|s_t)}\left[r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V^{\pi}(s_{t+n})\right] \end{aligned}$$

where $p(\tau_{a_t:s_{t+n}}|s_t) = \pi(a_t | s_t)\, p(\tau_{s_{t+1}:s_{t+n}}|s_t, a_t)$.
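
This is exactly the quantity that an $n$-step return target estimates from a single sampled trajectory. Here is a minimal Python sketch, assuming a hypothetical list of sampled rewards and a bootstrapped value estimate for $s_{t+n}$:

```python
def n_step_target(rewards, next_value, gamma, n):
    """Compute r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * V(s_{t+n})
    from the first n sampled rewards and a value estimate at s_{t+n}."""
    target = sum(gamma**i * r for i, r in enumerate(rewards[:n]))
    return target + gamma**n * next_value

# Hypothetical sampled rewards and bootstrapped value V^pi(s_{t+n}):
print(n_step_target([1.0, 0.5, 0.0], next_value=2.0, gamma=0.9, n=3))
```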

Bellman equation

If the time interval is split with $n = 1$, the state-value function at time $t$ can be written as

$$\begin{aligned} V^{\pi}(s_t) &= \int_{a_t} \pi(a_t|s_t) \Big[ \int_{s_{t+1}} [r(s_t, a_t) + \gamma V^{\pi}(s_{t+1})]\, p(s_{t+1}|s_t, a_t)\, d s_{t+1}\Big]\, d a_t \\ &= \mathbb{E}_{a_t \sim \pi(a_t|s_t)} \left[\mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)}[r(s_t, a_t) + \gamma V^{\pi}(s_{t+1})]\right] \\ &= \mathbb{E}_{a_t \sim \pi(a_t|s_t)}\left[r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)}[\gamma V^{\pi}(s_{t+1})]\right] \end{aligned}$$

This equation is called the Bellman equation: a recursive relation that expresses the state-value function at time $t$ in terms of the state-value function at the next time step.
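
To illustrate how this recursion is used in practice, here is a minimal tabular policy-evaluation sketch in Python. The tiny two-state MDP (`P`, `R`, `pi`) is entirely made up for illustration; only the update inside the loop is the Bellman equation derived above.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s, a, s'] = p(s' | s, a), R[s, a] = r(s, a), pi[s, a] = pi(a | s).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])
gamma = 0.9

# Iterative policy evaluation: repeatedly apply the Bellman equation
# V(s) <- sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]
V = np.zeros(2)
for _ in range(1000):
    V_new = np.einsum("sa,sa->s", pi, R + gamma * P @ V)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)  # converged state values V^pi(s)
```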

References

  • Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press
  • 박성수, 수학으로 풀어보는 강화학습 원리와 알고리즘 (Reinforcement Learning Principles and Algorithms Explained with Mathematics), Wikibooks