2.3 Bellman equation

Tommy Kim · September 9, 2023

The Bellman equation is a method of expressing the state value function and the action value function recursively, in terms of their values at the next time step.

Bellman equation for the state value function

Let's change the form of the state value function using Bayes' rule.
Recall Bayes' rule:
$$p(x, y) = p(x|y)\,p(y)$$
$$p(x, y|z) = p(x|y, z)\,p(y|z)$$
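As a quick sanity check, here is a small Python snippet that verifies the first identity numerically on a discrete 2×2 distribution. The joint table values are made up purely for illustration:

```python
import numpy as np

# Hypothetical 2x2 joint distribution p(x, y); values chosen only for illustration.
p_xy = np.array([[0.1, 0.3],   # p(x=0, y=0), p(x=0, y=1)
                 [0.2, 0.4]])  # p(x=1, y=0), p(x=1, y=1)

p_y = p_xy.sum(axis=0)        # marginal p(y)
p_x_given_y = p_xy / p_y      # conditional p(x | y), one column per value of y

# Reassembling the joint from the factorization p(x|y) p(y) recovers the original table.
print(np.allclose(p_x_given_y * p_y, p_xy))   # True
```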
The state value function becomes:

$$\begin{aligned} V(s_t) &= \int\limits_{a_t:a_\infty} G_t\, p(a_t, s_{t+1}, a_{t+1}, \dots | s_t)\, da_t:a_\infty \\ &= \int\limits_{a_t} \colorbox{aqua}{$\int\limits_{s_{t+1}:a_\infty} G_t\, p(s_{t+1}, a_{t+1}, \dots | s_t, a_t)\, ds_{t+1}:a_\infty$}\; p(a_t|s_t)\, da_t \end{aligned}$$

We can see that the blue box is actually the action value function $Q(s_t, a_t)$.
So the state value function becomes:
$$V(s_t) = \int\limits_{a_t} Q(s_t, a_t)\, p(a_t|s_t)\, da_t$$
For example, in Q-learning, the state value function is the expectation of the Q values over all actions available in the current state, weighted by the policy.
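To make this concrete, here is a minimal sketch of the discrete-action case, where the integral becomes a sum: the state value is the policy-weighted average of the action values. The function name and the numbers are made up for illustration:

```python
import numpy as np

# V(s_t) = sum_a Q(s_t, a) p(a | s_t): policy-weighted expectation of the Q values.
def state_value_from_q(q_values: np.ndarray, policy: np.ndarray) -> float:
    """q_values[a] = Q(s, a) and policy[a] = p(a | s), for one fixed state s."""
    return float(np.dot(policy, q_values))

q = np.array([1.0, 2.0, 0.5])     # Q(s, a) for three actions
pi = np.array([0.2, 0.5, 0.3])    # policy p(a | s), sums to 1
print(state_value_from_q(q, pi))  # 0.2*1.0 + 0.5*2.0 + 0.3*0.5 = 1.35
```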

Let's rewrite the equation in a different way.

$$\begin{aligned} V(s_t) &= \int\limits_{a_t:a_\infty} G_t\, p(a_t, s_{t+1}, a_{t+1}, \dots | s_t)\, da_t:a_\infty \\ &= \int\limits_{a_t, s_{t+1}} \int\limits_{a_{t+1}:a_\infty} G_t\, p(a_{t+1}, \dots | \colorbox{aqua}{$s_t, a_t$}, s_{t+1})\, da_{t+1}:a_\infty \; p(a_t, s_{t+1}|s_t)\, da_t, s_{t+1} \end{aligned}$$

As we learned (the Markov property), once we have information about $s_{t+1}$ we no longer need $s_t, a_t$ in the conditioning.
$G_t$ can be expressed as $R_t + \gamma G_{t+1}$. Therefore, the equation becomes:
$$V(s_t) = \int\limits_{a_t, s_{t+1}} \int\limits_{a_{t+1}:a_\infty} (R_t + \gamma G_{t+1})\, p(a_{t+1}, \dots | s_{t+1})\, da_{t+1}:a_\infty \; p(a_t, s_{t+1}|s_t)\, da_t, s_{t+1}$$

Looking at the inner integral, we recognize that
$$\int\limits_{a_{t+1}:a_\infty} G_{t+1}\, p(a_{t+1}, \dots | s_{t+1})\, da_{t+1}:a_\infty = V(s_{t+1}),$$
while the $R_t$ term is unchanged because the probability density integrates to 1. The final equation is:
$$\begin{aligned} V(s_t) &= \int\limits_{a_t, s_{t+1}} (R_t + \gamma V(s_{t+1}))\, p(a_t, s_{t+1}|s_t)\, da_t, s_{t+1} \\ &= \int\limits_{a_t, s_{t+1}} (R_t + \gamma V(s_{t+1}))\, \colorbox{lightgreen}{$p(s_{t+1}|s_t, a_t)$}\, \colorbox{aqua}{$p(a_t|s_t)$}\, da_t, s_{t+1} \end{aligned}$$
This form has the advantage that the policy (blue box) appears explicitly. The green box is called the transition probability.
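Here is a minimal sketch of what the final equation looks like as one Bellman expectation backup in a small tabular MDP. The array shapes, the function name, and the use of an expected reward `R[s, a]` are assumptions for illustration, not a fixed implementation:

```python
import numpy as np

# One Bellman expectation backup for V in a tabular MDP (hypothetical setup):
#   V(s) = sum_a p(a|s) sum_s1 p(s1|s,a) [ R(s,a) + gamma * V(s1) ]
def bellman_backup_v(V, policy, P, R, gamma=0.9):
    """V[s] values, policy[s, a] = p(a|s), P[s, a, s1] = p(s1|s, a), R[s, a] = expected reward."""
    n_states, n_actions = policy.shape
    V_new = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            expected_next = P[s, a] @ V   # expectation over s1 via the transition probability (green box)
            V_new[s] += policy[s, a] * (R[s, a] + gamma * expected_next)  # weighted by the policy (blue box)
    return V_new
```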

Bellman equation for the action value function

Now we will rewrite the action value function in the same way.

$$\begin{aligned} Q(s_t, a_t) &= \int\limits_{s_{t+1}:a_\infty} G_t\, p(s_{t+1}, a_{t+1}, \dots | s_t, a_t)\, ds_{t+1}:a_\infty \\ &= \int\limits_{s_{t+1}} \int\limits_{a_{t+1}:a_\infty} \colorbox{yellow}{$G_t$}\, p(a_{t+1}, \dots | \colorbox{aqua}{$s_t, a_t$}, s_{t+1})\, da_{t+1}:a_\infty \; p(s_{t+1}|s_t, a_t)\, ds_{t+1} \\ &= \int\limits_{s_{t+1}} \int\limits_{a_{t+1}:a_\infty} (\colorbox{yellow}{$R_t + \gamma G_{t+1}$})\, p(a_{t+1}, \dots | s_{t+1})\, da_{t+1}:a_\infty \; p(s_{t+1}|s_t, a_t)\, ds_{t+1} \\ &= \int\limits_{s_{t+1}} (R_t + \gamma V(s_{t+1}))\, p(s_{t+1}|s_t, a_t)\, ds_{t+1} \end{aligned}$$

We have applied the same principle as before. As the last line shows, $Q(s_t, a_t)$ can be expressed in terms of $V(s_{t+1})$.
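As a small illustration, in a tabular MDP the last line becomes a simple sum over next states. The names and shapes below are assumptions for illustration:

```python
import numpy as np

# Q(s, a) = R(s, a) + gamma * sum_s1 p(s1|s, a) V(s1), vectorized over all (s, a).
def q_from_v(V, P, R, gamma=0.9):
    """V[s1] = state values, P[s, a, s1] = p(s1|s, a), R[s, a] = expected reward; returns Q[s, a]."""
    # P @ V sums over the last axis of P, i.e. over next states s1
    return R + gamma * P @ V
```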

Let's rewrite the action value function once more.

$$\begin{aligned} Q(s_t, a_t) &= \int\limits_{s_{t+1}:a_\infty} G_t\, p(s_{t+1}, a_{t+1}, \dots | s_t, a_t)\, ds_{t+1}:a_\infty \\ &= \int\limits_{s_{t+1}, a_{t+1}} \int\limits_{s_{t+2}:a_\infty} \colorbox{yellow}{$G_t$}\, p(s_{t+2}, \dots | \colorbox{aqua}{$s_t, a_t$}, s_{t+1}, a_{t+1})\, ds_{t+2}:a_\infty \; p(s_{t+1}, a_{t+1}|s_t, a_t)\, ds_{t+1}, a_{t+1} \\ &= \int\limits_{s_{t+1}, a_{t+1}} \int\limits_{s_{t+2}:a_\infty} (\colorbox{yellow}{$R_t + \gamma G_{t+1}$})\, p(s_{t+2}, \dots | s_{t+1}, a_{t+1})\, ds_{t+2}:a_\infty \; p(s_{t+1}, a_{t+1}|s_t, a_t)\, ds_{t+1}, a_{t+1} \\ &= \int\limits_{s_{t+1}, a_{t+1}} (R_t + \gamma Q(s_{t+1}, a_{t+1}))\, p(s_{t+1}, a_{t+1}|s_t, a_t)\, ds_{t+1}, a_{t+1} \\ &= \int\limits_{s_{t+1}, a_{t+1}} (R_t + \gamma Q(s_{t+1}, a_{t+1}))\, \colorbox{aqua}{$p(a_{t+1}|s_t, a_t, s_{t+1})$}\, p(s_{t+1}|s_t, a_t)\, ds_{t+1}, a_{t+1} \\ &= \int\limits_{s_{t+1}, a_{t+1}} (R_t + \gamma Q(s_{t+1}, a_{t+1}))\, \colorbox{aqua}{$p(a_{t+1}|s_{t+1})$}\, \colorbox{lightgreen}{$p(s_{t+1}|s_t, a_t)$}\, ds_{t+1}, a_{t+1} \end{aligned}$$

Here we used $\int\limits_{s_{t+2}:a_\infty} G_{t+1}\, p(s_{t+2}, \dots | s_{t+1}, a_{t+1})\, ds_{t+2}:a_\infty = Q(s_{t+1}, a_{t+1})$, just as we recognized $V(s_{t+1})$ before. Again, by rewriting the equation we reduced the amount of integration needed, and both the policy (blue box) and the transition probability (green box) appear explicitly in the equation.
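Finally, here is a minimal sketch of the last line as one Bellman expectation backup for Q in a tabular MDP, where the policy (blue box) and the transition probability (green box) show up as separate arrays. All names and shapes are assumptions for illustration:

```python
import numpy as np

# One Bellman expectation backup for Q in a tabular MDP (hypothetical setup):
#   Q(s, a) = R(s, a) + gamma * sum_s1 p(s1|s, a) sum_a1 p(a1|s1) Q(s1, a1)
def bellman_backup_q(Q, policy, P, R, gamma=0.9):
    """Q[s, a], policy[s1, a1] = p(a1|s1), P[s, a, s1] = p(s1|s, a), R[s, a] = expected reward."""
    V_next = np.sum(policy * Q, axis=1)   # expectation over a1 under the policy (blue box)
    return R + gamma * P @ V_next         # expectation over s1 under the transition probability (green box)
```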
