2.3 Bellman equation

Tommy Kim · September 9, 2023

The Bellman equation is a method of expressing the state value function and the action value function recursively, in terms of their values at the next time step.

Bellman equation for the state value function

Let's change the form of the state value function using Bayes' rule.
Recall Bayes' rule:
$$p(x, y) = p(x|y)\,p(y)$$
$$p(x, y|z) = p(x|y, z)\,p(y|z)$$
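As a quick sanity check, here is a small Python snippet that verifies the first identity numerically on a discrete 2×2 distribution. The joint table values are made up purely for illustration:

```python
import numpy as np

# Hypothetical 2x2 joint distribution p(x, y); values chosen only for illustration.
p_xy = np.array([[0.1, 0.3],   # p(x=0, y=0), p(x=0, y=1)
                 [0.2, 0.4]])  # p(x=1, y=0), p(x=1, y=1)

p_y = p_xy.sum(axis=0)        # marginal p(y)
p_x_given_y = p_xy / p_y      # conditional p(x | y), one column per value of y

# Reassembling the joint from the factorization p(x|y) p(y) recovers the original table.
print(np.allclose(p_x_given_y * p_y, p_xy))   # True
```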
The state value function becomes:

$$\begin{aligned} V(s_t) &= \int\limits_{a_t:a_\infty} G_t\, p(a_t, s_{t+1}, a_{t+1}, \dots | s_t)\, da_t:a_\infty \\ &= \int\limits_{a_t} \colorbox{aqua}{$\int\limits_{s_{t+1}:a_\infty} G_t\, p(s_{t+1}, a_{t+1}, \dots | s_t, a_t)\, ds_{t+1}:a_\infty$}\; p(a_t|s_t)\, da_t \end{aligned}$$

We can see that the blue box is actually the action value function $Q(s_t, a_t)$.
So the state value function becomes:
$$V(s_t) = \int\limits_{a_t} Q(s_t, a_t)\, p(a_t|s_t)\, da_t$$
For example, in Q-learning, the state value function is the expectation of the Q values over all actions available in the current state, weighted by the policy.
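To make this concrete, here is a minimal sketch of the discrete-action case, where the integral becomes a sum: the state value is the policy-weighted average of the action values. The function name and the numbers are made up for illustration:

```python
import numpy as np

# V(s_t) = sum_a Q(s_t, a) p(a | s_t): policy-weighted expectation of the Q values.
def state_value_from_q(q_values: np.ndarray, policy: np.ndarray) -> float:
    """q_values[a] = Q(s, a) and policy[a] = p(a | s), for one fixed state s."""
    return float(np.dot(policy, q_values))

q = np.array([1.0, 2.0, 0.5])     # Q(s, a) for three actions
pi = np.array([0.2, 0.5, 0.3])    # policy p(a | s), sums to 1
print(state_value_from_q(q, pi))  # 0.2*1.0 + 0.5*2.0 + 0.3*0.5 = 1.35
```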

Let's rewrite the equation in a different way.

$$\begin{aligned} V(s_t) &= \int\limits_{a_t:a_\infty} G_t\, p(a_t, s_{t+1}, a_{t+1}, \dots | s_t)\, da_t:a_\infty \\ &= \int\limits_{a_t, s_{t+1}} \int\limits_{a_{t+1}:a_\infty} G_t\, p(a_{t+1}, \dots | \colorbox{aqua}{$s_t, a_t$}, s_{t+1})\, da_{t+1}:a_\infty \; p(a_t, s_{t+1}|s_t)\, da_t, s_{t+1} \end{aligned}$$

As we learned (the Markov property), once we have information about $s_{t+1}$ we no longer need $s_t, a_t$ in the conditioning.
$G_t$ can be expressed as $R_t + \gamma G_{t+1}$. Therefore, the equation becomes:
$$V(s_t) = \int\limits_{a_t, s_{t+1}} \int\limits_{a_{t+1}:a_\infty} (R_t + \gamma G_{t+1})\, p(a_{t+1}, \dots | s_{t+1})\, da_{t+1}:a_\infty \; p(a_t, s_{t+1}|s_t)\, da_t, s_{t+1}$$

Looking at the inner integral, we recognize that
$$\int\limits_{a_{t+1}:a_\infty} G_{t+1}\, p(a_{t+1}, \dots | s_{t+1})\, da_{t+1}:a_\infty = V(s_{t+1}),$$
while the $R_t$ term is unchanged because the probability density integrates to 1. The final equation is:
$$\begin{aligned} V(s_t) &= \int\limits_{a_t, s_{t+1}} (R_t + \gamma V(s_{t+1}))\, p(a_t, s_{t+1}|s_t)\, da_t, s_{t+1} \\ &= \int\limits_{a_t, s_{t+1}} (R_t + \gamma V(s_{t+1}))\, \colorbox{lightgreen}{$p(s_{t+1}|s_t, a_t)$}\, \colorbox{aqua}{$p(a_t|s_t)$}\, da_t, s_{t+1} \end{aligned}$$
This form has the advantage that the policy (blue box) appears explicitly. The green box is called the transition probability.
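Here is a minimal sketch of what the final equation looks like as one Bellman expectation backup in a small tabular MDP. The array shapes, the function name, and the use of an expected reward `R[s, a]` are assumptions for illustration, not a fixed implementation:

```python
import numpy as np

# One Bellman expectation backup for V in a tabular MDP (hypothetical setup):
#   V(s) = sum_a p(a|s) sum_s1 p(s1|s,a) [ R(s,a) + gamma * V(s1) ]
def bellman_backup_v(V, policy, P, R, gamma=0.9):
    """V[s] values, policy[s, a] = p(a|s), P[s, a, s1] = p(s1|s, a), R[s, a] = expected reward."""
    n_states, n_actions = policy.shape
    V_new = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            expected_next = P[s, a] @ V   # expectation over s1 via the transition probability (green box)
            V_new[s] += policy[s, a] * (R[s, a] + gamma * expected_next)  # weighted by the policy (blue box)
    return V_new
```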

Bellman equation for the action value function

Now we will rewrite the action value function in the same way.

$$\begin{aligned} Q(s_t, a_t) &= \int\limits_{s_{t+1}:a_\infty} G_t\, p(s_{t+1}, a_{t+1}, \dots | s_t, a_t)\, ds_{t+1}:a_\infty \\ &= \int\limits_{s_{t+1}} \int\limits_{a_{t+1}:a_\infty} \colorbox{yellow}{$G_t$}\, p(a_{t+1}, \dots | \colorbox{aqua}{$s_t, a_t$}, s_{t+1})\, da_{t+1}:a_\infty \; p(s_{t+1}|s_t, a_t)\, ds_{t+1} \\ &= \int\limits_{s_{t+1}} \int\limits_{a_{t+1}:a_\infty} (\colorbox{yellow}{$R_t + \gamma G_{t+1}$})\, p(a_{t+1}, \dots | s_{t+1})\, da_{t+1}:a_\infty \; p(s_{t+1}|s_t, a_t)\, ds_{t+1} \\ &= \int\limits_{s_{t+1}} (R_t + \gamma V(s_{t+1}))\, p(s_{t+1}|s_t, a_t)\, ds_{t+1} \end{aligned}$$

We have applied the same principle as before. As the last line shows, $Q(s_t, a_t)$ can be expressed in terms of $V(s_{t+1})$.
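As a small illustration, in a tabular MDP the last line becomes a simple sum over next states. The names and shapes below are assumptions for illustration:

```python
import numpy as np

# Q(s, a) = R(s, a) + gamma * sum_s1 p(s1|s, a) V(s1), vectorized over all (s, a).
def q_from_v(V, P, R, gamma=0.9):
    """V[s1] = state values, P[s, a, s1] = p(s1|s, a), R[s, a] = expected reward; returns Q[s, a]."""
    # P @ V sums over the last axis of P, i.e. over next states s1
    return R + gamma * P @ V
```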

Let's rewrite the action value function once more.

$$\begin{aligned} Q(s_t, a_t) &= \int\limits_{s_{t+1}:a_\infty} G_t\, p(s_{t+1}, a_{t+1}, \dots | s_t, a_t)\, ds_{t+1}:a_\infty \\ &= \int\limits_{s_{t+1}, a_{t+1}} \int\limits_{s_{t+2}:a_\infty} \colorbox{yellow}{$G_t$}\, p(s_{t+2}, \dots | \colorbox{aqua}{$s_t, a_t$}, s_{t+1}, a_{t+1})\, ds_{t+2}:a_\infty \; p(s_{t+1}, a_{t+1}|s_t, a_t)\, ds_{t+1}, a_{t+1} \\ &= \int\limits_{s_{t+1}, a_{t+1}} \int\limits_{s_{t+2}:a_\infty} (\colorbox{yellow}{$R_t + \gamma G_{t+1}$})\, p(s_{t+2}, \dots | s_{t+1}, a_{t+1})\, ds_{t+2}:a_\infty \; p(s_{t+1}, a_{t+1}|s_t, a_t)\, ds_{t+1}, a_{t+1} \\ &= \int\limits_{s_{t+1}, a_{t+1}} (R_t + \gamma Q(s_{t+1}, a_{t+1}))\, p(s_{t+1}, a_{t+1}|s_t, a_t)\, ds_{t+1}, a_{t+1} \\ &= \int\limits_{s_{t+1}, a_{t+1}} (R_t + \gamma Q(s_{t+1}, a_{t+1}))\, \colorbox{aqua}{$p(a_{t+1}|s_t, a_t, s_{t+1})$}\, p(s_{t+1}|s_t, a_t)\, ds_{t+1}, a_{t+1} \\ &= \int\limits_{s_{t+1}, a_{t+1}} (R_t + \gamma Q(s_{t+1}, a_{t+1}))\, \colorbox{aqua}{$p(a_{t+1}|s_{t+1})$}\, \colorbox{lightgreen}{$p(s_{t+1}|s_t, a_t)$}\, ds_{t+1}, a_{t+1} \end{aligned}$$

Here we used $\int\limits_{s_{t+2}:a_\infty} G_{t+1}\, p(s_{t+2}, \dots | s_{t+1}, a_{t+1})\, ds_{t+2}:a_\infty = Q(s_{t+1}, a_{t+1})$, just as we recognized $V(s_{t+1})$ before. Again, by rewriting the equation we reduced the amount of integration needed, and both the policy (blue box) and the transition probability (green box) appear explicitly in the equation.
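Finally, here is a minimal sketch of the last line as one Bellman expectation backup for Q in a tabular MDP, where the policy (blue box) and the transition probability (green box) show up as separate arrays. All names and shapes are assumptions for illustration:

```python
import numpy as np

# One Bellman expectation backup for Q in a tabular MDP (hypothetical setup):
#   Q(s, a) = R(s, a) + gamma * sum_s1 p(s1|s, a) sum_a1 p(a1|s1) Q(s1, a1)
def bellman_backup_q(Q, policy, P, R, gamma=0.9):
    """Q[s, a], policy[s1, a1] = p(a1|s1), P[s, a, s1] = p(s1|s, a), R[s, a] = expected reward."""
    V_next = np.sum(policy * Q, axis=1)   # expectation over a1 under the policy (blue box)
    return R + gamma * P @ V_next         # expectation over s1 under the transition probability (green box)
```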
