2.2 State value function, Action value function & Optimal policy

Tommy Kim · September 9, 2023

In this chapter, we will focus more closely on the expected return.

The state value function is a function of the return expected from now on: it evaluates the value of the present state. Given an arbitrary state, the state value function measures the return the agent can expect starting from that state, and the agent's goal is to maximize this expected return.

The action value function is a function of the return expected when taking a particular action in the current state. We met the function $Q(s_t, a_t)$ in the previous chapter; this $Q$ is the action value function.

Recall the equation for return:
$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$
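As a quick sanity check, here is a minimal Python sketch of this formula (the function name and example inputs are my own, not from the original post):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite reward list."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Example: rewards observed from time t onward
print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```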
And the definition of expectation is:
$E[x] = \int x\, p(x)\, dx$
$E[f(x)] = \int f(x)\, p(x)\, dx$
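In practice we rarely evaluate this integral exactly; we approximate it by sampling from $p(x)$ and averaging, which is the Monte Carlo idea used throughout this chapter. A small sketch (the choice of distribution and $f$ is arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate E[f(x)] = ∫ f(x) p(x) dx by a sample average.
# Here p(x) is the standard normal and f(x) = x**2, so E[f(x)] = 1 exactly.
samples = rng.standard_normal(100_000)
print(np.mean(samples**2))  # ≈ 1.0
```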

Then we define the expected return as a function, called the state value function:
$V(s_t) \triangleq \int_{a_t:a_\infty} G_t\, p(a_t, s_{t+1}, a_{t+1}, s_{t+2}, \dots \mid s_t)\, da_t{:}a_\infty$
Here $a_t{:}a_\infty$ means that we integrate over every variable from $a_t$ to $a_\infty$.
Intuitively, the agent generates many trajectories $(a_t, s_{t+1}, a_{t+1}, \dots)$ from the given state, calculates the return $G_t$ for each one, and averages them to estimate the expected return; the agent then chooses its behavior to maximize this quantity.
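This is exactly a Monte Carlo estimate: roll out many episodes from $s_t$ under the policy and average the returns. Below is a hedged Python sketch; the `env.reset_to`/`env.step` interface and the `policy` function are hypothetical stand-ins, not anything from the original post:

```python
def mc_state_value(env, policy, s_t, gamma=0.9, n_rollouts=1000, horizon=100):
    """Monte Carlo estimate of V(s_t): average the return over sampled trajectories.

    Assumes a hypothetical env with reset_to(state) and
    step(action) -> (next_state, reward, done), plus a stochastic
    policy(state) -> action."""
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset_to(s_t)            # start every rollout from the queried state
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                # sample a ~ p(a | s)
            s, r, done = env.step(a)     # environment samples the next state
            g += discount * r            # accumulate G_t term by term
            discount *= gamma
            if done:
                break
        total += g
    return total / n_rollouts
```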

We define the action value function as:
$Q(s_t, a_t) \triangleq \int_{s_{t+1}:a_\infty} G_t\, p(s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \dots \mid s_t, a_t)\, ds_{t+1}{:}a_\infty$
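The only difference from $V(s_t)$ is that the first action $a_t$ is fixed rather than sampled from the policy. A sketch under the same hypothetical interface as above:

```python
def mc_action_value(env, policy, s_t, a_t, gamma=0.9, n_rollouts=1000, horizon=100):
    """Monte Carlo estimate of Q(s_t, a_t): take a_t first, then follow the policy."""
    total = 0.0
    for _ in range(n_rollouts):
        env.reset_to(s_t)
        s, r, done = env.step(a_t)       # the first action is the queried a_t
        g, discount = r, gamma
        steps = 1
        while not done and steps < horizon:
            a = policy(s)                # later actions follow p(a | s)
            s, r, done = env.step(a)
            g += discount * r
            discount *= gamma
            steps += 1
        total += g
    return total / n_rollouts
```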

And the optimal policy is, in the end, the policy that maximizes the value function (the expected return). The policy is the set of distributions $p(a_t \mid s_t), p(a_{t+1} \mid s_{t+1}), \dots, p(a_\infty \mid s_\infty)$.
If we apply the chain rule of probability (Bayes' rule) to $p(a_t, s_{t+1}, a_{t+1}, s_{t+2}, \dots \mid s_t)$, we can factor out exactly these terms $p(a_t \mid s_t), p(a_{t+1} \mid s_{t+1}), \dots, p(a_\infty \mid s_\infty)$. We will discuss this in more detail in the next chapter.
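For reference, the factorization is a standard decomposition: by the chain rule, together with the Markov property (each state depends only on the previous state and action, and each action only on the current state), the trajectory probability splits as

$p(a_t, s_{t+1}, a_{t+1}, s_{t+2}, \dots \mid s_t) = p(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\, p(a_{t+1} \mid s_{t+1})\, p(s_{t+2} \mid s_{t+1}, a_{t+1}) \cdots$

The $p(a \mid s)$ factors are the policy, which we can control; the $p(s' \mid s, a)$ factors are the environment's transition dynamics, which we cannot.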
