In this chapter, we will focus more closely on the expected return.
The state value function gives the return we expect from now on: it evaluates the value of the present state. If the agent is given an arbitrary state, its goal is to maximize the return obtained from that state.
The action value function gives the return we expect from the current action onward. We have already seen the function $Q(s_t, a_t)$ in the previous chapter; this function $Q$ is the action value function.
Recall the equation for return:
$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$$
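To make the formula concrete, here is a minimal Python sketch that computes this discounted sum for a finite list of rewards; the reward values and discount factor are made-up illustration values, not taken from the text.

```python
# Minimal sketch: discounted return G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ...
# for a finite episode. The rewards and gamma below are made-up illustration values.

def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r   # the k-th reward is weighted by gamma^k
    return g

print(discounted_return([1.0, 0.0, 2.0, 1.0]))  # 1 + 0 + 0.81*2 + 0.729*1 = 3.349
```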
And the definition of expectation is:
$$\mathbb{E}[x] = \int x\, p(x)\, dx$$

$$\mathbb{E}[f(x)] = \int f(x)\, p(x)\, dx$$
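In code, such an expectation is usually approximated by sampling rather than by evaluating the integral directly. A minimal sketch, assuming purely for illustration that $p(x)$ is a standard normal and $f(x) = x^2$, so the true value of $\mathbb{E}[f(x)]$ is 1:

```python
# Minimal sketch: estimate E[f(x)] = ∫ f(x) p(x) dx by sampling x ~ p(x)
# and averaging f(x). Here p(x) is a standard normal and f(x) = x**2,
# chosen purely for illustration (true value: 1.0).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)  # samples from p(x)
print(np.mean(x ** 2))            # Monte Carlo estimate of E[x^2], ≈ 1.0
```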
Then, we define the function for the expected return (the state value function) as:
$$V(s_t) \triangleq \int_{a_t:a_\infty} G_t\, p(a_t, s_{t+1}, a_{t+1}, s_{t+2}, \dots \mid s_t)\, da_t{:}a_\infty$$
Here $a_t{:}a_\infty$ means that we integrate over all variables from $a_t$ to $a_\infty$.
In practice, the agent acts many times, generating many sequences $(a_t, s_{t+1}, a_{t+1}, \dots)$, and computes the return $G_t$ for each of them. The agent keeps acting until it finds the maximum expected return; the averaging behind this expectation is sketched below.
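A minimal Python sketch of this idea for the state value function: roll out many trajectories from the same state under a fixed policy, compute each return, and average. The `env.reset(state=...)`, `env.step(...)`, and `policy(...)` interfaces are hypothetical placeholders, not from the text.

```python
# Minimal sketch: Monte Carlo estimate of V(s_t) by averaging sampled returns.
# env.reset(state=...), env.step(a), and policy(s) are hypothetical interfaces.

def estimate_state_value(env, policy, state, gamma=0.99, n_rollouts=1000, horizon=200):
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset(state=state)      # start each rollout from the same state s_t
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)               # sample a_t ~ p(a_t | s_t)
            s, r, done = env.step(a)    # next state, reward, episode-end flag
            g += discount * r
            discount *= gamma
            if done:
                break
        total += g
    return total / n_rollouts           # average return ≈ V(s_t)
```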
We define the action value function as:
$$Q(s_t, a_t) \triangleq \int_{s_{t+1}:a_\infty} G_t\, p(s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \dots \mid s_t, a_t)\, ds_{t+1}{:}a_\infty$$
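By the same token, $Q(s_t, a_t)$ can be approximated by fixing both the starting state and the first action, and then letting the policy act afterwards. Again, this is only a sketch using the same hypothetical `env`/`policy` interfaces as above.

```python
# Minimal sketch: Monte Carlo estimate of Q(s_t, a_t). The first action is fixed;
# subsequent actions follow the policy. Same hypothetical interfaces as above.

def estimate_action_value(env, policy, state, action, gamma=0.99, n_rollouts=1000, horizon=200):
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset(state=state)
        a = action                      # a_t is fixed to the queried action
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env.step(a)
            g += discount * r
            discount *= gamma
            if done:
                break
            a = policy(s)               # a_{t+1}, a_{t+2}, ... come from the policy
        total += g
    return total / n_rollouts           # average return ≈ Q(s_t, a_t)
```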
The optimal policy is, ultimately, the policy that maximizes this expected return. The policy is the set of distributions $p(a_t \mid s_t),\, p(a_{t+1} \mid s_{t+1}),\, \dots,\, p(a_\infty \mid s_\infty)$.
If we apply Bayes' rule to $p(a_t, s_{t+1}, a_{t+1}, s_{t+2}, \dots \mid s_t)$, we can obtain $p(a_t \mid s_t),\, p(a_{t+1} \mid s_{t+1}),\, \dots,\, p(a_\infty \mid s_\infty)$. We will talk about this in more detail in the next chapter.
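As a rough preview, here is a sketch of that factorization under the usual Markov assumption; the environment transition terms $p(s_{t+1} \mid s_t, a_t)$ are my own addition for illustration and are not discussed above:

$$p(a_t, s_{t+1}, a_{t+1}, s_{t+2}, \dots \mid s_t) = p(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\, p(a_{t+1} \mid s_{t+1})\, p(s_{t+2} \mid s_{t+1}, a_{t+1}) \cdots$$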