In this chapter, we will focus more closely on the expected return.
The state value function gives the return we expect from now on: it evaluates the value of the present state. If the agent is given an arbitrary state, its goal is to maximize the return obtained from that state.
The action value function gives the return we expect from the current action onward. We have already seen the function $Q(s_t, a_t)$ in the previous chapter; this function $Q$ is the action value function.
Recall the equation for return:
$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$$
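To make the formula concrete, here is a minimal Python sketch that computes this discounted sum for a finite list of rewards; the reward values and discount factor are made-up illustration values, not taken from the text.

```python
# Minimal sketch: discounted return G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ...
# for a finite episode. The rewards and gamma below are made-up illustration values.

def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r   # the k-th reward is weighted by gamma^k
    return g

print(discounted_return([1.0, 0.0, 2.0, 1.0]))  # 1 + 0 + 0.81*2 + 0.729*1 = 3.349
```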
And the definition of expectation is:
$$\mathbb{E}[x] = \int x\, p(x)\, dx$$

$$\mathbb{E}[f(x)] = \int f(x)\, p(x)\, dx$$
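In code, such an expectation is usually approximated by sampling rather than by evaluating the integral directly. A minimal sketch, assuming purely for illustration that $p(x)$ is a standard normal and $f(x) = x^2$, so the true value of $\mathbb{E}[f(x)]$ is 1:

```python
# Minimal sketch: estimate E[f(x)] = ∫ f(x) p(x) dx by sampling x ~ p(x)
# and averaging f(x). Here p(x) is a standard normal and f(x) = x**2,
# chosen purely for illustration (true value: 1.0).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)  # samples from p(x)
print(np.mean(x ** 2))            # Monte Carlo estimate of E[x^2], ≈ 1.0
```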
Then, we define the function for the expected return (the state value function) as:
$$V(s_t) \triangleq \int_{a_t:a_\infty} G_t\, p(a_t, s_{t+1}, a_{t+1}, s_{t+2}, \dots \mid s_t)\, da_t{:}a_\infty$$
Here $a_t{:}a_\infty$ means that we integrate over all variables from $a_t$ to $a_\infty$.
In practice, the agent acts many times, generating many sequences $(a_t, s_{t+1}, a_{t+1}, \dots)$, and computes the return $G_t$ for each of them. The agent keeps acting until it finds the maximum expected return; the averaging behind this expectation is sketched below.
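A minimal Python sketch of this idea for the state value function: roll out many trajectories from the same state under a fixed policy, compute each return, and average. The `env.reset(state=...)`, `env.step(...)`, and `policy(...)` interfaces are hypothetical placeholders, not from the text.

```python
# Minimal sketch: Monte Carlo estimate of V(s_t) by averaging sampled returns.
# env.reset(state=...), env.step(a), and policy(s) are hypothetical interfaces.

def estimate_state_value(env, policy, state, gamma=0.99, n_rollouts=1000, horizon=200):
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset(state=state)      # start each rollout from the same state s_t
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)               # sample a_t ~ p(a_t | s_t)
            s, r, done = env.step(a)    # next state, reward, episode-end flag
            g += discount * r
            discount *= gamma
            if done:
                break
        total += g
    return total / n_rollouts           # average return ≈ V(s_t)
```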
We define the action value function as:
$$Q(s_t, a_t) \triangleq \int_{s_{t+1}:a_\infty} G_t\, p(s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \dots \mid s_t, a_t)\, ds_{t+1}{:}a_\infty$$
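By the same token, $Q(s_t, a_t)$ can be approximated by fixing both the starting state and the first action, and then letting the policy act afterwards. Again, this is only a sketch using the same hypothetical `env`/`policy` interfaces as above.

```python
# Minimal sketch: Monte Carlo estimate of Q(s_t, a_t). The first action is fixed;
# subsequent actions follow the policy. Same hypothetical interfaces as above.

def estimate_action_value(env, policy, state, action, gamma=0.99, n_rollouts=1000, horizon=200):
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset(state=state)
        a = action                      # a_t is fixed to the queried action
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r, done = env.step(a)
            g += discount * r
            discount *= gamma
            if done:
                break
            a = policy(s)               # a_{t+1}, a_{t+2}, ... come from the policy
        total += g
    return total / n_rollouts           # average return ≈ Q(s_t, a_t)
```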
The optimal policy is, ultimately, the policy that maximizes this expected return. The policy is the set of distributions $p(a_t \mid s_t),\, p(a_{t+1} \mid s_{t+1}),\, \dots,\, p(a_\infty \mid s_\infty)$.
If we apply Bayes' rule to $p(a_t, s_{t+1}, a_{t+1}, s_{t+2}, \dots \mid s_t)$, we can obtain $p(a_t \mid s_t),\, p(a_{t+1} \mid s_{t+1}),\, \dots,\, p(a_\infty \mid s_\infty)$. We will talk about this in more detail in the next chapter.
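As a rough preview, here is a sketch of that factorization under the usual Markov assumption; the environment transition terms $p(s_{t+1} \mid s_t, a_t)$ are my own addition for illustration and are not discussed above:

$$p(a_t, s_{t+1}, a_{t+1}, s_{t+2}, \dots \mid s_t) = p(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\, p(a_{t+1} \mid s_{t+1})\, p(s_{t+2} \mid s_{t+1}, a_{t+1}) \cdots$$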