3.2 Monte Carlo (MC)

Tommy Kim · September 12, 2023

How can we get $Q^*$?

We have learned how to compute the maximum of the state value function under the optimal policy. If $Q^*$ is given, all we need to do is find the policy that maximizes $Q^*$. But how can we get $Q^*$?

Actually, we cannot get $Q^*$ directly. Instead, we believe that if we train the agent with an $\epsilon$-greedy policy, its action value function will get closer and closer to $Q^*$. The resulting policy may not be the best one, but we expect its performance to be good enough. After training, we test the agent with greedy actions (not $\epsilon$-greedy!).
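For concreteness, here is a minimal sketch of the two action-selection rules, assuming a tabular $Q$ stored as a 2-D NumPy array (states × actions); the function names and table shape are my own illustration, not from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, state, epsilon):
    """Training-time rule: random action with probability epsilon, else greedy."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit

def greedy_action(Q, state):
    """Test-time rule: always pick the action with the highest Q value."""
    return int(np.argmax(Q[state]))

# Example: 10 states, 4 actions, Q still all zeros
Q_table = np.zeros((10, 4))
print(epsilon_greedy_action(Q_table, state=3, epsilon=0.1))  # sometimes random
print(greedy_action(Q_table, state=3))                       # always the argmax
```

During training we select actions with `epsilon_greedy_action`; once training is done, we evaluate the agent with `greedy_action`.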

Monte Carlo: a method of getting $Q^*$

Recall the definition of expectation:

$$
\begin{aligned}
E[x] &= \int_x x\, p(x)\, dx && (\text{continuous } x)\\
&= \sum_x x\, p(x) && (\text{discrete } x)\\
&\approx \frac{1}{N} \sum\limits_{i=1}^{N} x_i && (x_i \text{ sampled from } p(x))
\end{aligned}
$$

In the last line, the $x_i$ are samples drawn from the distribution $p(x)$, and the sample average takes the place of the expectation.

The Monte Carlo method rests on the law of large numbers: as $N$ increases, the approximation above converges to the true expectation.
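To see the law of large numbers at work, here is a small sketch; the fair-die distribution and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fair six-sided die: E[x] = sum_x x * p(x) = 3.5
true_mean = sum(x * (1 / 6) for x in range(1, 7))

for N in (10, 1_000, 100_000):
    samples = rng.integers(1, 7, size=N)   # N samples x_i drawn from p(x)
    print(N, samples.mean(), true_mean)    # the sample mean approaches 3.5 as N grows
```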

Applying this method, we can express the action value function in a different way:

$$
Q(s_t, a_t) \approx \frac{1}{N} \sum\limits_{i=1}^{N} G_t^{(i)}
$$

In this approximation, each return $G_t^{(i)}$ is a sample drawn from the trajectory distribution $p(s_{t+1}, a_{t+1}, \dots \mid s_t, a_t)$.
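In code, the same idea is just averaging sampled returns. In this sketch, `sample_return` is a hypothetical callable (my own placeholder, not from the post) that rolls out one episode from $(s_t, a_t)$ under the current policy and reports the observed return.

```python
import random

def mc_estimate_q(sample_return, n_samples=1000):
    """Approximate Q(s_t, a_t) by averaging sampled returns G_t^(i).

    `sample_return` is any callable that rolls out one episode starting
    from (s_t, a_t) under the current policy and returns the observed
    return G_t, i.e. one sample from p(s_{t+1}, a_{t+1}, ... | s_t, a_t).
    """
    return sum(sample_return() for _ in range(n_samples)) / n_samples

# Toy check: if the true return is 2.0 plus noise, the estimate approaches 2.0.
print(mc_estimate_q(lambda: 2.0 + random.gauss(0.0, 1.0), n_samples=10_000))
```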

Let's get the idea by reviewing Q-learning!

Let's assume we want to evaluate $Q(s_t, a_t)$ for the 'right' action (the pink area in the figure). Initially, the agent explores all possible routes to the goal (including paths not marked in the figure). The more paths the agent explores, the closer the approximation gets to the true expectation. However, exploring every path takes far too long in practice; the point here is simply to understand the idea.

As the agent runs through many episodes with $\epsilon$-greedy exploration, both $Q$ and the policy improve. We train the agent until they are sufficiently close to $Q^*$ and the optimal policy, respectively. Moreover, the more routes the agent explores, the better its coverage of the underlying trajectory distribution becomes.
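Putting the pieces together, here is a minimal sketch of every-visit Monte Carlo control with $\epsilon$-greedy exploration on a tiny 1×5 corridor. The environment, constants, and reward scheme are my own toy assumptions, not the gridworld from the figure; after enough episodes the greedy policy should walk straight to the goal.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, GOAL = 5, 4            # states 0..4 on a line, goal at the right end
ACTIONS = (-1, +1)               # left, right
GAMMA, EPSILON, EPISODES = 0.9, 0.2, 5000

Q = np.zeros((N_STATES, len(ACTIONS)))
counts = np.zeros_like(Q)        # visit counts, used for incremental averaging

def step(state, action_idx):
    """Toy dynamics: move left/right, reward 1 only on reaching the goal."""
    next_state = min(max(state + ACTIONS[action_idx], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

for _ in range(EPISODES):
    # 1) Generate one episode with epsilon-greedy behaviour.
    state, trajectory, done = 0, [], False
    while not done:
        if rng.random() < EPSILON:
            a = int(rng.integers(len(ACTIONS)))        # explore
        else:
            a = int(np.argmax(Q[state]))               # exploit
        next_state, reward, done = step(state, a)
        trajectory.append((state, a, reward))
        state = next_state

    # 2) Walk the episode backwards, accumulating the return G_t,
    #    and move Q(s_t, a_t) toward the running mean of its sampled returns.
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + GAMMA * G
        counts[s, a] += 1
        Q[s, a] += (G - Q[s, a]) / counts[s, a]

# After training, act greedily (no epsilon) to read off the learned policy.
print("greedy action per state:",
      [("left", "right")[int(np.argmax(Q[s]))] for s in range(GOAL)])
```

Averaging the returns incrementally (dividing by the visit count) is just a memory-friendly way of computing $\frac{1}{N}\sum_i G_t^{(i)}$ without storing every return.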
