Q-learning
Q-learning is an off-policy method: the target policy takes the greedy action, while the behavior policy takes ϵ-greedy actions.
equations for the policies
target policy
Since the target policy takes the greedy action, it must put all of its probability mass on the best next action (this is where the delta function comes in!).
$$\text{target: } p(a_{t+1} \mid s_{t+1}) = \delta\big(a_{t+1} - a^{*}_{t+1}\big), \qquad a^{*}_{t+1} = \arg\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
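As a quick illustration, here is a minimal Python sketch of this target policy, assuming a hypothetical tabular `Q` array of shape `[n_states, n_actions]` (an assumption, not from the notes): the delta function simply means the policy puts probability 1 on the argmax action.

```python
import numpy as np

def greedy_target_policy(Q, s_next):
    """Target policy: a delta function on the greedy action.

    Q is assumed to be a table of shape [n_states, n_actions];
    returns a probability vector over actions for state s_next.
    """
    n_actions = Q.shape[1]
    a_star = np.argmax(Q[s_next])   # a*_{t+1} = argmax_a Q(s_{t+1}, a)
    p = np.zeros(n_actions)
    p[a_star] = 1.0                 # all probability mass on the greedy action
    return p
```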
We saw this idea (the delta function) in 3.1 - optimal policy.
behavior policy
The behavior policy follows an ϵ-greedy strategy: with probability ϵ it explores with a random action, and otherwise it takes the greedy action (see the sketch below).
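A minimal sketch of the behavior policy under the same tabular-`Q` assumption; the name `epsilon_greedy_action` and the default `epsilon=0.1` are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, s, epsilon=0.1):
    """Behavior policy: explore with probability epsilon, else act greedily."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # random exploratory action
    return int(np.argmax(Q[s]))              # greedy (exploiting) action
```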
final equation
$$Q(s_t, a_t) = \int_{s_{t+1},\, a_{t+1}} \big(R_t + \gamma\, Q(s_{t+1}, a_{t+1})\big)\, p(a_{t+1} \mid s_{t+1})\, p(s_{t+1} \mid s_t, a_t)\, ds_{t+1}\, da_{t+1}$$

$$= \int_{s_{t+1},\, a_{t+1}} \big(R_t + \gamma\, Q(s_{t+1}, a_{t+1})\big)\, \delta\big(a_{t+1} - a^{*}_{t+1}\big)\, p(s_{t+1} \mid s_t, a_t)\, ds_{t+1}\, da_{t+1}$$

Since the delta function is a pdf (its integral is 1) and it zeroes out everything except $a^{*}_{t+1}$, the equation becomes:

$$= \int_{s_{t+1}} \Big(R_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\Big)\, p(s_{t+1} \mid s_t, a_t)\, ds_{t+1}$$
From the final equation, we can now sample a single transition to form the TD target:
$$R_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
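Putting the two policies together, here is a minimal tabular Q-learning update sketch; the transition `(s, a, R, s_next)` and the values of `alpha` and `gamma` are illustrative assumptions, not from the notes. The `td_target` variable is exactly the sampled TD target above.

```python
import numpy as np

def q_learning_update(Q, s, a, R, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s_t, a_t) toward the sampled TD target."""
    td_target = R + gamma * np.max(Q[s_next])  # R_t + gamma * max_a Q(s_{t+1}, a)
    Q[s, a] += alpha * (td_target - Q[s, a])   # TD error scaled by learning rate
    return Q
```

Note that the action actually executed in the environment comes from the ϵ-greedy behavior policy, while the `max` in the TD target comes from the greedy target policy; this mismatch is exactly what makes Q-learning off-policy.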