4.2 Q-learning (advanced)

Tommy Kim · September 16, 2023

Q-learning

Q-learning is an off-policy method: we use a greedy action for the target policy and an $\epsilon$-greedy action for the behavior policy.

equation for policies

target policy

Since the target policy is greedy, it must pick the best action at every future step, which we can express with a delta function.
$$\mathrm{target}: \ p(a_{t+1} \mid s_{t+1}) = \delta(a_{t+1} - a_{t+1}^*), \qquad a_{t+1}^* = \argmax\limits_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
We saw this idea (the delta function) in 3.1 – optimal policy.

behavior policy

The behavior policy follows $\epsilon$-greedy action selection.
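As a quick reminder, one common way to write the $\epsilon$-greedy behavior policy (assuming a finite action set $\mathcal{A}$) is:

$$\mathrm{behavior}: \ p(a_t \mid s_t) = \begin{cases} 1-\epsilon + \dfrac{\epsilon}{|\mathcal{A}|}, & a_t = \argmax\limits_{a} Q(s_t, a)\\[4pt] \dfrac{\epsilon}{|\mathcal{A}|}, & \text{otherwise} \end{cases}$$

so with probability roughly $1-\epsilon$ we exploit the current greedy action, and with probability $\epsilon$ we explore by picking an action uniformly at random.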

final equation

$$\begin{aligned}
Q(s_t,a_t) &= \int\limits_{s_{t+1},a_{t+1}} \big(R_t + \gamma Q(s_{t+1}, a_{t+1})\big)\,\colorbox{lightgreen}{$p(a_{t+1}\mid s_{t+1})$}\,\colorbox{yellow}{$p(s_{t+1}\mid s_t,a_t)$}\,ds_{t+1}\,da_{t+1}\\
&= \int\limits_{s_{t+1},a_{t+1}} \big(R_t + \gamma Q(s_{t+1}, a_{t+1})\big)\,\colorbox{lightgreen}{$\delta(a_{t+1} - a_{t+1}^*)$}\,\colorbox{yellow}{$p(s_{t+1}\mid s_t,a_t)$}\,ds_{t+1}\,da_{t+1}\\
&\text{Since the delta function is a pdf (its integral is 1)}\\
&\text{and it makes everything 0 except } a_{t+1}^*, \text{ the equation becomes:}\\
&= \int\limits_{s_{t+1}} \big(R_t + \gamma \max\limits_{a_{t+1}} Q(s_{t+1}, a_{t+1})\big)\,p(s_{t+1}\mid s_t,a_t)\,ds_{t+1}
\end{aligned}$$

From the final equation, we can now sample data (the TD target):

$$R_t + \gamma \max\limits_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
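To make the sampling concrete, here is a minimal tabular Q-learning sketch. The environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward, done)`) and the hyperparameter values are assumptions for illustration, not something fixed by the derivation above.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # Assumed env interface: env.reset() -> state index,
    # env.step(a) -> (next_state, reward, done)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy over the current Q
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = np.argmax(Q[s])
            s_next, r, done = env.step(a)
            # TD target: R_t + gamma * max_a' Q(s_{t+1}, a')
            # (the max comes from the greedy target policy)
            td_target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```

The key point is that the action actually executed is chosen $\epsilon$-greedily (behavior policy), while the TD target uses the max over next actions (greedy target policy), which is exactly the off-policy structure described above.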
