Q-learning
Q-learning is an off-policy method: the target policy takes the greedy action, while the behavior policy takes ϵ-greedy actions.
equations for the policies
target policy
Since the target policy takes the greedy action, it must put all of its probability mass on the best next action (this is where the delta function comes in!).
$$\text{target: } p(a_{t+1} \mid s_{t+1}) = \delta\big(a_{t+1} - a^{*}_{t+1}\big), \qquad a^{*}_{t+1} = \arg\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
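As a quick illustration, here is a minimal Python sketch of this target policy, assuming a hypothetical tabular `Q` array of shape `[n_states, n_actions]` (an assumption, not from the notes): the delta function simply means the policy puts probability 1 on the argmax action.

```python
import numpy as np

def greedy_target_policy(Q, s_next):
    """Target policy: a delta function on the greedy action.

    Q is assumed to be a table of shape [n_states, n_actions];
    returns a probability vector over actions for state s_next.
    """
    n_actions = Q.shape[1]
    a_star = np.argmax(Q[s_next])   # a*_{t+1} = argmax_a Q(s_{t+1}, a)
    p = np.zeros(n_actions)
    p[a_star] = 1.0                 # all probability mass on the greedy action
    return p
```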
We saw this idea (the delta function) in 3.1 - optimal policy.
behavior policy
The behavior policy follows an ϵ-greedy strategy: with probability ϵ it explores with a random action, and otherwise it takes the greedy action (see the sketch below).
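A minimal sketch of the behavior policy under the same tabular-`Q` assumption; the name `epsilon_greedy_action` and the default `epsilon=0.1` are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, s, epsilon=0.1):
    """Behavior policy: explore with probability epsilon, else act greedily."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # random exploratory action
    return int(np.argmax(Q[s]))              # greedy (exploiting) action
```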
final equation
$$Q(s_t, a_t) = \int_{s_{t+1},\, a_{t+1}} \big(R_t + \gamma\, Q(s_{t+1}, a_{t+1})\big)\, p(a_{t+1} \mid s_{t+1})\, p(s_{t+1} \mid s_t, a_t)\, ds_{t+1}\, da_{t+1}$$

$$= \int_{s_{t+1},\, a_{t+1}} \big(R_t + \gamma\, Q(s_{t+1}, a_{t+1})\big)\, \delta\big(a_{t+1} - a^{*}_{t+1}\big)\, p(s_{t+1} \mid s_t, a_t)\, ds_{t+1}\, da_{t+1}$$

Since the delta function is a pdf (its integral is 1) and it zeroes out everything except $a^{*}_{t+1}$, the equation becomes:

$$= \int_{s_{t+1}} \Big(R_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\Big)\, p(s_{t+1} \mid s_t, a_t)\, ds_{t+1}$$
From the final equation, we can now sample a single transition to form the TD target:
$$R_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
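Putting the two policies together, here is a minimal tabular Q-learning update sketch; the transition `(s, a, R, s_next)` and the values of `alpha` and `gamma` are illustrative assumptions, not from the notes. The `td_target` variable is exactly the sampled TD target above.

```python
import numpy as np

def q_learning_update(Q, s, a, R, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s_t, a_t) toward the sampled TD target."""
    td_target = R + gamma * np.max(Q[s_next])  # R_t + gamma * max_a Q(s_{t+1}, a)
    Q[s, a] += alpha * (td_target - Q[s, a])   # TD error scaled by learning rate
    return Q
```

Note that the action actually executed in the environment comes from the ϵ-greedy behavior policy, while the `max` in the TD target comes from the greedy target policy; this mismatch is exactly what makes Q-learning off-policy.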