5.1 2013 DQN paper review

Tommy Kim · September 19, 2023

Q-learning review

$$Q(s_t,a_t) = \int\limits_{s_{t+1},a_{t+1}} \big(R_t + \gamma Q(s_{t+1}, a_{t+1})\big)\,p(a_{t+1}|s_{t+1})\,p(s_{t+1}|s_t,a_t)\,ds_{t+1}\,da_{t+1}$$

The action value function involves the target policy and the transition pdf. We draw $N$ samples of the two random variables $(s_{t+1}, a_{t+1})$, sum the bracketed term over those samples, and divide by $N$. As $N$ increases, the law of large numbers says this sample average converges to the true $Q$ value, and this process is Q-learning.
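As a quick illustration of that sample average (a minimal sketch, not the paper's code; `sample_transition` and the table `Q` are hypothetical stand-ins):

```python
gamma = 0.99

def mc_estimate(Q, sample_transition, s_t, a_t, N=1000):
    """Estimate Q(s_t, a_t) as a sample average of R_t + gamma * Q(s_{t+1}, a_{t+1})."""
    total = 0.0
    for _ in range(N):
        # Draw s_{t+1} ~ p(.|s_t, a_t) and a_{t+1} from the target policy.
        r_t, s_next, a_next = sample_transition(s_t, a_t)
        total += r_t + gamma * Q[s_next, a_next]
    # By the law of large numbers, this average approaches the expectation as N grows.
    return total / N
```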

The target policy is the policy that maximizes the next action value $Q(s_{t+1},a_{t+1})$, so its pdf is a delta function:
$$\delta(a_{t+1}-a_{t+1}^*), \qquad a_{t+1}^* \triangleq \arg\max\limits_{a_{t+1}} Q(s_{t+1},a_{t+1})$$
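To spell out the intermediate step: integrating against this delta simply evaluates the integrand at $a_{t+1}^*$ (the sifting property), which turns the expectation over actions into a max:

$$\int \big(R_t + \gamma Q(s_{t+1},a_{t+1})\big)\,\delta(a_{t+1}-a_{t+1}^*)\,da_{t+1} = R_t + \gamma Q(s_{t+1},a_{t+1}^*) = R_t + \gamma \max\limits_{a_{t+1}} Q(s_{t+1},a_{t+1})$$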
And the action value function becomes:

$$\begin{aligned} Q(s_t,a_t) &= \int\limits_{s_{t+1}}\big(R_t + \gamma \max\limits_{a_{t+1}}Q(s_{t+1}, a_{t+1})\big)\,p(s_{t+1}|s_t,a_t)\,ds_{t+1}\\ &\approx \frac{1}{N} \sum\limits_{i=1}^N \big(R_t + \gamma \max\limits_{a_{t+1}}Q(s_{t+1}^{(i)}, a_{t+1})\big) \end{aligned}$$

We have also learned another way to write this: express it in terms of the previous $Q$ (the sum of the first $N-1$ terms), which gives the incremental Monte Carlo method.
Incremental MC: $Q(s_t,a_t) \leftarrow (1-\alpha)Q(s_t,a_t) + \alpha \big(R_t + \gamma\max\limits_{a_{t+1}} Q(s_{t+1},a_{t+1})\big)$
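In code, that incremental update is a single line on a Q-table (a minimal tabular sketch; the state/action counts and the transition variables are my own assumptions for illustration):

```python
import numpy as np

n_states, n_actions = 10, 2      # assumed sizes, just for illustration
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def incremental_update(s_t, a_t, r_t, s_next):
    # TD-target: R_t + gamma * max_{a'} Q(s_{t+1}, a')
    td_target = r_t + gamma * Q[s_next].max()
    # Blend the old estimate with the new sample: (1 - alpha) * old + alpha * target
    Q[s_t, a_t] = (1 - alpha) * Q[s_t, a_t] + alpha * td_target
```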

Regression with DNN

Simple review of regression

We regress $Q$ with a deep neural network (DNN). Regression means finding a model (function) that explains the relationship between inputs and labels (outputs). It comes with a loss function, and we use gradient descent to minimize that loss.
The regression function is defined by its weights ($w_1 x_1 + w_2 x_2 + \dots$ if it is linear), and we update these weights with gradient descent: $W \leftarrow W - \alpha G_{ra}$, where $W$ is the weight vector, $\alpha$ is the learning rate, and $G_{ra}$ is the gradient of the loss with respect to the weights (a small sketch follows below).
We then represent the action value with weights: $Q_w$.
The samples used to evaluate the error will be the TD-target $R_t + \gamma\max\limits_{a_{t+1}} Q(s_{t+1},a_{t+1})$.
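Here is the sketch promised above: plain gradient descent on a linear regression, just to make the update $W \leftarrow W - \alpha G_{ra}$ concrete (the synthetic dataset and the linear model are illustrative assumptions, not part of DQN):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # labels

W = np.zeros(3)
alpha = 0.1                                   # learning rate
for _ in range(500):
    pred = X @ W
    grad = 2 * X.T @ (pred - y) / len(y)      # gradient of the MSE loss w.r.t. W
    W -= alpha * grad                         # W <- W - alpha * G_ra
```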

Why regression?

DQN is designed for Atari games such as Breakout. Because the data from these games is continuous (raw pixel frames), the agent will come across an enormous number of states. Unlike search problems, where the objective is to find the shortest path from start to goal, we cannot train the agent on every possible situation in these games. Therefore, we employ regression: it ensures that the agent acts sensibly even when faced with previously unencountered states!

DQN = Deep Q Network = Q-learning + DNN

DQN (2013) by DeepMind Technologies

DQN employs a CNN model. It takes pixels as input and produces the action value function as output. To produce an output for $Q$, the number of neurons in the final layer is set to match the number of possible actions. For instance, in the Breakout game the agent can move either left or right, so the final layer would have two neurons representing $Q(s_t,\text{left})$ and $Q(s_t,\text{right})$.
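A minimal PyTorch sketch of this idea (the layer sizes loosely follow the 2013 paper's description, but treat the exact numbers as my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """State (stacked frames) in, one Q-value per action out."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc1 = nn.Linear(32 * 9 * 9, 256)    # for 84x84 inputs
        self.out = nn.Linear(256, n_actions)     # one output neuron per action

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return self.out(x)                       # shape: [batch, n_actions]

q_net = QNetwork(n_actions=2)                    # e.g. Breakout: left / right
```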

Why do we employ this form?
We obtain samples from the TD-target $R_t + \gamma \max Q'$, where $Q'$ is the action value function of the next state, so the maximum of $Q'$ is required. If we used $(s_t, a_t)$ as the network input, every action would have to be passed through the network separately to find this maximum, leading to extensive computation, which would be highly inefficient. Hence, instead of using both the state and the action as inputs, we use the image (representing the state) as the input and read the Q-values of all actions from the network's output. We then compare these outputs and select the maximum. This approach is computationally much more efficient.
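Continuing the sketch above, a single forward pass yields every action's Q-value, so the max is just a cheap reduction over the output vector (this assumes `q_net` from the earlier sketch):

```python
import torch

state_batch = torch.randn(32, 4, 84, 84)     # dummy batch of preprocessed frames
q_values = q_net(state_batch)                # one forward pass -> [32, n_actions]
max_q, greedy_a = q_values.max(dim=1)        # max_{a'} Q(s', a') and its argmax
```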

In contrast to typical neural networks, DQN cares about only a single output. Once we choose an action $a_t$, only the output neuron corresponding to that action contributes to the error; the errors of the other outputs are treated as zero, so only the chosen action's Q-value is regressed toward the TD-target.
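A sketch of how that single-output loss can be written (my own PyTorch notation, not the paper's code; only the Q-value of the action actually taken enters the squared TD error, and the TD-target is treated as a fixed regression label):

```python
import torch
import torch.nn.functional as F

gamma = 0.99

def td_loss(q_net, states, actions, rewards, next_states, dones):
    q_all = q_net(states)                                        # [batch, n_actions]
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t) only
    with torch.no_grad():                                        # target is a fixed label
        next_max = q_net(next_states).max(dim=1).values
        target = rewards + gamma * next_max * (1 - dones)
    return F.mse_loss(q_taken, target)
```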
