5.1 2013 DQN paper review

Tommy Kim · September 19, 2023

Q-learning review

$$Q(s_t,a_t) = \int\limits_{s_{t+1},a_{t+1}} \big(R_t + \gamma Q(s_{t+1}, a_{t+1})\big)\,p(a_{t+1}|s_{t+1})\,p(s_{t+1}|s_t,a_t)\,ds_{t+1}\,da_{t+1}$$

The action value function involves the target policy and the transition pdf. We draw $N$ samples of the two random variables $(s_{t+1}, a_{t+1})$, sum the bracketed term over those samples, and divide by $N$. As $N$ increases, the law of large numbers says this sample average converges to the true $Q$ value, and this process is Q-learning.
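As a quick illustration of that sample average (a minimal sketch, not the paper's code; `sample_transition` and the table `Q` are hypothetical stand-ins):

```python
gamma = 0.99

def mc_estimate(Q, sample_transition, s_t, a_t, N=1000):
    """Estimate Q(s_t, a_t) as a sample average of R_t + gamma * Q(s_{t+1}, a_{t+1})."""
    total = 0.0
    for _ in range(N):
        # Draw s_{t+1} ~ p(.|s_t, a_t) and a_{t+1} from the target policy.
        r_t, s_next, a_next = sample_transition(s_t, a_t)
        total += r_t + gamma * Q[s_next, a_next]
    # By the law of large numbers, this average approaches the expectation as N grows.
    return total / N
```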

The target policy is the policy that maximizes the next action value $Q(s_{t+1},a_{t+1})$, so its pdf is a delta function:
$$\delta(a_{t+1}-a_{t+1}^*), \qquad a_{t+1}^* \triangleq \arg\max\limits_{a_{t+1}} Q(s_{t+1},a_{t+1})$$
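To spell out the intermediate step: integrating against this delta simply evaluates the integrand at $a_{t+1}^*$ (the sifting property), which turns the expectation over actions into a max:

$$\int \big(R_t + \gamma Q(s_{t+1},a_{t+1})\big)\,\delta(a_{t+1}-a_{t+1}^*)\,da_{t+1} = R_t + \gamma Q(s_{t+1},a_{t+1}^*) = R_t + \gamma \max\limits_{a_{t+1}} Q(s_{t+1},a_{t+1})$$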
And the action value function becomes:

$$\begin{aligned} Q(s_t,a_t) &= \int\limits_{s_{t+1}}\big(R_t + \gamma \max\limits_{a_{t+1}}Q(s_{t+1}, a_{t+1})\big)\,p(s_{t+1}|s_t,a_t)\,ds_{t+1}\\ &\approx \frac{1}{N} \sum\limits_{i=1}^N \big(R_t + \gamma \max\limits_{a_{t+1}}Q(s_{t+1}^{(i)}, a_{t+1})\big) \end{aligned}$$

We have also learned another way to write this: express it in terms of the previous $Q$ (the sum of the first $N-1$ terms), which gives the incremental Monte Carlo method.
Incremental MC: $Q(s_t,a_t) \leftarrow (1-\alpha)Q(s_t,a_t) + \alpha \big(R_t + \gamma\max\limits_{a_{t+1}} Q(s_{t+1},a_{t+1})\big)$
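In code, that incremental update is a single line on a Q-table (a minimal tabular sketch; the state/action counts and the transition variables are my own assumptions for illustration):

```python
import numpy as np

n_states, n_actions = 10, 2      # assumed sizes, just for illustration
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def incremental_update(s_t, a_t, r_t, s_next):
    # TD-target: R_t + gamma * max_{a'} Q(s_{t+1}, a')
    td_target = r_t + gamma * Q[s_next].max()
    # Blend the old estimate with the new sample: (1 - alpha) * old + alpha * target
    Q[s_t, a_t] = (1 - alpha) * Q[s_t, a_t] + alpha * td_target
```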

Regression with DNN

Simple review of regression

We regress $Q$ with a deep neural network (DNN). Regression means finding a model (function) that explains the relationship between inputs and labels (outputs). It comes with a loss function, and we use gradient descent to minimize that loss.
The regression function is defined by its weights ($w_1 x_1 + w_2 x_2 + \dots$ if it is linear), and we update these weights with gradient descent: $W \leftarrow W - \alpha G_{ra}$, where $W$ is the weight vector, $\alpha$ is the learning rate, and $G_{ra}$ is the gradient of the loss with respect to the weights (a small sketch follows below).
We then represent the action value with weights: $Q_w$.
The samples used to evaluate the error will be the TD-target $R_t + \gamma\max\limits_{a_{t+1}} Q(s_{t+1},a_{t+1})$.
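Here is the sketch promised above: plain gradient descent on a linear regression, just to make the update $W \leftarrow W - \alpha G_{ra}$ concrete (the synthetic dataset and the linear model are illustrative assumptions, not part of DQN):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # labels

W = np.zeros(3)
alpha = 0.1                                   # learning rate
for _ in range(500):
    pred = X @ W
    grad = 2 * X.T @ (pred - y) / len(y)      # gradient of the MSE loss w.r.t. W
    W -= alpha * grad                         # W <- W - alpha * G_ra
```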

Why regression?

DQN is designed for Atari games such as Breakout. Because the data from these games is continuous (raw pixel frames), the agent will come across an enormous number of states. Unlike search problems, where the objective is to find the shortest path from start to goal, we cannot train the agent on every possible situation in these games. Therefore, we employ regression: it ensures that the agent acts sensibly even when faced with previously unencountered states!

DQN = Deep Q Network = Q-learning + DNN

DQN (2013) by DeepMind Technologies

DQN employs a CNN model. It takes pixels as input and produces the action value function as output. To produce an output for $Q$, the number of neurons in the final layer is set to match the number of possible actions. For instance, in the Breakout game the agent can move either left or right, so the final layer would have two neurons representing $Q(s_t,\text{left})$ and $Q(s_t,\text{right})$.
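A minimal PyTorch sketch of this idea (the layer sizes loosely follow the 2013 paper's description, but treat the exact numbers as my assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """State (stacked frames) in, one Q-value per action out."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc1 = nn.Linear(32 * 9 * 9, 256)    # for 84x84 inputs
        self.out = nn.Linear(256, n_actions)     # one output neuron per action

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return self.out(x)                       # shape: [batch, n_actions]

q_net = QNetwork(n_actions=2)                    # e.g. Breakout: left / right
```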

Why do we employ this form?
We obtain samples from the TD-target $R_t + \gamma \max Q'$, where $Q'$ is the action value function of the next state, so the maximum of $Q'$ is required. If we used $(s_t, a_t)$ as the network input, every action would have to be passed through the network separately to find this maximum, leading to extensive computation, which would be highly inefficient. Hence, instead of using both the state and the action as inputs, we use the image (representing the state) as the input and read the Q-values of all actions from the network's output. We then compare these outputs and select the maximum. This approach is computationally much more efficient.
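Continuing the sketch above, a single forward pass yields every action's Q-value, so the max is just a cheap reduction over the output vector (this assumes `q_net` from the earlier sketch):

```python
import torch

state_batch = torch.randn(32, 4, 84, 84)     # dummy batch of preprocessed frames
q_values = q_net(state_batch)                # one forward pass -> [32, n_actions]
max_q, greedy_a = q_values.max(dim=1)        # max_{a'} Q(s', a') and its argmax
```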

In contrast to typical neural networks, DQN cares about only a single output. Once we choose an action $a_t$, only the output neuron corresponding to that action contributes to the error; the errors of the other outputs are treated as zero, so only the chosen action's Q-value is regressed toward the TD-target.
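A sketch of how that single-output loss can be written (my own PyTorch notation, not the paper's code; only the Q-value of the action actually taken enters the squared TD error, and the TD-target is treated as a fixed regression label):

```python
import torch
import torch.nn.functional as F

gamma = 0.99

def td_loss(q_net, states, actions, rewards, next_states, dones):
    q_all = q_net(states)                                        # [batch, n_actions]
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t) only
    with torch.no_grad():                                        # target is a fixed label
        next_max = q_net(next_states).max(dim=1).values
        target = rewards + gamma * next_max * (1 - dones)
    return F.mse_loss(q_taken, target)
```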
