The action-value function is an expectation over the target policy and the transition pdf:

$$Q_\pi(s,a) = \mathbb{E}_{s' \sim p(s'|s,a),\; a' \sim \pi(a'|s')}\!\left[\, r + \gamma\, Q_\pi(s', a') \,\right]$$

We approximate this expectation by drawing $N$ samples of the two random variables $(s', a')$, summing, and dividing by $N$. As $N$ increases, the law of large numbers tells us the sample average converges to the true Q value; this sampling-based estimation is the idea behind Q-learning.
The target policy is the greedy policy that maximizes the next $Q(s', a')$, so its pdf is a delta function:

$$\pi(a' \mid s') = \delta\!\left(a' - \operatorname*{argmax}_{a'} Q(s', a')\right)$$
And the action-value function becomes:

$$Q(s,a) = \mathbb{E}_{s' \sim p(s'|s,a)}\!\left[\, r + \gamma \max_{a'} Q(s', a') \,\right]$$
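As a concrete illustration (not from the original text), here is a minimal NumPy sketch of that sampled average for a single $(s, a)$ pair. The toy transition pdf, reward noise, and `Q_table` are assumptions made up purely for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Made-up environment: from the chosen (s, a) we land in one of three
# next states with fixed probabilities and receive a noisy reward.
next_states = np.array([0, 1, 2])
probs = np.array([0.2, 0.5, 0.3])
Q_table = np.array([[1.0, 2.0],   # current Q estimates, Q_table[s'][a']
                    [0.5, 1.5],
                    [3.0, 0.0]])

def sample_transition():
    """Sample (reward, next_state) from the toy transition pdf."""
    s_next = rng.choice(next_states, p=probs)
    reward = 1.0 + rng.normal(scale=0.1)
    return reward, s_next

# Monte Carlo estimate: average N samples of r + gamma * max_a' Q(s', a').
for N in (10, 100, 10_000):
    samples = []
    for _ in range(N):
        r, s_next = sample_transition()
        samples.append(r + gamma * Q_table[s_next].max())  # greedy target policy
    print(N, np.mean(samples))  # approaches the true expectation as N grows
```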
We have also learned another expression: the average over $N$ samples can be written in terms of the previous average over $N-1$ samples, which is the incremental Monte Carlo method.
Incremental MC:

$$Q_N = \frac{1}{N}\sum_{i=1}^{N} G_i = Q_{N-1} + \frac{1}{N}\big(G_N - Q_{N-1}\big)$$

where $G_i$ denotes the $i$-th sample.
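A small sketch of this identity (the samples here are just illustrative draws of a random variable), showing that the incremental update reproduces the plain sample average:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(loc=5.0, scale=2.0, size=1000)

# Incremental Monte Carlo: fold each new sample into the running estimate.
Q = 0.0
for N, G in enumerate(samples, start=1):
    Q = Q + (1.0 / N) * (G - Q)   # Q_N = Q_{N-1} + (1/N) * (G_N - Q_{N-1})

print(Q, samples.mean())  # identical up to floating-point error
```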
We regress Q with a deep neural network (DNN). Regression means finding a model (a function) that explains the relationship between inputs and labels (outputs). It comes with a loss function, and we use gradient descent to minimize that loss.
The regression function is defined by its weights ($f(x) = W^\top x$ if it is linear). We update these weights with gradient descent:

$$W \leftarrow W - \alpha \nabla_W L(W)$$

where $W$ is the weight vector, $\alpha$ is the learning rate, and $\nabla_W L(W)$ is the gradient of the loss.
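For concreteness, a minimal sketch of this weight update on a made-up linear-regression problem (the data, learning rate, and number of steps are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data: y = 3*x1 - 2*x2 + noise.
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=200)

W = np.zeros(2)       # weight vector
alpha = 0.1           # learning rate

for _ in range(500):
    error = X @ W - y                      # prediction error
    grad = 2.0 * X.T @ error / len(y)      # gradient of the mean squared error
    W = W - alpha * grad                   # W <- W - alpha * grad L(W)

print(W)  # close to [3, -2]
```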
We then represent the action value with weights: $Q(s, a; W)$.
The samples used to evaluate the error (the regression labels) are the TD targets $r + \gamma \max_{a'} Q(s', a'; W)$.
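A sketch of one such update with linear function approximation. The feature map `features`, the states, and the step sizes are hypothetical stand-ins, and the target is treated as a constant (a semi-gradient step), which is the usual convention:

```python
import numpy as np

gamma = 0.99
alpha = 0.01

def features(state, action, dim=8):
    """Hypothetical feature map phi(s, a); a stand-in for a real encoder."""
    rng = np.random.default_rng(hash((state, action)) % (2**32))
    return rng.normal(size=dim)

W = np.zeros(8)

def q_value(state, action):
    return features(state, action) @ W   # Q(s, a; W), linear in W

def td_update(s, a, r, s_next, actions):
    # Regression label: the TD target r + gamma * max_a' Q(s', a'; W).
    target = r + gamma * max(q_value(s_next, a2) for a2 in actions)
    prediction = q_value(s, a)
    # Semi-gradient step on the squared error (target treated as a constant).
    grad = (prediction - target) * features(s, a)
    return W - alpha * grad

W = td_update(s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
print(W)
```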
DQN is designed for Atari games, such as Breakout. Because the observations are raw pixel frames, the agent will come across an enormous number of states. Unlike search problems, where the objective is to find the shortest path from start to goal, we can't train the agent on every possible situation in these games. Therefore, we employ regression. This ensures that the agent acts sensibly even when faced with previously unencountered states!
DQN = Deep Q Network = Q-learning + DNN
DQN employs a CNN model. It takes pixels as input and produces the action-value function as output. To generate an output $Q(s, a; W)$ for every action $a$, the number of neurons in the final layer is set to match the number of possible actions. For instance, in the Breakout game the agent can either move left or right. Consequently, the final layer would have two neurons representing $Q(s, \text{left})$ and $Q(s, \text{right})$.
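A minimal PyTorch sketch of such a network. The layer sizes and the 4-frame 84x84 input are illustrative assumptions, not necessarily the exact architecture of the original DQN:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """CNN that maps a stack of game frames to one Q-value per action."""

    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),   # one output neuron per action
        )

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        return self.net(pixels)          # shape: (batch, n_actions)

q_net = QNetwork(n_actions=2)            # e.g. left / right, as in the text
frames = torch.zeros(1, 4, 84, 84)       # a batch of one 4-frame 84x84 input
print(q_net(frames).shape)               # torch.Size([1, 2])
```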
Why do we employ this form?
We obtain samples from the TD target given by $r + \gamma \max_{a'} Q(s', a'; W)$, where $Q(s', a'; W)$ is the next action value. The maximum over $a'$ is required. If we used $(s, a)$ pairs as inputs, every action would have to be processed through the neural network separately, leading to extensive computation, which would be highly inefficient. Hence, instead of using both the state and the action as inputs, we use only the image (representing the state) as the input and obtain the Q-values of all actions from the network's output. We then compare these outputs and select the maximum. This approach is computationally much more efficient.
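A sketch of how the max then reduces to a single forward pass. The simple `q_net` here is a hypothetical stand-in for the CNN above, and the reward and discount values are made up:

```python
import torch
import torch.nn as nn

n_actions = 2
# Stand-in for the CNN above: any module mapping a state image to n_actions Q-values.
q_net = nn.Sequential(nn.Flatten(), nn.Linear(4 * 84 * 84, n_actions))
state_next = torch.zeros(1, 4, 84, 84)     # the next-state image s'

# One forward pass gives Q(s', a'; W) for every action at once; the max over
# actions is just a max over the output vector, with no per-action network calls.
with torch.no_grad():
    q_next = q_net(state_next)             # shape: (1, n_actions)
    max_q_next, _ = q_next.max(dim=1)      # max_{a'} Q(s', a'; W)

reward, gamma = 1.0, 0.99
td_target = reward + gamma * max_q_next    # the regression label (TD target)
print(td_target)
```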
In contrast to an ordinary supervised network, DQN cares about only a single output at a time. Once we choose an action, only the output neuron for that action is compared against the TD target; the errors of the other outputs are ignored in the update.
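A sketch of that loss computation. The stand-in `q_net`, the batch of states, the chosen actions, and the precomputed TD targets are all illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_actions = 2
q_net = nn.Sequential(nn.Flatten(), nn.Linear(4 * 84 * 84, n_actions))

# Made-up batch of transitions: states, chosen actions, precomputed TD targets.
states = torch.zeros(3, 4, 84, 84)
actions = torch.tensor([0, 1, 0])           # the actions actually taken
td_targets = torch.tensor([1.5, 0.7, 2.0])  # r + gamma * max_a' Q(s', a'; W)

q_all = q_net(states)                                        # Q-values for every action
q_chosen = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)  # only the chosen action's output
loss = F.mse_loss(q_chosen, td_targets)      # the other outputs contribute nothing
loss.backward()                              # gradients flow only through q_chosen
```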