RNNs are built around the idea of processing sequential information. The term "recurrent" applies because they perform the same computation over each element of the sequence, with the output at each step depending on the previous computations and results. Typically, a fixed-size vector is produced to represent a sequence by feeding its tokens one by one to a recurrent unit. In this sense, RNNs have "memory" over previous computations and use this information in the current processing step.

$x_t$: the input to the network at time step $t$
$h_t$: the hidden state at time step $t$
Calculation of $h_t$ is based on the equation:

$$h_t = f(U x_t + W h_{t-1})$$

-> $h_t$ is calculated based on the current input $x_t$ and the previous time step's hidden state $h_{t-1}$
-> $h_t$ is considered the network's memory element, accumulating information from previous time steps
The function $f$: a non-linear transformation such as $\tanh$ or ReLU
$U$, $W$: weights that are shared across time
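As a minimal sketch of this update in NumPy (the dimensions, the random weights, and the choice of $\tanh$ for $f$ are all illustrative assumptions, not prescribed here):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W):
    # New hidden state from the current input and the previous
    # hidden state; tanh is one common choice for f.
    return np.tanh(U @ x_t + W @ h_prev)

rng = np.random.default_rng(0)
U = rng.normal(size=(3, 4))          # input-to-hidden weights (shared across time)
W = rng.normal(size=(3, 3))          # hidden-to-hidden weights (shared across time)

h = np.zeros(3)                      # initial hidden state (the "memory")
for x_t in rng.normal(size=(5, 4)):  # a toy sequence of 5 input vectors
    h = rnn_step(x_t, h, U, W)
# h is now a fixed-size vector summarizing the whole sequence.
```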
Network Architecture
Let the network have $n$ units, with $m$ external input lines.
Let $\mathbf{y}(t)$ denote the $n$-tuple of outputs of the units in the network at time $t$.
Let $\mathbf{x}(t)$ denote the $m$-tuple of external input signals to the network at time $t$.
We also define $\mathbf{z}(t)$ to be the $(m+n)$-tuple obtained by concatenating $\mathbf{x}(t)$ and $\mathbf{y}(t)$ in some convenient fashion.
Let $U$ denote the set of indices $k$ such that $z_k$, the $k$th component of $\mathbf{z}$, is the output of a unit in the network.
Let $I$ denote the set of indices $k$ for which $z_k$ is an external input.
Furthermore, we assume that the indices on $\mathbf{y}$ and $\mathbf{x}$ are chosen to correspond to those of $\mathbf{z}$, so that

$$z_k(t) = \begin{cases} x_k(t) & \text{if } k \in I \\ y_k(t) & \text{if } k \in U \end{cases}$$
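As a small sketch of one possible indexing convention (inputs first, then unit outputs; `I_idx` and `U_idx` stand in for the index sets $I$ and $U$, and all sizes are hypothetical):

```python
import numpy as np

m, n = 2, 3                        # m external input lines, n units
x_t = np.array([0.5, -1.0])        # external inputs at time t
y_t = np.zeros(n)                  # unit outputs at time t

# One "convenient fashion": external inputs first, then unit outputs.
z_t = np.concatenate([x_t, y_t])   # the (m+n)-tuple z(t)
I_idx = list(range(0, m))          # indices k with z_k an external input
U_idx = list(range(m, m + n))      # indices k with z_k a unit output
```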
Let $\mathbf{W}$ denote the weight matrix for the network, with a unique weight between every pair of units and also from each input line to each unit.
The element $w_{ij}$ represents the weight on the connection to the $i$th unit from either the $j$th unit, if $j \in U$, or the $j$th input line, if $j \in I$.
Furthermore, note that to accommodate a bias for each unit we simply include among the input lines one input whose value is always 1; the corresponding column of the weight matrix contains as its $k$th element the bias for unit $k$. In general, our naming convention dictates that we regard the weight $w_{ij}$ as having $z_j$ as its "presynaptic" signal and $y_i$ as its "postsynaptic" signal.


For each $k \in U$, the intermediate variable $s_k(t)$ represents the net input to the $k$th unit at time $t$. Its value at time $t$ is computed in terms of both the state of and input to the network at time $t$ by

$$s_k(t) = \sum_{l \in U} w_{kl}\, y_l(t) + \sum_{l \in I} w_{kl}\, x_l(t) = \sum_{l \in U \cup I} w_{kl}\, z_l(t)$$

The longer form clarifies how the unit outputs and the external inputs are both used in the computation, while the more compact expression illustrates why we introduced $\mathbf{z}$ and the corresponding indexing convention above.
The output of such a unit at time $t+1$ is then expressed in terms of the net input by

$$y_k(t+1) = f_k(s_k(t))$$

where $f_k$ is the unit's squashing function.
In those cases where a specific assumption about these squashing functions is required, it will be assumed that all units use the differentiable logistic function $f_k(s_k) = \frac{1}{1 + e^{-s_k}}$.
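A minimal sketch pulling these pieces together: one time step computes the net inputs $s_k(t)$ as a matrix-vector product over the concatenated tuple and squashes them with the logistic function. The bias is handled as the always-1 input line described above; all names and sizes are illustrative assumptions.

```python
import numpy as np

def logistic(s):
    # The assumed squashing function, applied elementwise.
    return 1.0 / (1.0 + np.exp(-s))

def network_step(W_mat, x_t, y_t):
    # z(t): external inputs, a constant bias line (always 1), unit outputs.
    z_t = np.concatenate([x_t, [1.0], y_t])
    s_t = W_mat @ z_t               # net inputs: s_k(t) = sum_l w_kl z_l(t)
    return logistic(s_t)            # y_k(t+1) = f_k(s_k(t))

m, n = 2, 3                         # hypothetical sizes
rng = np.random.default_rng(0)
W_mat = rng.normal(scale=0.1, size=(n, m + 1 + n))  # one row per unit
y = np.zeros(n)
y = network_step(W_mat, np.array([0.5, -1.0]), y)   # one time step
```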
Network Performance Measure
Let $T(t)$ denote the set of indices $k \in U$ for which a target value $d_k(t)$ is specified at time $t$, and define the error at each unit by

$$e_k(t) = \begin{cases} d_k(t) - y_k(t) & \text{if } k \in T(t) \\ 0 & \text{otherwise} \end{cases}$$

-> Note that this formulation allows for the possibility that target values are specified for different units at different times. Let

$$J(t) = -\frac{1}{2} \sum_{k \in U} \left[e_k(t)\right]^2$$

denote the negative of the overall network error at time $t$.
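A small sketch of these two definitions (the function and variable names are hypothetical):

```python
import numpy as np

def J(y, d, T_idx):
    # Negative of the overall network error at one time step;
    # T_idx holds the unit indices with specified targets at this step.
    e = np.zeros_like(y)
    e[T_idx] = d[T_idx] - y[T_idx]   # e_k(t) = d_k(t) - y_k(t) for k in T(t)
    return -0.5 * np.sum(e ** 2)     # J(t) = -(1/2) * sum_k e_k(t)^2

# Example: targets specified only for units 0 and 2 at this time step.
y = np.array([0.2, 0.7, 0.9])
d = np.array([0.0, 0.0, 1.0])        # entries outside T_idx are ignored
print(J(y, d, T_idx=[0, 2]))         # -> -0.025
```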
A natural objective of learning might be to maximize the negative of the total error over some appropriate time period $[t_0, t_1]$:

$$J^{\text{total}}(t_0, t_1) = \sum_{\tau = t_0 + 1}^{t_1} J(\tau)$$
One natural way to make the weight changes is along a constant positive multiple of the performance measure gradient, so that

$$\Delta w_{ij} = \alpha\, \frac{\partial J^{\text{total}}(t_0, t_1)}{\partial w_{ij}}$$

for each $i$ and $j$, where $\alpha$ is a positive learning rate parameter.
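As a hedged sketch of this update rule: `J_total_fn` is a hypothetical function mapping a weight matrix to $J^{\text{total}}(t_0, t_1)$, and the gradient is estimated here by finite differences purely for illustration; it is a stand-in for an analytic gradient computation, not the algorithm itself.

```python
import numpy as np

def numerical_grad(J_total_fn, W_mat, eps=1e-6):
    # Finite-difference estimate of dJ_total/dw_ij for every weight.
    # Purely illustrative: it needs two evaluations per weight, whereas
    # a practical algorithm would compute this gradient analytically.
    g = np.zeros_like(W_mat)
    for idx in np.ndindex(*W_mat.shape):
        W_plus, W_minus = W_mat.copy(), W_mat.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        g[idx] = (J_total_fn(W_plus) - J_total_fn(W_minus)) / (2 * eps)
    return g

# Gradient ascent with a positive learning rate alpha:
# W_mat += alpha * numerical_grad(J_total_fn, W_mat)
```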



