Date: 2017
Venue: NIPS (conference)
RNNs, in particular LSTMs and GRUs, have been firmly established as the state of the art in sequence modeling and transduction problems such as language modeling and machine translation
Recurrent models are inherently sequential, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples
Many solutions (factorization tricks, conditional computation) have improved computational efficiency, and in the latter case also model performance
But the fundamental constraint of sequential computation remains
To reduce sequential computation, several CNN-based models have been proposed (e.g., Extended Neural GPU, ByteNet, ConvS2S)
These models compute hidden representations in parallel for all input and output positions
But the number of operations required to relate signals between two arbitrary input or output positions grows with the distance between them, making it more difficult to learn dependencies between distant positions
In the Transformer, this is reduced to a constant number of operations
This comes at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect counteracted with Multi-Head Attention
Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence
Most competitive neural sequence transduction models have an encoder-decoder structure
Encoder - maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n)
Decoder - given z, generates an output sequence (y_1, ..., y_m) of symbols one element at a time
The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder
Encoder - composed of a stack of N = 6 identical layers
Each layer has two sub-layers - a multi-head self-attention mechanism and a position-wise fully connected feed-forward network
Each sub-layer has a residual connection around it, followed by layer normalization, i.e. LayerNorm(x + Sublayer(x))
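A minimal PyTorch sketch of the residual + layer-norm wrapper just described; the class name SublayerConnection and the default d_model = 512 are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Applies LayerNorm(x + Sublayer(x)) around any sub-layer."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # Residual connection around the sub-layer, followed by layer normalization.
        return self.norm(x + sublayer(x))
```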
Decoder - also composed of a stack of N = 6 identical layers
Each layer has three sub-layers - in addition to the two sub-layers of the encoder layer, a third sub-layer performs multi-head attention over the encoder output
Each sub-layer has a residual connection and layer normalization, as in the encoder
The self-attention sub-layer is modified (masked) so that predictions for a position can only depend on the known outputs at positions before it
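A small sketch of the decoder's causal mask, assuming a boolean-mask convention where True marks positions that may be attended to; the helper name causal_mask is illustrative.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Position i may attend only to positions <= i (lower triangle).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Example for a 4-token target sequence:
# causal_mask(4) ->
# [[ True, False, False, False],
#  [ True,  True, False, False],
#  [ True,  True,  True, False],
#  [ True,  True,  True,  True]]
```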
An attention function can be described as mapping a query and a set of key-value pairs to an output
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key
In scaled dot-product attention, the weights on the values (of dimension d_v) are obtained by computing the dot products of the queries with the keys (of dimension d_k), dividing each by sqrt(d_k), and applying a softmax function - Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
The two most commonly used attention functions are additive attention and dot-product attention
The two are similar in theoretical complexity, but dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code
For small values of d_k the two perform similarly, but for larger d_k additive attention outperforms dot-product attention without scaling
To counteract this, the dot products are scaled by 1/sqrt(d_k)
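A minimal sketch of scaled dot-product attention as defined above; the optional mask argument is an assumption added so the same function can also serve the masked decoder self-attention.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    # Compatibility scores, scaled by 1/sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get -inf so softmax assigns them zero weight.
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```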
Linearly projecting the queries, keys, and values h times with different learned linear projections to d_k, d_k, and d_v dimensions respectively is more beneficial than a single attention function with d_model-dimensional keys, values, and queries
On each projected version of the queries, keys, and values, the attention function is performed in parallel, yielding d_v-dimensional output values, which are then concatenated and projected again
Because of the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality
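A rough PyTorch sketch of multi-head attention using the paper's h = 8 heads and d_k = d_v = d_model / h = 64; layer names and shapes are illustrative, not the reference implementation.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)  # learned projections for queries,
        self.w_k = nn.Linear(d_model, d_model)  # keys and values (all heads at once)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # final output projection

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # Project and split into heads: (batch, seq, d_model) -> (batch, h, seq, d_k)
        q = self.w_q(q).view(b, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(b, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(b, -1, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention performed in parallel over all heads.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v
        # Concatenate heads and project back to d_model.
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)
```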
The Transformer uses multi-head attention in three different ways - encoder-decoder attention, self-attention in the encoder, and masked self-attention in the decoder
In addition to the attention sub-layers, each layer in the encoder and decoder contains a fully connected feed-forward network - consisting of two linear transformations with a ReLU activation in between - applied to each position separately and identically
The linear transformations are the same across different positions, but use different parameters from layer to layer
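A sketch of the position-wise feed-forward network, FFN(x) = max(0, x W1 + b1) W2 + b2, with the paper's d_model = 512 and inner dimension d_ff = 2048; the class name is illustrative.

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # first linear transformation
            nn.ReLU(),                  # ReLU activation in between
            nn.Linear(d_ff, d_model),   # second linear transformation
        )

    def forward(self, x):
        # nn.Linear acts on the last dimension, so the same transformation
        # is applied to each position separately and identically.
        return self.net(x)
```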
Similar to other sequence transduction models, learned embeddings are used to convert the input tokens and output tokens to vectors of dimension d_model
A learned linear transformation and a softmax function are used to convert the decoder output into predicted next-token probabilities
In this model, the same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation - in the embedding layers, the weights are multiplied by sqrt(d_model)
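A sketch of sharing one weight matrix between the embedding layers and the pre-softmax linear transformation, including the sqrt(d_model) scaling; vocab_size and the method names are placeholders.

```python
import math
import torch
import torch.nn as nn

class SharedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Embedding lookup, multiplied by sqrt(d_model).
        return self.embed(tokens) * math.sqrt(self.d_model)

    def logits(self, decoder_out: torch.Tensor) -> torch.Tensor:
        # Pre-softmax linear transformation reusing the same weight matrix.
        return decoder_out @ self.embed.weight.t()
```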
As the model contains no recurrence and no convolution, positional information must be injected
"Positional encodings" are added to the input embeddings at the bottoms of the encoder and decoder stacks; they have the same dimension d_model as the embeddings
Sinusoidal positional encodings were chosen because they may make it easy for the model to learn to attend by relative positions, since for any fixed offset k, PE(pos + k) can be represented as a linear function of PE(pos)
Compared to learned positional embeddings, this produced nearly identical results but may allow the model to extrapolate to sequence lengths longer than those seen during training
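A sketch of the sinusoidal positional encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name is illustrative.

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int = 512) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

# Usage: x = token_embeddings + positional_encoding(seq_len, d_model)
```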
A key factor in the ability to learn long-range dependencies is the length of the paths forward and backward signals have to traverse in the network
The shorter these paths, the easier it is to learn long range dependencies
A self-attention layer connects all positions with a constant number O(1) of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations
In terms of computational complexity per layer, self-attention layers - O(n^2 * d) - are faster than recurrent layers - O(n * d^2) - when the sequence length n is smaller than the representation dimensionality d
For very long sequences, restricting self-attention to a neighborhood of size r around each output position could be considered, increasing the maximum path length to O(n/r) - left for future work
A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions
Doing so requires a stack of O(n/k) convolutional layers (or O(log_k(n)) with dilated convolutions), which increases the length of the longest paths between any two positions in the network
By using separable convolutions, the complexity can be decreased considerably, to O(k * n * d + n * d^2)
Even with k = n, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach taken in the Transformer
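A back-of-the-envelope comparison of the per-layer operation counts; the concrete values n = 60, d = 512, k = 3 are illustrative assumptions, only the asymptotic terms come from the paper.

```python
# Per-layer operation counts for illustrative sizes (not from the paper).
n, d, k = 60, 512, 3

self_attention = n * n * d        # O(n^2 * d)
recurrent      = n * d * d        # O(n * d^2)
convolutional  = k * n * d * d    # O(k * n * d^2)

print(f"self-attention : {self_attention:>12,}")   # ~1.8M
print(f"recurrent      : {recurrent:>12,}")        # ~15.7M
print(f"convolutional  : {convolutional:>12,}")    # ~47.2M
```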