Prior recurrent models factor computation along the symbol positions of the input & output sequences
Parallelization within a training example almost impossible (hidden states are computed sequentially)
Hard to handle dependencies across long sequential input (number of operations needed to relate signals from two distant positions grows linearly or logarithmically with distance in earlier architectures)
Based solely on the "Attention Mechanism" (omitting RNN/CNN structure from model)
More parallelizable (faster training & better quality)
Generalizes well to other tasks (even in extreme cases of large/scarce training data)
Encoder
1 layer -> 2 sub-layers -> [multi-head self-attention mechanism] + [position-wise fully connected FFN]
Each sub-layer -> residual connection + layer normalization
Output: LayerNorm(x + Sublayer(x))
Dimension of all sub-layer outputs & embeddings: dmodel = 512
- Sublayer(x) = the function implemented by the sub-layer itself
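A minimal PyTorch sketch of the encoder layer just described, assuming batch-first tensors; the class name `EncoderLayer` and the use of `nn.MultiheadAttention` are my assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention + position-wise FFN,
    each wrapped as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention (Q = K = V = x).
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)      # residual connection + layer norm
        # Sub-layer 2: position-wise feed-forward network.
        return self.norm2(x + self.ffn(x))  # residual connection + layer norm
```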
Decoder
1 layer -> the 2 sub-layers from above + 1 additional sub-layer
Each sub-layer -> residual connection + layer normalization
The additional (third) sub-layer -> multi-head attention over the output of the encoder stack
- Masking is also added to the decoder's self-attention sub-layer to prevent positions from attending to subsequent positions
- Ensures the prediction for a position depends only on the known outputs at earlier positions (see the sketch below)
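A matching sketch of one decoder layer under the same assumptions (not the paper's implementation); the causal mask is built with `torch.triu` so no position attends to later ones:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Decoder layer: masked self-attention, encoder-decoder attention, FFN."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, x, enc_out):
        # Causal mask: True above the diagonal = blocked (no looking ahead).
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        out, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norms[0](x + out)
        # Queries come from the decoder; keys/values from the encoder output.
        out, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norms[1](x + out)
        return self.norms[2](x + self.ffn(x))
```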
Encoder-Decoder Attention Layers
Input: Q (queries from the previous decoder layer), K and V (memory keys and values from the encoder output)
Lets every decoder position attend over all input positions; serves the purpose of the encoder-decoder attention mechanisms in previous mainstream sequence-to-sequence models
Self-Attention Layer in Encoder
Input: K, V, Q all come from the same place, the output of the previous encoder layer
Each position in the encoder can attend to all positions in the previous encoder layer
Self-Attention Layer in Decoder
Each decoder position can attend to all positions in the decoder up to and including the current one
Implemented inside scaled dot-product attention by masking out (setting to -inf) the softmax inputs that correspond to illegal connections, preventing "leftward" information flow
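Concretely, the mask is just -inf written into the score matrix before the softmax; a tiny sketch with made-up dimensions:

```python
import torch

scores = torch.randn(5, 5)   # illustrative (query_pos, key_pos) score matrix
mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # block attention to the future
weights = torch.softmax(scores, dim=-1)           # each row sums over past keys only
```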
What is Attention?
An attention function maps a query and a set of key-value pairs (all vectors) to an output (a vector)
Output = weighted sum of the values, where each value's weight is computed from the compatibility of the query with the corresponding key
The paper builds this in two pieces: scaled dot-product attention and multi-head attention
Scaled Dot-Product Attention
Input: queries and keys of dimension dk, values of dimension dv
Attention(Q, K, V) = softmax(Q·K^T / sqrt(dk)) · V
(dividing by sqrt(dk) keeps large dot products from pushing the softmax into regions with extremely small gradients)
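The formula transcribed directly into PyTorch; the function name and the optional mask argument are my additions:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # query-key dot products
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v           # weighted sum of values
```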
Multi-Head Attention
Linearly project Q, K, and V h times, with different learned linear projections, to dk, dk, and dv dimensions respectively
Improvement over single-head:
Allows the model to jointly attend to information from different representation subspaces at different positions
Projected Q, K, V
-> attention function runs on each projection in parallel
-> each head yields dv-dimensional output values
-> head outputs are concatenated and projected once more
-> final values returned (see the sketch below)
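The projection / parallel-heads / concat pipeline in code, reusing `scaled_dot_product_attention` from the sketch above (h = 8 and dk = dv = dmodel / h = 64, as in the paper; the class layout itself is an assumption):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # Learned projections for Q, K, V plus the final output projection.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        def split(x):
            # (batch, seq, d_model) -> (batch, h, seq, d_k)
            return x.view(b, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        # Attention runs over all h heads in parallel.
        heads = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate the dv-dimensional head outputs and re-project.
        heads = heads.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(heads)
```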
Self-Attention
Connects all positions with a constant number of sequentially executed operations (constant maximum path length)
Faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d (which covers most sentence-level cases)
To improve performance on very long sequences, self-attention could be restricted to a neighborhood of size r around each output position in the input sequence (a sketch of such a mask follows)
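For instance, a band mask implementing such a neighborhood; r and n are illustrative values, and the mask would be passed to the attention function like the causal mask above:

```python
import torch

r, n = 3, 10   # illustrative neighborhood size and sequence length
pos = torch.arange(n)
# True = blocked: query at position i may only attend to keys j with |i - j| <= r.
local_mask = (pos[None, :] - pos[:, None]).abs() > r
```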
Why was Self-Attention chosen for the Transformer?
The motivation is made up of three parts: total computational complexity per layer, amount of computation that can be parallelized (minimum number of sequential operations), and maximum path length between long-range dependency signals
Position-wise Feed-Forward Network (FFN)
I/O dimension (dmodel) = 512
Inner-layer dimension (dff) = 2048
FFN(x) = max(0, x·W1 + b1)·W2 + b2 (two linear transformations with a ReLU between them, applied identically at every position)
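A sketch of that linear-ReLU-linear structure, assuming the paper's dimensions:

```python
import torch.nn as nn

# FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied identically at every position.
ffn = nn.Sequential(
    nn.Linear(512, 2048),   # d_model -> d_ff
    nn.ReLU(),              # max(0, ·)
    nn.Linear(2048, 512),   # d_ff -> d_model
)
```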
Embeddings and Softmax
Learned embeddings convert input and output tokens to vectors of dimension dmodel
A learned linear transformation and softmax convert the decoder output to predicted next-token probabilities
The two embedding layers and the pre-softmax linear transformation share the same weight matrix; in the embedding layers, the weights are multiplied by sqrt(dmodel)
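A sketch of that weight sharing; the vocabulary size is an illustrative assumption:

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000      # vocab_size is an illustrative assumption
embed = nn.Embedding(vocab_size, d_model)
to_logits = nn.Linear(d_model, vocab_size, bias=False)
to_logits.weight = embed.weight       # the shared weight matrix (weight tying)

tokens = torch.tensor([[5, 17, 42]])  # dummy token ids
x = embed(tokens) * math.sqrt(d_model)  # embedding output scaled by sqrt(d_model)
# ... x passes through the model ...
# probs = torch.softmax(to_logits(x), dim=-1)  # pre-softmax reuses the same matrix
```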
Positional Encoding
Purpose:
Add positional encodings (sine and cosine functions of different frequencies) to the input embeddings at the bottoms of the encoder and decoder stacks, injecting the order information otherwise lost by dropping recurrence and convolution
Why use sinusoids?
Sinusoids may let the model extrapolate to sequence lengths longer than those encountered during training; the learned positional embeddings that were also tested performed nearly identically
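A sketch of the encoding, transcribing the paper's formula (the function name is mine):

```python
import torch

def positional_encoding(max_len, d_model=512):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even indices
    angles = pos / (10000 ** (i / d_model))                        # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # sine on even dimensions
    pe[:, 1::2] = torch.cos(angles)   # cosine on odd dimensions
    return pe
    # usage: x = token_embeddings + positional_encoding(seq_len)
```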