figure 1
A. Vaswani et al., “Attention Is All You Need”, NeurIPS, 2017
Earlier sequence models such as RNNs processed tokens one at a time, which prevented parallelization and made long-range dependencies hard to learn. To address these issues, Vaswani et al. proposed the Transformer, a fully attention-based architecture that removes recurrence and convolution entirely. By using self-attention, the Transformer enables efficient parallel computation and stronger modeling of long-range dependencies.
In short, the attention function takes a query and a set of key–value pairs and measures how relevant each key is to the query, then returns a weighted sum of the values, giving higher weights to more relevant ones.
If you think of it like a search engine, Query is your search term, Keys are document titles, and Values are the actual contents of those documents.
figure 2
The input is linearly projected using learned weight matrices, producing queries Q, keys K, and values V with dimensions d_k, d_k, and d_v, respectively.
figure 3
Finally, the scaled dot-product attention formula, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, captures all the steps above, from computing relevance scores to producing context-aware outputs.
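The formula above can be sketched in a few lines of NumPy. This is a minimal illustration (function names and toy shapes are mine), not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance of each key to each query
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of the values

# Toy example: 3 query positions, 4 key/value positions, d_k = 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8): one context vector per query
```

The division by sqrt(d_k) keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.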
figure 4
Instead of a single attention, the Transformer uses multi-head attention to learn from different representation subspaces.
Each head i has its own learnable projection matrices W_i^Q, W_i^K, and W_i^V, which produce the per-head queries, keys, and values:

Q_i = X W_i^Q,  K_i = X W_i^K,  V_i = X W_i^V
Each head then applies scaled dot-product attention in parallel:
figure 5
Finally, the head outputs are concatenated and linearly projected back to d_model, ensuring consistency across all layers in the Transformer.
where W^O is the learned output projection matrix of shape (h·d_v) × d_model.
Why it helps: each head can attend to information from different representation subspaces at different positions, so the model can capture several kinds of relationships (e.g., syntactic and semantic) at once, which a single averaged attention would blur together.
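Multi-head attention can be sketched as a loop over heads followed by a concatenation and output projection. This is a simplified NumPy sketch with self-attention only (the parameter layout and toy sizes are assumptions of mine):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, params, h=8):
    """Project X into h heads, attend per head, concatenate, project back.
    `params` holds per-head W_q/W_k/W_v stacks and a shared W_o."""
    d_model = X.shape[-1]
    d_k = d_model // h
    heads = []
    for i in range(h):
        Q = X @ params["W_q"][i]                 # (seq, d_k)
        K = X @ params["W_k"][i]
        V = X @ params["W_v"][i]
        scores = Q @ K.T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V)        # per-head attention
    concat = np.concatenate(heads, axis=-1)      # (seq, h * d_k)
    return concat @ params["W_o"]                # back to (seq, d_model)

# Toy sizes: seq = 5, d_model = 64, h = 8 (so d_k = 8).
rng = np.random.default_rng(1)
d_model, h, seq = 64, 8, 5
params = {
    "W_q": rng.standard_normal((h, d_model, d_model // h)) * 0.1,
    "W_k": rng.standard_normal((h, d_model, d_model // h)) * 0.1,
    "W_v": rng.standard_normal((h, d_model, d_model // h)) * 0.1,
    "W_o": rng.standard_normal((d_model, d_model)) * 0.1,
}
X = rng.standard_normal((seq, d_model))
print(multi_head_attention(X, params, h).shape)  # (5, 64)
```

Because each head works in a d_model/h-dimensional subspace, the total cost is similar to one full-dimensional attention.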
figure 6
Now that we understand attention and multi-head attention, we can look at the overall Transformer architecture, which is divided into two main parts: an encoder (on the left in the image) and a decoder (on the right).
Both are composed of a stack of N = 6 identical layers, and the encoder and decoder share several sub-layer types, such as multi-head attention, the feed-forward network, and the surrounding Add & Norm steps.
Since the model has no recurrence or convolution, it needs a way to represent the order of tokens. To achieve this, sinusoidal positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position and i is the dimension index.
These encodings share the same dimensionality as the embeddings (d_model = 512), allowing them to be added directly.
Positional encoding enables the model to capture sequence order. The sinusoidal form was chosen because it allows the model to learn relative positions easily, since PE(pos + k) can be represented as a linear function of PE(pos) for any fixed offset k.
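The sinusoidal encodings above are straightforward to compute; here is a minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); odd dims use cos."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions
    pe[:, 1::2] = np.cos(angle)              # odd dimensions
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)   # (50, 512) — same d_model as the embeddings
print(pe[0, :4])  # position 0: sin(0)=0, cos(0)=1 -> [0. 1. 0. 1.]
```

Each dimension traces a sinusoid of a different wavelength, so every position gets a unique, smoothly varying fingerprint.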
Each sub-layer in both the encoder and decoder is wrapped with a residual connection followed by layer normalization:

LayerNorm(x + Sublayer(x))

where x is the input to the sub-layer and Sublayer(x) is the function implemented by the sub-layer itself.
Why it helps: the residual connections give gradients a direct path through the deep stack, easing optimization, while layer normalization keeps activations in a stable range and speeds up training.
To ensure that residual connections work seamlessly, every sub-layer and embedding layer in the model produces outputs with the same dimensionality (d_model = 512).
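The Add & Norm wrapper is a one-liner once layer normalization is defined. A minimal sketch (learned gain/bias omitted for brevity; names are mine):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize across the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """LayerNorm(x + Sublayer(x)): residual connection, then normalization."""
    return layer_norm(x + sublayer(x))

# Works with any sub-layer that preserves d_model, e.g. a toy scaling one.
x = np.ones((2, 4))
out = add_and_norm(x, lambda t: t * 2)
print(out.shape)  # (2, 4) — same shape in, same shape out
```

Because every sub-layer preserves d_model, the same wrapper applies uniformly to attention and feed-forward sub-layers alike.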
Each layer in both the encoder and decoder includes a feed-forward network (FFN) that is applied independently at each position.
The FFN consists of two linear layers with a ReLU activation in between:
figure 7
The same weights are shared across all positions within a layer. However, the two layers inside the FFN have their own sets of weights, meaning parameters are not shared between them.
It first expands the representation to a higher dimension (d_ff = 2048) and then projects it back to the model dimension (d_model = 512).
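The position-wise FFN, FFN(x) = max(0, xW1 + b1)W2 + b2, is two matrix multiplies with a ReLU in between. A minimal NumPy sketch with the paper's dimensions (initialization scale is an arbitrary choice of mine):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 — applied at every position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq = 512, 2048, 10
rng = np.random.default_rng(2)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02   # expand to d_ff
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02   # project back to d_model
b2 = np.zeros(d_model)
x = rng.standard_normal((seq, d_model))
print(feed_forward(x, W1, b1, W2, b2).shape)  # (10, 512)
```

Since the multiplication is over the feature axis, the same W1/W2 are applied to every position independently, exactly as described above.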
figure 8
The encoder has six identical layers, and each layer contains two main sub-layers: multi-head self-attention and a feed-forward network (FFN).
figure 9
figure 10
The decoder also has six identical layers, but each layer contains three main sub-layers. It generates the output sequence one token at a time, using previously predicted tokens as context.
At the beginning (the bottom of figure 10), the output embeddings, shifted right by one position, are added to positional encodings.
figure 11
The first sub-layer is masked multi-head self-attention. It is similar to the encoder’s self-attention, but the key difference is the mask that prevents each position from attending to future tokens. This preserves the auto-regressive property, ensuring the model only sees tokens up to the current position.
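The mask is typically implemented by adding negative infinity to the scores of future positions before the softmax, which zeroes their attention weights. A small NumPy illustration (uniform scores chosen for clarity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    """-inf above the diagonal: position i may only attend to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((4, 4))                   # uniform scores for illustration
weights = softmax(scores + causal_mask(4))
print(np.round(weights, 2))
# Row i spreads attention only over the first i+1 positions:
# row 0 -> [1, 0, 0, 0], row 3 -> [0.25, 0.25, 0.25, 0.25]
```

After the softmax, the -inf entries become exactly zero, so no information leaks from future tokens and the auto-regressive property holds.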
figure 12
The second sub-layer is encoder-decoder attention. Here, the queries come from the decoder’s previous sub-layer, while keys and values come from the encoder’s output. This allows every position in the decoder to attend to all positions in the input sequence, integrating the encoded information into the generation process.
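The key point is the wiring: Q comes from the decoder, K and V from the encoder, so the output always has one context vector per decoder position regardless of the source length. A stripped-down sketch (learned projections omitted; names are mine):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output):
    """Queries from the decoder; keys and values from the encoder output."""
    d_k = decoder_states.shape[-1]
    Q = decoder_states
    K = V = encoder_output
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (target_len, source_len)
    return weights @ V

rng = np.random.default_rng(3)
enc = rng.standard_normal((7, 64))   # 7 source tokens from the encoder
dec = rng.standard_normal((3, 64))   # 3 target tokens generated so far
out = cross_attention(dec, enc)
print(out.shape)  # (3, 64): one context vector per decoder position
```

Note there is no mask here: every decoder position is allowed to look at the entire input sequence.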
The third sub-layer is the feed-forward network. As in the encoder, each token's representation passes through the FFN independently.
Add & Norm: as in the encoder, each sub-layer is wrapped with a residual connection followed by layer normalization.
The table below compares self-attention, recurrent, and convolutional layers in terms of their efficiency and ability to model long-range relationships. Here’s why self-attention stands out:
figure 13
The Transformer was trained on the WMT 2014 English–German (4.5M sentence pairs) and English–French (36M sentence pairs) datasets.
Sentences were tokenized using Byte Pair Encoding (BPE) to reduce vocabulary size and handle rare words efficiently.
Training used the Adam optimizer, along with regularization methods such as dropout (rate = 0.1) and label smoothing (ε = 0.1) to improve generalization and prevent overfitting.
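Concretely, the paper pairs Adam with a learning-rate schedule that warms up linearly for the first 4,000 steps and then decays with the inverse square root of the step number. A small sketch (the function name is mine):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    Linear warm-up for `warmup` steps, then inverse-sqrt decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The learning rate peaks exactly at the warm-up boundary.
peak = transformer_lr(4000)
print(f"peak lr = {peak:.6f}")
```

The warm-up avoids large, destabilizing updates early in training, when the layer-norm statistics and attention weights are still settling.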
The Transformer achieved state-of-the-art performance on major translation benchmarks:
figure 14
Even the base model outperformed previous RNN and CNN models at far lower training cost.
Beyond translation, the Transformer also performed well on English constituency parsing, showing that its attention-based architecture generalizes effectively to other sequence modeling tasks.
In short, the Transformer trained faster, cost less, and achieved higher accuracy than all prior sequence models—marking a major shift in deep learning architecture design.
While the Transformer achieved remarkable results, it also introduced several challenges that motivated later research:
Quadratic complexity in attention:
The self-attention mechanism compares every token with every other token, causing quadratic growth in memory and computation as the sequence length increases. This makes Transformers inefficient for very long sequences like documents or videos.
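To see why this matters in practice, consider just the attention score matrix, which holds one float per query-key pair. A back-of-the-envelope sketch (float32, single head, ignoring activations and gradients):

```python
def attention_memory_mb(seq_len, bytes_per_float=4):
    """The score matrix alone is seq_len x seq_len floats: O(n^2) memory."""
    return seq_len * seq_len * bytes_per_float / 1e6

for n in (512, 4096, 32768):
    print(f"seq_len={n:6d}: {attention_memory_mb(n):9.1f} MB")
```

Going from 512 to 32,768 tokens multiplies the sequence length by 64 but the score-matrix memory by 4,096, which is the core obstacle for document- or video-length inputs.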
→ I think the computational cost could be reduced from quadratic to near-linear by designing more efficient attention mechanisms.
Fixed sequence length:
The model relies on positional encodings with a fixed size, meaning it cannot naturally process sequences longer than those it was trained on. This limits its ability to handle dynamic or streaming data efficiently.
→ I think using relative positional encodings instead of fixed positional encodings allows the model to generalize better to sequences longer than those seen during training.