[WIP] Attention Is All You Need

Estelle Yoon · March 18, 2025

WIP


Date: 2017
Conference: NIPS (NeurIPS)

1 Introduction

Background

RNNs, LSTMs, and GRUs have been firmly established as the state of the art in sequence modeling and transduction problems such as language modeling and machine translation

Problem

Recurrent models compute sequentially along the symbol positions, which precludes parallelization within training examples; this becomes critical at longer sequence lengths, as memory constraints limit batching across examples

Many solutions (factorization tricks, conditional computation) have improved computational efficiency, and in some cases model performance

However, the fundamental constraint of sequential computation remains

2 Background

To reduce sequential computation, many CNN-based models were proposed

These models compute hidden representations in parallel for all input and output positions

However, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between those positions, which makes it more difficult to learn dependencies between distant positions

In the Transformer, this is reduced to a constant number of operations

This reduces the effective resolution due to averaging attention-weighted positions, an effect the model counteracts with Multi-Head Attention

Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence

3 Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure

Encoder - maps an input sequence of symbol representations $(x_1, ..., x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, ..., z_n)$
Decoder - given $\mathbf{z}$, generates an output sequence $(y_1, ..., y_m)$ one element at a time

The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder

3.1 Encoder and Decoder Stacks

Encoder

Composed of 6 identical layers

Each layer has two sub-layers - a multi-head self-attention mechanism and a position-wise fully connected feed-forward network

Each sub-layer has a residual connection around it, followed by layer normalization: the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$

Decoder

Composed of 6 identical layers

Each layer has three sub-layers - the two sub-layers of the encoder layer, plus a third sub-layer that performs multi-head attention over the output of the encoder stack

Each sub-layer has a residual connection followed by layer normalization, as in the encoder

The self-attention sub-layer is modified (masked) to ensure that predictions for a position can depend only on the known outputs at positions before it
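
As a concrete illustration of the residual-plus-normalization pattern above, here is a minimal PyTorch sketch; the class name SublayerConnection, the callable sublayer argument, and d_model = 512 are my own framing, not the paper's code.

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Wraps any sub-layer as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        # sublayer is either the (masked) self-attention or the feed-forward network
        return self.norm(x + sublayer(x))
```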

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors

The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key

3.2.1 Scaled Dot-Product Attention

The weights on the values (dimension $d_v$) are obtained by computing the dot products of the queries (dimension $d_k$) with the keys (dimension $d_k$), dividing each by $\sqrt{d_k}$, and applying a softmax function: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

The two most commonly used attention functions are additive attention and dot-product attention

The two are similar in theoretical complexity, but dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code

For small $d_k$ the two perform similarly, but for larger $d_k$ additive attention outperforms dot-product attention without scaling

To counteract this, the dot products are scaled by $\frac{1}{\sqrt{d_k}}$
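
A minimal PyTorch sketch of scaled dot-product attention; the function name, tensor shapes, and the optional mask argument are my assumptions rather than the paper's reference implementation.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k: (..., seq_len, d_k); v: (..., seq_len, d_v)."""
    d_k = q.size(-1)
    # Compatibility scores: dot products of queries with keys, scaled by 1/sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Mask out illegal connections before the softmax (see Section 3.2.3)
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # attention weights sum to 1 over the keys
    return weights @ v                        # weighted sum of the values
```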

3.2.2 Multi-Head Attention

Linearly projecting the queries, keys, and values $h$ times with different learned linear projections to $d_k$, $d_k$, and $d_v$ dimensions respectively is more beneficial than using a single attention function with $d_{model}$-dimensional keys, values, and queries

On each projected version of the queries, keys, and values, the attention function is performed in parallel, yielding $d_v$-dimensional output values, which are concatenated and once again projected

Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality
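
A rough sketch of multi-head attention under the same assumptions (it reuses the scaled_dot_product_attention sketch above); the fused per-tensor projections and the class layout are my own choices, with $d_{model} = 512$ and $h = 8$ taken from the paper's base configuration.

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h       # d_k = d_v = d_model / h
        self.w_q = nn.Linear(d_model, d_model)   # h parallel projections, fused into one matrix
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # final output projection

    def forward(self, q, k, v, mask=None):
        b = q.size(0)

        def split_heads(w, x):
            # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
            return w(x).view(b, -1, self.h, self.d_k).transpose(1, 2)

        q, k, v = split_heads(self.w_q, q), split_heads(self.w_k, k), split_heads(self.w_v, v)
        out = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate the heads back to (batch, seq_len, d_model) and project
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)
```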

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways

  1. Mimics the typical encoder-decoder attention mechanism in sequence-to-sequence models
    In the "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder
    This allows every position in the decoder to attend over all positions in the input sequence
  2. Allows each position in the encoder to attend to all positions in the previous layer of the encoder
    In the encoder's self-attention layers, all of the keys, values, and queries come from the previous layer of the encoder
  3. Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position
    Leftward information flow in the decoder must be prevented to preserve the auto-regressive property
    The paper implements this inside scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax that correspond to illegal connections - see the masking sketch after this list
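
A small sketch of the causal (look-ahead) mask used in item 3, under the same PyTorch assumptions as the attention sketches above; the helper name causal_mask is hypothetical.

```python
import torch

def causal_mask(seq_len):
    # True where attention is allowed (current and earlier positions),
    # False for "future" positions, which are set to -inf before the softmax
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

mask = causal_mask(4)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```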

3.3 Position-wise Feed-Forward Networks

In addition to the attention sub-layers, each layer in the encoder and decoder contains a fully connected feed-forward network - consisting of two linear transformations with a ReLU activation in between - applied to each position separately and identically

The linear transformations are the same across different positions, but use different parameters from layer to layer
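
A minimal sketch of the position-wise feed-forward network, $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$; $d_{model} = 512$ and $d_{ff} = 2048$ follow the paper's base configuration, while the module itself is my own illustration.

```python
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # first linear transformation
            nn.ReLU(),                  # ReLU activation in between
            nn.Linear(d_ff, d_model),   # second linear transformation
        )

    def forward(self, x):
        # nn.Linear acts on the last dimension, so each position is transformed
        # separately and identically with the same weights
        return self.net(x)
```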

3.4 Embeddings and Softmax

Similar to other sequence models, learned embeddings are used to convert the input tokens and output tokens to vectors of dimension $d_{model}$

A learned linear transformation and a softmax function are used to convert the decoder output to predicted next-token probabilities

In this model, the same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation - in the embedding layers, the weights are multiplied by $\sqrt{d_{model}}$
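
A hedged sketch of this weight sharing; only the shared matrix and the $\sqrt{d_{model}}$ scaling come from the paper, while the module name and vocab_size are hypothetical.

```python
import math
import torch.nn as nn

class TiedEmbedding(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embed.weight     # one weight matrix shared with the pre-softmax projection

    def forward(self, tokens):
        # In the embedding layers, the weights are scaled by sqrt(d_model)
        return self.embed(tokens) * math.sqrt(self.d_model)

    def logits(self, decoder_output):
        # Pre-softmax linear transformation producing next-token logits
        return self.proj(decoder_output)
```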

3.5 Positional Encoding

As there is no recurrence and no convolution in the model, positional information must be injected

"Positional encodings" of dimension $d_{model}$ are added to the input embeddings at the bottoms of the encoder and decoder stacks

Sinusoidal positional encodings were chosen because they may make it easy for the model to learn to attend by relative positions: for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$

Compared to learned positional embeddings, this produced nearly identical results, but may allow the model to extrapolate to sequence lengths longer than those seen during training
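
A short sketch of the sinusoidal encoding, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$; the function returns a (max_len, d_model) table to be added to the embeddings, and max_len is my own placeholder.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len=5000, d_model=512):
    position = torch.arange(max_len).unsqueeze(1)                # (max_len, 1)
    # 1 / 10000^(2i/d_model) for each even dimension index 2i
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine
    return pe
```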

4 Why Self-Attention

Path length between long-range dependencies in the network

A key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network

The shorter these paths, the easier it is to learn long range dependencies

Amount of computation that can be parallelized (measured by the minimum number of sequential operations required)

Total computational complexity per layer

Compare to Recurrent layer

A self-attention layer connects all positions with a constant number of sequentially executed operations, $O(1)$, whereas a recurrent layer requires $O(n)$ sequential operations

In terms of computational complexity per layer, self-attention layers ($O(n^2 \cdot d)$) are faster than recurrent layers ($O(n \cdot d^2)$) when the sequence length $n$ is smaller than the representation dimension $d$
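
A back-of-the-envelope check of this claim with my own toy numbers (not from the paper), using a typical sentence length $n = 50$ and $d = 512$:

```python
# Rough per-layer operation counts, ignoring constant factors
n, d = 50, 512
self_attention = n * n * d     # O(n^2 · d) ->  1,280,000
recurrent      = n * d * d     # O(n · d^2) -> 13,107,200
print(self_attention < recurrent)   # True: self-attention is cheaper when n < d
```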

For very long sequences, self-attention could be restricted to a neighborhood of size $r$ around each position; this is left for future work

Compare to Single convolutional layer

A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions

Doing so requires a stack of $O(n/k)$ convolutional layers (or $O(\log_k(n))$ with dilated convolutions), which increases the length of the longest paths between any two positions in the network

By using separable convolutions, the complexity can be decreased to $O(k \cdot n \cdot d + n \cdot d^2)$
Even with $k = n$, however, this complexity is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach taken in the Transformer
