[Paper Review] Attention Is All You Need

김까치 · October 12, 2023

Abstract

  • We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions
  • Experiments on two machine translation tasks

Introduction

  • Recurrent models - fundamental constraint of sequential computation
  • Transformer - eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output; allows for significantly more parallelization

Model Architecture

Encoder and Decoder Stacks

Encoder

  • composed of a stack of $N = 6$ identical layers
  • each layer has two sub-layers
    • multi-head self-attention mechanism
    • position-wise fully connected feed-forward network
  • employ a residual connection around each of the two sub-layers, followed by layer normalization (sketched below):
    $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$
  • To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model} = 512$
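
A minimal NumPy sketch of this residual + layer-normalization wrapping, assuming a simplified layer norm without the learned gain and bias (function and variable names are mine, not the authors'):

```python
import numpy as np

d_model = 512  # output dimension shared by all sub-layers and embedding layers

def layer_norm(x, eps=1e-6):
    # normalize each position's d_model-dimensional vector (learned gain/bias omitted)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    # LayerNorm(x + Sublayer(x)): sub-layer output, residual add, then layer norm
    return layer_norm(x + sublayer(x))

# usage: x is (T, d_model); the identity stands in for self-attention / FFN
x = np.random.randn(10, d_model)
out = residual_block(x, lambda h: h)
assert out.shape == (10, d_model)
```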

Decoder

  • also composed of a stack of $N = 6$ identical layers
  • in addition to the two sub-layers of each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack
  • the self-attention sub-layer in the decoder stack is modified (with masking) to prevent positions from attending to subsequent positions
  • employ residual connections around each of the sub-layers, followed by layer normalization

Attention

Scaled Dot-Product Attention

  • input consists of
    queries and keys of dimension $d_k$
    values of dimension $d_v$
  • queries, keys, and values are packed into matrices $Q$, $K$, $V$
  • compute the matrix of outputs as:
  1. compute the dot products of the query with all keys
  2. divide each by $\sqrt{d_k}$
  3. (only in decoder)
    masking out (setting to $-\infty$) the values that correspond to illegal connections
  4. apply a softmax function to obtain the weights on the values
  • for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$
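
A small NumPy sketch of these four steps, with $Q$ of shape $(T_q, d_k)$, $K$ of shape $(T_k, d_k)$, $V$ of shape $(T_k, d_v)$, and an optional boolean mask marking illegal connections (names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # steps 1-2: dot products, scaled by sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)  # step 3: mask illegal connections (≈ -inf)
    weights = softmax(scores)                  # step 4: softmax gives weights on the values
    return weights @ V                         # weighted sum of the values
```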

Multi-Head Attention

  1. linearly project the queries, keys and values $h$ times
    with different, learned linear projections
  2. on each of these projected versions of queries, keys and values, we then perform the attention function in parallel
  3. these are concatenated
  4. and once again projected, resulting in the final values
  • $Q, K, V: T \times d_{model} = T \times 512$
    $h = 8$
    $d_k = d_v = d_{model}/h = 64$
    $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{model} \times d_k} = 512 \times 64$
    $W^O \in \mathbb{R}^{h d_v \times d_{model}} = 512 \times 512$
  • $head_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V): (T \times 512) \times (512 \times 64) = T \times 64$
    $\mathrm{Concat}(head_1, \dots, head_h): T \times 512$
    $\mathrm{Concat} \times W^O: (T \times 512) \times (512 \times 512) = T \times 512$
    $\mathrm{MultiHead}(Q, K, V): T \times 512$
  • allows the model to jointly attend to information from different representation subspaces at different positions
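
Putting the four steps together, a sketch of multi-head attention with the dimensions above ($d_{model} = 512$, $h = 8$, $d_k = d_v = 64$); the attention helper repeats the scaled dot-product sketch, and the random weight values are placeholders:

```python
import numpy as np

d_model, h = 512, 8
d_k = d_v = d_model // h                              # 64

rng = np.random.default_rng(0)
W_q = rng.standard_normal((h, d_model, d_k)) * 0.02   # W_i^Q for each head
W_k = rng.standard_normal((h, d_model, d_k)) * 0.02   # W_i^K for each head
W_v = rng.standard_normal((h, d_model, d_v)) * 0.02   # W_i^V for each head
W_o = rng.standard_normal((h * d_v, d_model)) * 0.02  # W^O

def attention(Q, K, V):
    # scaled dot-product attention (no mask), as in the previous sketch
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V):
    heads = []
    for i in range(h):
        # steps 1-2: project Q, K, V ((T, 512) @ (512, 64) -> (T, 64)), then attend in parallel
        heads.append(attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]))
    concat = np.concatenate(heads, axis=-1)   # step 3: concatenate -> (T, 512)
    return concat @ W_o                       # step 4: project with W^O -> (T, 512)

# usage: self-attention over a length-10 sequence
x = rng.standard_normal((10, d_model))
assert multi_head_attention(x, x, x).shape == (10, d_model)
```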

Applications of Attention in our Model

Transformer uses multi-head attention in three different ways:

  • encoder-decoder attention
    queries come from the previous decoder layer
    keys and values come from the output of the encoder
  • self-attention layers in encoder
    all of the keys, values and queries come from the output of the previous layer in the encoder
  • self-attention layers in decoder
    all of the keys, values and queries come from the output of the previous layer in the decoder
    prevent leftward information flow by masking
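
A sketch of the mask used for decoder self-attention: entries above the diagonal (key position after query position) are marked illegal, which is what blocks the leftward information flow. `True` here means "masked out", matching the attention sketch above:

```python
import numpy as np

def causal_mask(T):
    # True where attention is NOT allowed: key position comes after query position
    return np.triu(np.ones((T, T), dtype=bool), k=1)

print(causal_mask(4).astype(int))
# [[0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]
```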

Position-wise Feed-Forward Networks

  • each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically
  • While the linear transformations are the same across different positions, they use different parameters from layer to layer
  • dimensionality of input and output is $d_{model} = 512$
  • inner-layer has dimensionality $d_{ff} = 2048$
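
A sketch of this sub-layer, $FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$, with $d_{model} = 512$ and $d_{ff} = 2048$ (the ReLU form and dimensions are the paper's; the weight values here are placeholders):

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    # applied to each position separately and identically: (T, 512) -> (T, 512)
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```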

Embeddings and Softmax

  • use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model} = 512$
    • multiply those weights by $\sqrt{d_{model}}$
  • use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities
  • share the same weight matrix between the two embedding layers and the pre-softmax linear transformation
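
A sketch of this weight sharing and the $\sqrt{d_{model}}$ scaling; the matrix `E` stands in for the single shared weight matrix, and the vocabulary size is only an example (matching the ~37000-token BPE vocabulary mentioned below):

```python
import numpy as np

d_model, vocab_size = 512, 37000
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model)) * 0.02  # shared embedding / pre-softmax matrix

def embed(token_ids):
    # token embedding lookup, scaled by sqrt(d_model)
    return E[token_ids] * np.sqrt(d_model)

def output_logits(decoder_out):
    # pre-softmax linear transformation reuses the same (tied) weight matrix
    return decoder_out @ E.T   # (T, d_model) -> (T, vocab_size)
```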

Positional Encoding

  • add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks
  • have the same dimension dmodeld_{model} as the embeddings, so that the two can be summed
  • $pos$ is the position and $i$ is the dimension index
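
For reference, the sinusoidal encodings the paper defines with these symbols are:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$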

Training

Training Data and Batching

  • trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs
  • Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens
  • Sentence pairs were batched together by approximate sequence length
  • Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens

Hardware and Schedule

  • trained our models on one machine with 8 NVIDIA P100 GPUs
  • trained the base models for a total of 100,000 steps or 12 hours

Optimizer
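
The paper trains with the Adam optimizer and a learning rate that increases linearly for the first warmup_steps training steps and then decays proportionally to the inverse square root of the step number, with warmup_steps = 4000. A minimal sketch of that schedule:

```python
def lrate(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# usage: the learning rate peaks around step == warmup_steps (step counts from 1)
print(lrate(1), lrate(4000), lrate(100000))
```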

Regularization

Results

Machine Translation

  • For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals
  • used beam search with a beam size of 4 and length penalty $\alpha = 0.6$
  • set the maximum output length during inference to input length + 50, but terminate early when possible
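
A hypothetical sketch of the checkpoint-averaging step; `checkpoints` is assumed to be a list of dicts mapping parameter names to NumPy arrays, and loading/saving is left out:

```python
import numpy as np

def average_checkpoints(checkpoints):
    # element-wise mean of every parameter across the given checkpoints
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

# usage: averaged = average_checkpoints(last_five_checkpoints)  # last 5 for the base model
```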

Model Variations


  • $d_{ff}$: feed-forward dimension
  • $h$: number of attention heads
  • $d_k$: attention key dimension
  • $d_v$: attention value dimension
  • $P_{drop}$: dropout rate
