Attention Is All You Need

Hyungseop Lee·2024년 4월 2일

[Paper Review] Transformer for NLP

목록 보기

1/2

이해를 얻은 영상

https://www.youtube.com/watch?v=6s69XY025MU

Paper Info

Authors : Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
subject : neural information processing systems 30 (2017).
link : https://arxiv.org/abs/1706.03762

몰랐던 용어, 사전지식

https://www.youtube.com/watch?v=bCz4OMemCcA

BLEU

(출처 : https://ko.wikipedia.org/wiki/BLEU)

BLEU(Bilingual Evaluation Understudy) :
- 하나의 자연어에서 다른 자연어로 기계 번역된 텍스트의 품질을 평가하는 알고리즘.
- 품질은 기계의 출력과 인간의 출력 사이의 대응으로 간주된다.
- 기계 번역이 전문적인 인간 번역에 가까울수록 품질이 더 좋다.
  
  자연어 처리 분야에서 사용하는 평가 지표인듯.
  나는 vision task에 관심이 많기 때문에 이 이상의 detail은 생략...
  그냥 간단히 model의 출력이 인간의 출력에 얼만큼 비슷한지 평가하는 지표로 생각.

Abstract

우리는 reccurence 및 convolution을 제외하고 오직 attention mechanism에 기반한
a new simple network architecture인 Transformer를 제안한다.
두 가지 machine translation tasks에 대한 실험 결과,
transformer model들이 품질 면에서 우수하며 parallelizable and train time이 훨씬 적다는 것을 보여준다.
우리 model은 WMT 2014 English-to-German translation task에서 28.4 BLEU를 달성하여 기존의 state-of-the-art 결과를 능가했다.

1. Introduction

RNN가 특히 LSTM과 같은 gated recurrent neural networks는
language modeling 및 machine translation에서
state of the art approaches로 확립되어져 왔다.
(RNN 단점)
Recurrent model은 본질적으로 training examples에서 parallelization을 배제하며,
memory constraint으로 인해 긴 sequence lengths에서 critical하다.
최근 이를 해결하기 위한 여러 연구들이 있긴 했지만, 여전히 근본적인 제약이 남아있다..
Attention mechanism은 input이나 output sequences의 거리에 관계없이
dependencies를 modeling할 수 있도록 해주어 다양한 task에서 좋은 sequence modeling 및 transduction models의 필수 요소가 되었다.
이 논문에서는 recurrent를 배제하고 input 및 output 간의 global dependencies을
완전히 attention mechanism에 의존하는 model architecture인 Transformer를 제안한다.
Transformer는 more parallelization을 가능하게 하며, 8개의 P100 GPU에서 12시간만 training 후에도 번역 품질은 new state of the art 수준을 도달할 수 있게 되었다.

2. Background

(앞선 논문들의 문제점)
sequential computation을 줄이는 것을 목표로 하는 Extended Neural GPU, ByteNet, ConvS2S의 기초적인 building block으로서
CNN을 사용하는 model들은 모든 input and output position에 대해 병렬로 hidden representations을 계산한다.
이러한 model에서 두 가지 임의의 input 또는 output 위치에서 signal을 관련시키기 위해 필요한 작업 수는 위치 간의 거리에 따라 선형적으로(ConvS2S의 경우) 또는 log적으로(ByteNet의 경우) 증가한다.
이로 인해 먼 위치 간의 dependencies를 학습하는 것이 더 어려워진다.

Transformer에서는 이러한 작업 수를 일정한 수로 줄였지만,
평균화된 attention-weighted position으로 인해 effective resolution가 감소하는 비용이 발생한다. (?)
이 효과는 section 3.2에서 설명된 Multi-Head Attention으로 상쇄시킴으로써 극복됨.
(attention mechanism의 성공사례, 연구들)
Self-attention, intra-attention이라고도 불리는 이는 single sequence의 다양한 위치 간의 관계를
나타내어 sequence의 representation을 계산하는 attention mechanism이다.
Self-attention은 reading comprehension, abstractive summarization, tectual entailment and learning task-independent representations 등 다양한 작업에서 성공적으로 사용됨.

End-to-end memory networks는 sequence에 맞춘 recurrence 대신에 attention mechanism을 기반으로 하며,
simple-language question answering and language modeling tasks에서 잘 수행된다는 것이 입증되었다[34].
(이 논문에서 최초로 attention mechanism만 사용한 Transformer를 소개)
그러나 우리가 알고 있는 선에서,
Transformer는 sequence에 맞춘 RNN 또는 Convolution을 사용하지 않고,
input과 output의 representation을 계산하는 데에
전적으로 self-attention에 의존하는 최초의 model transduction(변환)이다.
(다른 논문에서는 RNN, CNN에 attention mechanism을 써서 좋은 성능을 보였는데
이 논문에서는 RNN, CNN을 아예 쓰지 않고 attention mechanism만 써서 input과 output의 representation을 계산하는 듯...)
다음 section에서는, self-attention 기반 Transformer에 대해 설명하고,
다른 model에 비해 갖는 장점을 논의할 것임.

3 Model Architecture

대부분의 sequence transduction models들은 encoder-decoder structure이다.
여기서,
the encoder는 symbol representations $(x_1, ..., x_n)$ 의 input sequence를
연속적인 표현 $z=(z_1,...,z_n)$ 으로 mapping한다.
주어진 $z$ 로, decoder는 한 번에 하나의 element로 symbol의 출력 $(y_1, ..., y_m)$ 을 생성한다.
각 단계에서 model은 auto-regressive이며, 다음을 생성할 때 이전에 생성된 symbol들을 추가 input으로 사용한다.

3.1 Encoder and Decoder Stacks

Encoder

Encoder는 $N=6$ identical layers로 stack되어져있다.
- 각 layer는 2개의 sub-layer들을 갖는다.
  1. 첫번째는 multi-head self-attention mechanism
  2. 두번째는 a simple, position-wise fully connected feed-forward network.
우리는 2개의 sub-layer들에게 각각 residual connection을 적용했고, layer normalization을 거친다.
즉, the output of each sub-layer is
$LayerNorm(x + Sublayer(x))$ , where $Sublayer(x)$ is the function implemented by the sub-layer itself.
이 residual connection들을 편하게 사용하기 위해,
model에 있는 모든 sub-layers들 뿐만 아니라 embedding layers들은
outputs of dimension $d_{model}=512$ 를 만든다.

Decoder

Decoder도 $N=6$ identical layers로 stack되어져있다.
encoder layer의 2개의 sub-layers들에 추가로,
decoder는 encoder stack의 output에 대한 multi-head attention을 수행하는 세 번째 sub-layer를 삽입했다.
encoder와 비슷하게, 각각의 sub-layer에 residual connection을 적용했고, layer normalization을 거친다.
또한 decoder의 self-attention sub-layer를 수정하여
각 position이 그 뒤의 position에 대해 attention할 수 없도록 했다.
이러한 masking은, output embedding이 한 position씩 offset되기 때문에
position $i$ 에 대한 prediction이 $i$ 보다 작은 position의 알려진 output에만 의존하도록 보장한다.

3.2 Attention

An attention function은 query와 key-value pair를 출력으로 mapping하는 것으로 설명할 수 있다.
여기서 query, keys, values 및 output은 모두 vector이다.
output은 weighted sum of the values로 계산되며,
각 value에 할당된 weight는 해당하는 key와의 compatibility function에 의해 계산된다.

3.2.1 Scaled Dot-Product Attention

우리는 특정한 attention을 "Scaled Dot-Product Attention"이라고 부른다.
input은 dimension이 $d_k$ 인 queries와 keys, 그리고 $d_v$ 인 values로 구성된다.
우리는 query와 모든 keys들을 dot products하고,
가각을 $\sqrt{d_k}$ 로 나눈 뒤,
softmax function을 적용하여 values들의 weights를 얻는다.
실제로, 우리는 동시에 여러 query로 이루어진 set에 대해 attention function을 계산한다.
이들을 matrix $Q$ 로 함께 묶어서 표현한다.
keys와 values 역시 matrix $K$ 와 $V$ 로 함께 묶어서 나타낸다.
우리는 matrix of output을 다음과 같이 계산한다.
두 가지 가장 흔히 사용되는 attention functions으로
additive attention과 dot-product(multiplicative) attention이 있다.
Dot-product attention은 scaling factor $\frac{1}{\sqrt{d_k}}$ 을 곱해주는 것을 제외하고는 우리의 algorithm과 동일하다.
additive attention과 dot-product attention은 이론적으로 유사하지만,
실제로 dot-product attention이 optimized matrix multiplication code를 사용하여 구현할 수 있기 때문에 보다 빠르고 space-efficient하기 때문에 많이 사용된다.

3.2.2 Multi-Head Attention

$d_{model}$ -dimension의 key, value, query로 single attention function을 수행하는 대신,
query, keys 및 values에 대해 서로 다른 $h$ 번의 trained된 linear projection을 통해
$d_k, d_k$ 및 $d_v$ dimension으로 linearly projection하는 것이 유익하다고 알아냈다.
이러한 projection된 query, key 및 value의 각 version에 대해 병렬로 attention function을 수행하고,
이로부터 $d_v$ -dimensional output values를 얻는다.
이들은 concatenated되고 다시 projection되어 final value가 만들어진다.
Multi-head attention은 model이 서로 다른 representation subspace에 있는 information에 대해
함께 attention할 수 있도록 한다.
single attention head의 경우, averaging하는 것이 이를 억제한다.
(이해를 위한 그림 : https://codingopera.tistory.com/44)
- single-head attention mechanism :
- multi-head attention mechanism :

3.2.3 Application of Attention in our Model

Transformer는 multi-head attention을 세 가지 다른 방법으로 사용했다.
1. "encoder-decoder attention" layer에서 query는 이전 decoder layer에서 가져오며,
  memory key와 value 값은 encoder output에서 가져온다.
  이로써 decoder의 각 position이 input sequence의 모든 position을 attention할 수 있게 됨.
2. encoder에는 self-attention layers가 포함되어 있음.
  self-attentino layer에서 모든 key, value 및 query는 동일한 position에서 가져옴.
  이 경우에는 encoder의 이전 layer의 output에서 가져옴.
  encoder의 각 위치는 encoder의 이전 layer의 모든 position을 attention할 수 있음.
3. 마찬가지로, decoder의 self-attention layer는
  decoder의 각 position이 해당 position을 포함하여 decoder의 모든 position에 attention할 수 있도록 함.
  decoder에서 leftward information flow을 방지하여 auto-regressive property를 보존하기 위해
  leftward information flow를 막아야 함.
  우리는 scaled dot-product attention 내에서 이를 구현하고, 잘못된 connection에 해당하는 softmax의 input value들을 masking하여(setting to $-\infin$ ) 이를 수행함.
- leftward information flow란?
  (출처 : https://medium.com/lunit/transformers-in-computer-vision-an-overview-4fe00bf62297)
  MHSA(Multi-Head Self-Attention)와 비슷하게 decoder token에서 Q/K/V가 생성.
  MHSA와 다른 점은 중간에 mask가 들어간다는 점.
  sequence modeling의 경우 test시 순차적으로 출력을 생성하는데 (auto-regressive),
  MHSA구조를 사용하면 미래의 (아직 생성되지 않은) token을 참조하게 됨 — leftward information flow.
  Mask를 사용하여 leftward information flow를 막고, train/test의 간극을 없앨 수 있음.

3.3 Position-wise Feed-Forward Networks

encoder와 decoder의 각 layer에는 attention sub-layer에 추가로,
각 position에 대해 별도로 동일하게 적용되는 fully connected feed-forward network가 포함되어 있다.
이는 두 개의 linear transformation과 그 사이에 ReLU activation으로 구성된다.
$FFN(x) = max(0, xW_1+b_1)W_2 + b_2$

linear transformation은 서로 다른 position에 대해 동일하지만,
layer 간에는 다른 paramter를 사용한다.
이를 다른 방식으로 설명하면, kernel size가 1인 두 개의 convolution으로 볼 수 있다.
input과 output dimension은 $d_{model}=512$ 이며, inner-layer dimension은 $d_{ff}=2048$ 이다.

3.4 Embeddings and Softmax

다른 sequence transduction model들 처럼,
우리도 input tokens과 output tokens을 dimension이 $d_{model}$ 인 vector로 변환하기 위해
학습된 embedding을 사용했다.
또한 decoder output을 predicted next token probabilities로 변환하기 위해
일반적인 학습된 linear transformation and softmax function을 사용한다.
우리 model에서는 두 embedding layer와 pre-softmax linear transformation 사이에
동일한 weight matrix를 공유한다.
이 weight matrix는 embedding layer에서 $\sqrt{d_{model}}$ 로 곱해진다.

3.5 Positional Encoding

우리 model은 recurrence나 convolution을 포함하지 않으므로,
model이 sequence의 순서를 활용하려면 sequence 내 token의 상대적이거나 절대적인 위치에 대한 정보를 준비해야 한다.
이를 위해 우리는 encoder와 decoder stack의 맨 아래에 "positional encodings"을 input embedding에 추가하였다.
positional encoding은 embedding과 동일한 dimension인 $d_{model}$ 을 가지므로 두 값이 더해질 수 있다.
positional encoding의 선택지는 많이 있으며, learned and fixed된 것이 있다.
이 논문에서는 sine and cosine function을 사용하여
서로 다른 frequency의 positional encoding을 사용한다.

Architecture 정리, 요약

4 Why Self-Attention

self-attention layer가 recurrent와 convolution layer에 비해 어떠한 장점이 있는지 3가지 상황을 고려하여 설명할 것임.
1. One is the total computational complexity per layer
2. Another is the amount of computation that can be parallelized
3. The third is the path length between long-range dependencies in the nwork.
부가적인 이점으로,
self-attention은 더욱 interpretable(해석가능한) model을 제공할 수 있다.
우리는 model에서 얻은 attention distribution을 검토하고, appendix에서 discussiong할 것임.
또한 multi-head가 명확하게 서로 다른 task를 수행하는 것은 물론,
많은 attention head가 문장의 문법적 및 의미적 구조와 관련된 행동을 나타내는 것으로 보임.

5 Training

5.1 Training Data and Batching

5.2 Hardware and Schedule

5.3 Optimizer

5.4 Regularization

Residual Dropout

Label Smoothing

6 Results

6.1 Machine Translation

6.2 Model Variations

6.3 English Constituency Parsing

7 Conclusion

Hyungseop Lee

Efficient Deep Learning

다음 포스트

Attention Is All You Need

[Paper Review] Transformer for NLP

이해를 얻은 영상

Paper Info

몰랐던 용어, 사전지식

BLEU

Abstract

1. Introduction

2. Background

3 Model Architecture

3.1 Encoder and Decoder Stacks

Encoder

Decoder

3.2 Attention

3.2.1 Scaled Dot-Product Attention

3.2.2 Multi-Head Attention

3.2.3 Application of Attention in our Model

3.3 Position-wise Feed-Forward Networks

3.4 Embeddings and Softmax

3.5 Positional Encoding

4 Why Self-Attention

5 Training

5.1 Training Data and Batching

5.2 Hardware and Schedule

5.3 Optimizer

5.4 Regularization

Residual Dropout

Label Smoothing

6 Results

6.1 Machine Translation

6.2 Model Variations

6.3 English Constituency Parsing

7 Conclusion

(simple) [GPT-1] [GPT-2] [GPT-3]

0개의 댓글

관련 채용 정보