(2017) Attention Is All You Need

Gyuha Park · August 20, 2021

Classification Deep Learning transformer

Paper Review

0. Abstract

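Most current sequence models use a CNN or an RNN as the encoder and decoder. The best-performing ones connect the encoder and decoder through an attention mechanism. This paper removes the CNN and RNN entirely and proposes the Transformer, a simple network based solely on attention. This makes the computation parallelizable and significantly reduces training time.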

1. Introduction

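Attention mechanisms have become an essential part of sequence modeling because they allow dependencies to be modeled regardless of their distance in the input or output sequence. In most cases, however, they are still used together with an RNN. The Transformer instead relies entirely on an attention mechanism to capture global dependencies between input and output. It trains faster and parallelizes much better.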

2. Background

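Models such as ByteNet were proposed to reduce sequential computation; they all use CNNs so that hidden representations can be computed in parallel. However, the number of operations these models need to relate two input or output positions grows with the distance between those positions, which makes dependencies between distant positions hard to learn. In the Transformer this number of operations is reduced to a constant, at the cost of reduced effective resolution caused by averaging attention-weighted positions. Multi-Head Attention counteracts that loss of resolution. Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. It has been used in a variety of tasks such as reading comprehension and summarization. The Transformer, however, is the first model to compute its representations with self-attention alone, without an RNN or a CNN.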

3. Model Architecture

The most competitive sequence models use an encoder-decoder structure. The encoder maps an input sequence of symbol representations $(x_1,...,x_n)$ to a sequence of continuous representations $z=(z_1,...,z_n)$ . Given $z$ , the decoder then generates an output sequence of symbols $(y_1,...,y_m)$ one element at a time. Generation is auto-regressive: at each step that produces $y_i$ , the previously generated symbols are also used as input. The Transformer follows this overall encoder-decoder architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder.

1) Encoder and Decoder Stacks

Encoder:

The encoder is a stack of N=6 identical layers.
Each layer has two sub-layers.
The first is a multi-head self-attention mechanism, and the second is a simple position-wise fully connected feed-forward network.
A residual connection is applied around each sub-layer, followed by layer normalization.
To make the residual connections straightforward, all sub-layers and the embedding layers produce outputs of dimension $d_{model}=512$ .
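
The residual-plus-normalization pattern described above, i.e. LayerNorm(x + Sublayer(x)), can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the learned gain and bias of layer normalization are omitted, and the matrix `W` is a hypothetical stand-in for a real sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    # The learned gain and bias are left out to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_sublayer(x, sublayer):
    # Add the sub-layer output to its input, then normalize.
    return layer_norm(x + sublayer(x))

d_model = 512
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))                # 10 positions
W = rng.normal(size=(d_model, d_model)) * 0.01    # hypothetical stand-in sub-layer
out = residual_sublayer(x, lambda h: h @ W)
print(out.shape)  # (10, 512)
```

The addition `x + sublayer(x)` only works because the sub-layer output has the same width as its input, which is why every sub-layer keeps the dimension at $d_{model}$ .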

Decoder:

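The decoder is also a stack of 6 identical layers.
In addition to the two sub-layers of an encoder layer, each decoder layer has a third sub-layer that performs multi-head attention over the output of the encoder stack.
As in the encoder, a residual connection is applied around each sub-layer, followed by layer normalization.
The self-attention sub-layer in the decoder stack is also masked to prevent positions from attending to subsequent positions. This ensures that the prediction for position i can depend only on the known outputs at positions before i.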
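
The decoder mask can be pictured as a lower-triangular boolean matrix. The sketch below is only an illustration; the helper name `causal_mask` is made up for this example. Row i may look at columns j <= i, and because the decoder inputs are the outputs shifted right by one position, this still means the prediction for position i only uses outputs before i.

```python
import numpy as np

def causal_mask(n):
    # mask[i, j] is True when position i is allowed to attend to position j.
    return np.tril(np.ones((n, n), dtype=bool))

mask = causal_mask(4)
print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```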

2) Attention

An attention function maps a query and a set of key-value pairs to an output. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

$Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$

The attention used in the paper is Scaled Dot-Product Attention. The two most common attention functions are additive attention and dot-product attention; the Transformer uses the latter with an added scaling factor. That is, the weights are obtained by scoring each key against the query and normalizing the scores with a softmax. As $d_k$ (the key dimension) grows, the dot products grow large in magnitude and push the softmax into regions where its gradients are very small. To counteract this effect, the dot products are scaled by $1/\sqrt{d_k}$ .
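
The need for the scaling factor can be made concrete under a simplifying assumption, the same one the paper uses for illustration: the components of $q$ and $k$ are independent random variables with mean 0 and variance 1. Then

$q\cdot k=\sum_{i=1}^{d_k}q_ik_i,\qquad E[q\cdot k]=0,\qquad Var(q\cdot k)=d_k$

so the typical magnitude of a raw score grows like $\sqrt{d_k}$ , and dividing by $\sqrt{d_k}$ brings the variance back to 1 whatever the key dimension is.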

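The weights, which decide where to attend, are computed as follows. Take the dot product of the query with each key and scale it by dividing by the square root of the key dimension $d_k$ . Then apply a softmax to obtain the weights that multiply the values. In practice, many queries and their keys and values are packed into matrices Q, K, and V, and the whole computation is done as matrix operations.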
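
The steps above can be written as a short NumPy sketch. The shapes (3 queries, 5 key-value pairs, $d_k=8$ ) are arbitrary values chosen for the example, and the function names are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))    # 3 queries, d_k = 8
K = rng.normal(size=(5, 8))    # 5 keys
V = rng.normal(size=(5, 16))   # 5 values, d_v = 16
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (3, 16) (3, 5)
```

Each output row is a convex combination of the rows of `V`, so the output has one row per query and the width of the values.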

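In multi-head attention, the queries, keys, and values are linearly projected h times, with different learned projections, down to $d_k$ , $d_k$ , and $d_v$ dimensions, and attention is performed on each of the h projected versions. The h head outputs are then concatenated and projected once more to give the final values. This lets the model take into account information from different representation subspaces at different positions.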
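
A compact sketch of this procedure is shown below, assuming $d_k=d_v=d_{model}/h$ . The h per-head projections are stored as column blocks of one $d_{model}\times d_{model}$ matrix each, which is a common implementation shortcut and not something the review itself specifies. All weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o, h):
    n_q, d_model = X_q.shape
    n_kv = X_kv.shape[0]
    assert d_model % h == 0
    d_k = d_model // h
    # Project, then split the feature axis into h heads: (h, n, d_k).
    Q = (X_q @ W_q).reshape(n_q, h, d_k).transpose(1, 0, 2)
    K = (X_kv @ W_k).reshape(n_kv, h, d_k).transpose(1, 0, 2)
    V = (X_kv @ W_v).reshape(n_kv, h, d_k).transpose(1, 0, 2)
    # Scaled dot-product attention, run for all heads at once.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n_q, n_kv)
    heads = softmax(scores, axis=-1) @ V               # (h, n_q, d_k)
    # Concatenate the heads back to d_model and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n_q, d_model)
    return concat @ W_o

d_model, h, n = 512, 8, 6
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
Y = multi_head_attention(X, X, W_q, W_k, W_v, W_o, h)   # self-attention: Q, K, V from X
print(Y.shape)  # (6, 512)
```

Passing the same array as `X_q` and `X_kv` gives self-attention; passing decoder states as `X_q` and encoder outputs as `X_kv` gives encoder-decoder attention.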

The Transformer uses multi-head attention in three different ways.

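In the encoder-decoder attention layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This lets every position in the decoder attend over all positions in the input sequence.
In the encoder's self-attention layers, the queries, keys, and values all come from the output of the previous encoder layer.
The decoder also contains self-attention layers. To preserve the auto-regressive property, leftward information flow is blocked by masking, so a position cannot attend to the positions after it.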

3) Position-wise Feed-Forward Networks

In addition to the attention sub-layers, each layer of the encoder and decoder contains a fully connected feed-forward network that is applied to every position separately and identically. It consists of two linear transformations with a ReLU activation in between.

$FFN(x)=\max(0,xW_1+b_1)W_2+b_2$

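The linear transformations are the same across positions, but their parameters differ from layer to layer. Another way to describe this is as two convolutions with kernel size 1. The input and output dimension is 512, and the hidden layer has dimension 2048.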
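
The formula for $FFN(x)$ translates almost directly into NumPy. In the sketch below the weights are random placeholders with the dimensions stated above (512 and 2048); the last lines check the "position-wise" property, namely that each row is transformed independently of the others.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # max(0, x W1 + b1) W2 + b2, applied to every row (position) of x.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))        # 10 positions
y = position_wise_ffn(x, W1, b1, W2, b2)
print(y.shape)  # (10, 512)

# Applying the network to one position alone gives the same row.
single = position_wise_ffn(x[3:4], W1, b1, W2, b2)
print(np.allclose(single[0], y[3]))  # True
```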

4) Embedding and Softmax

As in most sequence transduction models, learned embeddings are used to convert the input and output tokens to vectors. The representation produced by the decoder is passed through a fully connected layer and a softmax to give the probabilities of the next token.

5) Positional Encoding

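Because the Transformer contains no RNN or CNN, information about the relative or absolute position of the tokens must be injected so that the model can make use of the order of the sequence. To do this, positional encodings are added to the input embeddings at the bottom of the encoder and decoder stacks. The positional encodings have the same dimension as the embeddings, so the two can be summed. There are many possible choices of positional encoding; the paper uses sine and cosine functions of different frequencies.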

$PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})$

$PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})$

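Here pos is the position of the word and i is the dimension index. In other words, each dimension of the positional encoding corresponds to a sinusoid. With this choice, for any fixed offset $k$ , $PE_{pos+k}$ can be written as a linear function of $PE_{pos}$ , so the authors hypothesized that the model could easily learn to attend by relative position. They also expected that it would let the model handle sequences longer than those seen during training.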
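
The two formulas can be computed for all positions at once. The sketch below is one straightforward way to do it in NumPy; `max_len=50` is an arbitrary example value. The last lines check the linearity claim numerically for the first sine/cosine pair (i = 0, frequency 1), where shifting the position by k amounts to multiplying by a 2x2 rotation matrix that depends only on k.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)  # (50, 512)

# PE_{pos+k} as a linear function of PE_{pos}, shown for dimensions (0, 1).
p, k, w = 7, 3, 1.0
rot = np.array([[np.cos(w * k), np.sin(w * k)],
                [-np.sin(w * k), np.cos(w * k)]])
print(np.allclose(rot @ pe[p, 0:2], pe[p + k, 0:2]))  # True
```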

4. Training

For the WMT 2014 English-German dataset, sentences were encoded with byte-pair encoding. For WMT 2014 English-French, tokens were split into word-pieces.
On 8 P100 GPUs, training took 12 hours (100K steps) for the base model and 3.5 days (300K steps) for the big model.
The Adam optimizer was used with learning-rate warmup.
Dropout and label smoothing were used for regularization.
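
The review only mentions warmup; the schedule given in the paper itself is $lrate=d_{model}^{-0.5}\cdot\min(step^{-0.5},\ step\cdot warmup\_steps^{-1.5})$ with warmup_steps = 4000. A minimal sketch of that formula follows (the function name is my own): the rate rises linearly during warmup and then decays with the inverse square root of the step number.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps, then inverse-sqrt decay.
    step = max(step, 1)  # avoid 0 ** -0.5 at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 100000):
    print(s, transformer_lr(s))
```

The two branches of the `min` meet exactly at `step == warmup_steps`, which is where the learning rate peaks.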

5. Results

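On WMT 2014 English-to-German, the big Transformer model outperforms the best previously reported models, including ensembles, by more than 2.0 BLEU. The base model is also competitive with earlier models once training cost is taken into account. On WMT 2014 English-to-French, the big model achieves a better BLEU score than previous single models at less than 1/4 of the training cost. Training cost was estimated by multiplying the training time, the number of GPUs used, and an estimate of the compute capacity of each GPU.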
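
As a rough worked example of that cost estimate, the sketch below multiplies the training times reported in the Training section by the GPU count and a per-GPU throughput. The 9.5 TFLOPS figure for a P100 is an assumption taken from my recollection of the paper's footnote, not from this review, so treat the resulting numbers as approximate.

```python
def training_flops(hours, n_gpus, tflops_per_gpu):
    # time (s) x number of GPUs x sustained FLOP/s per GPU
    return hours * 3600 * n_gpus * tflops_per_gpu * 1e12

P100_TFLOPS = 9.5  # assumed sustained single-precision throughput of one P100

base = training_flops(12, 8, P100_TFLOPS)        # base model: 12 hours on 8 GPUs
big = training_flops(3.5 * 24, 8, P100_TFLOPS)   # big model: 3.5 days on 8 GPUs
print(f"base: {base:.1e} FLOPs, big: {big:.1e} FLOPs")
# base: 3.3e+18 FLOPs, big: 2.3e+19 FLOPs
```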

6. Conclusion

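The model proposed in the paper is an encoder-decoder model based solely on attention. On translation tasks it can be trained much faster than RNN-based or CNN-based architectures. On English-to-German it outperformed even ensemble models.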

7. How the Transformer Works

How the Transformer works (a single word)

1) Query, Key, Value

Embedding dimension ( $d_{model}$ ) → query, key, value dimension ( $d_{model}/h$ )

In this example $d_{model}=4,\ h=2$ , so each query, key, and value has dimension 2.

2) Scaled Dot-Product Attention

The attention value is computed with the formula $Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$ .

How the Transformer works (matrices)

1) Query, Key, Value

This is the same example in matrix form. With $d_{model}=4,\ h=2$ as before, a query, key, and value are generated for every word at once.

2) Scaled Dot-Product Attention

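The dot products of all queries with all keys are computed in a single matrix multiplication to give the attention energies. After scaling and the softmax, the result is multiplied by the value matrix to give the attention output.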

The mask matrix makes it possible to ignore specific words. The masked entries are set to negative infinity so that the corresponding softmax outputs become close to 0.
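
The effect of the mask can be sketched as follows, assuming a boolean mask where `True` means "keep" (the convention and the function name are chosen for this example). The scores are all zero, so every unmasked position in a row gets an equal share of the weight.

```python
import numpy as np

def masked_softmax(scores, mask):
    # Masked-out entries become -inf, so exp() turns them into exactly 0.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))                     # uniform attention energies
mask = np.tril(np.ones((3, 3), dtype=bool))   # word i sees words 0..i only
weights = masked_softmax(scores, mask)
print(weights)
# rows: [1, 0, 0], [0.5, 0.5, 0], [1/3, 1/3, 1/3]
```

Every row needs at least one unmasked entry; a fully masked row would produce NaN values in this sketch.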

Multi-Head Attention

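Attention is computed separately for each head from that head's queries, keys, and values. The head outputs are then concatenated and multiplied by an output weight matrix, so the result has the same dimension as the input.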