[Review] Transformer: Attention Is All You Need

redgreen · April 13, 2022

PaperReview


1. Introduction

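Earlier models were RNN-based or used attention only in part. The Transformer is built entirely on attention, and it outperformed the existing models.

ConvS2S and ByteNet are convolution-based models designed to reduce sequential computation. In them, the number of operations needed to relate two input or output positions grows with the distance between the positions: linearly (O(N)) for ConvS2S and logarithmically (O(logN)) for ByteNet. The Transformer reduces this to a constant number of operations (O(1)).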

2. Model architecture

1) Encoder

Each layer consists of two sub-layers, multi-head self-attention and a position-wise feed-forward network, and the same layer is stacked N(=6) times.

A residual connection followed by LayerNorm is applied around each sub-layer.
: $LayerNorm(x + sublayer(x))$

To make the residual connections possible, every sub-layer and the embedding layer produce outputs of dimension $d_{model}=512$.
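The "Add & Norm" step above can be sketched in plain Python for a single position vector. This is only an illustration: the learned gain and bias of LayerNorm are omitted, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + sublayer(x)) for a single position vector x."""
    y = sublayer(x)
    # The residual sum only works if the sub-layer keeps the d_model dimension.
    assert len(y) == len(x), "sub-layer must preserve the d_model dimension"
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical sub-layer standing in for attention or the feed-forward network.
def toy_sublayer(x):
    return [0.5 * v for v in x]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

The length check inside `add_and_norm` is the reason every sub-layer shares the same output dimension: without it, `x` and `sublayer(x)` could not be added.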

2) Decoder

Like the encoder, a stack of N(=6) identical layers

Adds a sub-layer that performs multi-head attention over the encoder output

As in the encoder, a residual connection and LayerNorm are applied around each sub-layer

Uses masked multi-head attention to prevent positions from attending to subsequent positions

The prediction for position $i$ uses only the positions before $i$

Subsequent positions are set to -inf so that their weights become 0 after the softmax
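A minimal sketch of the masking step, in plain Python with made-up scores. The mask here lets position $i$ attend to positions $j \le i$; in the paper this is combined with decoder inputs that are shifted right by one position, which is what makes the prediction for position $i$ depend only on earlier outputs.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Replace scores[i][j] with -inf wherever j > i (a later position)."""
    n = len(scores)
    return [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]

# Made-up attention scores for a sequence of length 3.
scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]

# exp(-inf) is 0, so masked positions receive exactly zero weight.
weights = [softmax(row) for row in causal_mask(scores)]
```

Each row of `weights` still sums to 1, but all of the probability mass sits on the current and earlier positions.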

Reference blog: https://acdongpgm.tistory.com/221

3) Attention

Maps a query and a set of key-value pairs to an attention output

The dot product of the query with each key, passed through a softmax, gives the weights,

and the attention output is the sum of the values weighted by them (= weighted sum)

Scaled Dot-Product Attention

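Queries and keys have dimension $d_k$, and values have dimension $d_v$

The query-key dot products are scaled by dividing them by $\sqrt{d_k}$ before the softmax
- Additive attention, which uses a feed-forward network, is an alternative, but the paper chose dot-product attention because it is faster and more space-efficient.

Without the scaling, a large $d_k$ makes the dot products large in magnitude, which pushes the softmax into regions with extremely small gradients. The scaling was added to counteract this.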
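Written out, scaled dot-product attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The choice of $\sqrt{d_k}$ can be motivated as follows, under the assumption (used in the paper's own argument) that the components of a query $q$ and a key $k$ are independent random variables with mean 0 and variance 1:

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad \mathbb{E}[q \cdot k] = 0, \qquad \mathrm{Var}(q \cdot k) = d_k$$

$$\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1$$

So dividing by $\sqrt{d_k}$ keeps the spread of the scores independent of the dimension.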

Multi-Head Attention

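The paper reports that using several attention heads works better than a single head, although quality also drops when there are too many heads.

The heads compute their weighted values in parallel, and the results are concatenated and projected to give the final output.

Multi-head attention lets the model attend to information from different representation subspaces.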
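In the paper's notation, the concat and projection steps are:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

Here $W_i^Q, W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$. The paper uses $h=8$ heads with $d_k = d_v = d_{model}/h = 64$, so the total cost is similar to single-head attention with the full dimension.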

Encoder-decoder attention

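Keys and values: from the encoder output

Queries: from the previous decoder layer

Because the decoder's queries are combined with keys and values taken from the encoder, every decoder position can attend to all positions of the input sequence.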

4) ETC

Position-wise Feed-Forward Network

$W_1: (d_{model}, \; d_{ff})$
$W_2: (d_{ff}, \;d_{model})$
$d_{ff}=2048, \;d_{model}=512$
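The shapes above belong to the paper's formula $FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$, which is applied to each position separately with the same weights. A rough sketch with toy sizes ($d_{model}=2$, $d_{ff}=3$) and made-up weights rather than the real 512 and 2048:

```python
def affine(x, W, b):
    """Compute x W + b for a row vector x and a matrix W stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position vector x."""
    hidden = [max(0.0, h) for h in affine(x, W1, b1)]  # ReLU, size d_ff
    return affine(hidden, W2, b2)                      # back to size d_model

# Made-up weights: W1 is (d_model, d_ff) = (2, 3), W2 is (d_ff, d_model) = (3, 2).
W1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.1, -0.1]

# "Position-wise": the same function is applied to every position independently.
sequence = [[1.0, 1.0], [2.0, 0.0]]
outputs = [position_wise_ffn(x, W1, b1, W2, b2) for x in sequence]
```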

Embedding and Softmax

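Learned embeddings convert the input and output tokens to vectors of dimension $d_{model}$

A learned linear transformation is also used to predict the next-token probabilities from the decoder output

The embedding layers and this linear transformation share the same weight matrix. In the embedding layers, the weights are additionally multiplied by $\sqrt{d_{model}}$.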

Positional Encoding

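Because the model has no recurrence or convolution, information about the relative or absolute position of each token is added so that the model can make use of the order of the sequence.

The paper uses sine and cosine functions of different frequencies rather than learned positional embeddings, because sinusoids may let the model extrapolate to sequences longer than those seen in training.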
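The paper defines the encoding as $PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$. A small plain-Python sketch of that table follows; the function name and the toy sizes are only for illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        # even_dim plays the role of 2i in the formula.
        for even_dim in range(0, d_model, 2):
            angle = pos / (10000 ** (even_dim / d_model))
            row.append(math.sin(angle))  # dimension 2i
            row.append(math.cos(angle))  # dimension 2i + 1
        table.append(row[:d_model])      # trim in case d_model is odd
    return table

pe = positional_encoding(max_len=5, d_model=4)
```

Because the table is a fixed function of `pos`, it can be evaluated for positions beyond the training length, which is the extrapolation argument mentioned above.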

Regularization

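Residual Dropout: applied to the output of each sub-layer before it is added to the sub-layer input (the residual connection) and normalized, and also to the sum of the embeddings and the positional encodings. $P_{drop}=0.1$

Label Smoothing: this hurts perplexity, because the model learns to be more unsure, but it improves the BLEU score.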
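To see why label smoothing hurts perplexity, here is a small sketch. It assumes one common variant of smoothing (mixing the one-hot target with a uniform distribution over all classes) and $\epsilon = 0.1$; the exact formulation used in the paper's code may differ, and the prediction vector is made up.

```python
import math

def smooth_labels(true_index, num_classes, eps=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == true_index else uniform
            for k in range(num_classes)]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

one_hot = [0.0, 0.0, 1.0, 0.0]
smoothed = smooth_labels(true_index=2, num_classes=4)

# A made-up, very confident prediction for class 2.
confident = [0.001, 0.001, 0.997, 0.001]

loss_hard = cross_entropy(one_hot, confident)
loss_smooth = cross_entropy(smoothed, confident)
```

Against the smoothed target, the very confident prediction is penalized (`loss_smooth` is larger than `loss_hard`), so the model is pushed toward less extreme probabilities.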

Understanding it a bit more

A blog worth referring to

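1) Multiply the input ( $X$ ) by the matrices $W^Q, W^K, W^V$ to obtain $Q, K, V$ .

2) Multiply $Q$ by $K^T$ to obtain the raw attention scores.

3) Divide the scores by $\sqrt{d_k}$ so that they do not grow too large, then apply a softmax to obtain weights that sum to 1. Multiplying these weights by $V$ gives the weighted sum.

4) The resulting $Z$ is passed on to the next block.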
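The four steps above can be sketched for a single head in plain Python. The input and the identity projection matrices are toy values chosen so the result is easy to check by hand; they are not from the paper.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    # Step 1: project the input into queries, keys and values.
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    # Step 2: raw scores from Q K^T.
    scores = matmul(Q, transpose(K))
    # Step 3: scale by sqrt(d_k), softmax each row, then weight the values.
    d_k = len(K[0])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    Z = matmul(weights, V)
    # Step 4: Z is what gets passed on to the next block.
    return Z, weights

# Toy input: 3 positions, 2 features; identity projections so that Q = K = V = X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
Z, weights = self_attention(X, identity, identity, identity)
```

With these toy values, the first position attends equally to itself and to the third position (both have a dot product of 1 with it) and less to the second one.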