📄 Attention Is All You Need: Paper Review

์„œ์€์„œยท2023๋…„ 9์›” 5์ผ

0. Abstract

The dominant sequence transduction models are complex RNNs or CNNs that include an encoder and a decoder. This paper proposes the Transformer, a model based solely on attention mechanisms. Because it processes sequence data in parallel, it allows much faster training. The paper also shows that the Transformer generalizes well to other tasks.

1. Introduction

Recurrent models perform computation along the symbol positions of the input and output sequences: they generate a sequence of hidden states $h_t$, each a function of the previous hidden state $h_{t-1}$ and the input at position $t$. When processing long sequences this way, this inherently sequential computation becomes a constraint on memory and computation.
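
To make this constraint concrete, here is a minimal NumPy sketch of that recurrence (the function name, shapes, and parameters are illustrative, not from the paper); step $t$ cannot start before step $t-1$ has finished, which is exactly what blocks parallelization within a sequence:

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, h0):
    """Vanilla RNN: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t)."""
    h, states = h0, []
    for x_t in x_seq:                      # strictly sequential: step t must wait for step t-1
        h = np.tanh(W_h @ h + W_x @ x_t)
        states.append(h)
    return np.stack(states)                # (seq_len, hidden_dim)

# toy run: 5 time steps, input dim 4, hidden dim 3
out = rnn_forward(np.zeros((5, 4)), np.eye(3), np.zeros((3, 4)), np.zeros(3))
```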

Attention mechanisms can model dependencies regardless of the distance between positions in the input or output sequences, but in most cases they are used together with a recurrent network, which makes efficient parallelization impossible.
This paper proposes the Transformer, a model that uses only attention mechanisms, avoiding recurrence while modeling global dependencies between input and output.

2. Background

Models such as the Extended Neural GPU, ByteNet, and ConvS2S use CNNs to reduce sequential computation, computing hidden representations in parallel for all input and output positions. In these models, however, it becomes difficult to learn dependencies between distant input or output positions.
The Transformer reduces this to a constant number of operations, though at the cost of reduced effective resolution caused by averaging the attention-weighted positions, a side effect it counteracts with Multi-Head Attention.

  • Self-attention (intra-attention)
    An attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. It has been used successfully in a variety of tasks, including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations.
  • End-to-end memory networks
    Based on a recurrent attention mechanism instead of sequence-aligned recurrence.

This paper shows that the Transformer relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolutions.

3. Model Architecture

3.1 Encoder and Decoder Stacks

Both the encoder and the decoder are built from stacked self-attention and point-wise, fully connected feed-forward layers.

  • Encoder : a stack of 6 identical layers.
    ▶︎ Each layer has two sub-layers:
    - 1. a multi-head self-attention mechanism
    - 2. a position-wise fully connected feed-forward network

  • Decoder : a stack of 6 identical layers.
    ▶︎ Each layer has three sub-layers:
    - 1. a (masked) multi-head self-attention mechanism
    - 2. a position-wise fully connected feed-forward network
    - 3. multi-head attention over the output of the encoder stack
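
Below is a rough NumPy sketch of how one encoder layer composes its two sub-layers, using a single attention head with identity projections and toy helper names of my own. The paper additionally wraps each sub-layer in a residual connection followed by layer normalization; the sketch keeps only a plain residual for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # single-head self-attention with identity projections, just to show the data flow
    d_k = X.shape[-1]
    return softmax(X @ X.T / np.sqrt(d_k)) @ X

def feed_forward(X, W1, b1, W2, b2):
    # position-wise FFN: the same two linear maps applied independently at every position
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

def encoder_layer(X, ffn_params):
    X = X + self_attention(X)                 # sub-layer 1: self-attention (+ residual)
    X = X + feed_forward(X, *ffn_params)      # sub-layer 2: position-wise FFN (+ residual)
    return X

def encoder(X, per_layer_ffn_params):         # the encoder is a stack of N = 6 identical layers
    for ffn_params in per_layer_ffn_params:
        X = encoder_layer(X, ffn_params)
    return X
```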

3.2 Attention

3.2.1 Scaled Dot-Product Attention

query, key, value?
The Query (Q) is the element that asks; the Key (K) is, conversely, the element being asked about by the Query; the Value (V) holds the actual data values.

The computation is as follows:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Query (Q) : a vector representing a particular word
Key (K) : a matrix whose rows are the vectors of all the words in the sentence

👉🏻 Here, Q and K have dimension $d_k$ and V has dimension $d_v$.

  • Taking the dot product $QK^T$ between Q and K produces a relation vector (containing how Q and K are related).
    👉🏻 The final softmax function turns this into a probability distribution over how strongly the query word is related to every other word.
  • The reason for dividing by $\sqrt{d_k}$ : the softmax has a large gradient near 0 and an increasingly small gradient as its inputs move away from 0, so large dot products make learning difficult. Scaling the scores back toward 0 keeps the gradients large enough for training to proceed well.
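
A minimal NumPy sketch of this formula; the toy shapes and random inputs are illustrative only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relation scores between each query and every key
    weights = softmax(scores, axis=-1)   # one probability distribution per query
    return weights @ V                   # weighted sum of the values

# toy example: 4 positions, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```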

3.2.2 Multi-Head Attention

๋ชจ๋ธ์ด ๋‹ค์–‘ํ•œ ๊ด€์ ์—์„œ ๋ฌธ์žฅ์„ ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ์—ญํ• ์ด๋‹ค.

Rather than a single attention function, it is more effective to linearly project the queries, keys, and values several times and run multiple attention functions in parallel. The outputs of the individual heads are then concatenated and projected once more with a linear map.

In terms of computational cost, because $d_k = d_v = d_{model}/h$, the total cost is similar to that of single-head attention with full dimensionality.
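
A hedged sketch of this project-attend-concatenate pattern with arbitrary toy sizes ($d_{model}=8$, $h=2$); the projection matrices here are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Project X into h sets of (Q, K, V), attend in each head, concatenate, project back."""
    d_model = X.shape[-1]
    d_k = d_model // h                               # d_k = d_v = d_model / h
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)            # this head's columns of the big projections,
        heads.append(attention(X @ W_q[:, s],        # equivalent to h separate d_model x d_k maps
                               X @ W_k[:, s],
                               X @ W_v[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o      # concat all heads, then one final linear map

# toy usage: 4 positions, d_model = 8, h = 2 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h=2).shape)   # (4, 8)
```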

3.2.3 Applications of Attention in our Model

The Transformer uses multi-head attention in the following ways.

  • encoder-decoder attention
    The queries come from the previous decoder layer, while the keys and values come from the output of the encoder.
    ▶︎ This allows every position in the decoder to attend over all positions in the input sequence.
  • encoder self-attention
    The encoder contains self-attention layers. The keys, queries, and values all come from the same place, in this case the output of the previous encoder layer.
    ▶︎ This allows each position in the encoder to attend to all positions in the previous encoder layer.
  • decoder self-attention
    In the decoder's self-attention layers, each position can attend to all positions in the decoder up to and including that position; attention to later positions is masked out to preserve the auto-regressive property (see the sketch after this list).

3.4 Embeddings and Softmax

The input and output tokens are tokenized and passed through an embedding layer, and the resulting embedding vectors are what the model consumes. The embedding vectors capture context well, and the input embedding and the output embedding share the same weight matrix.
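
A small sketch of this weight sharing with toy sizes of my own choosing; the paper also shares the same matrix with the pre-softmax linear transformation and multiplies the embeddings by $\sqrt{d_{model}}$, which the sketch reflects:

```python
import numpy as np

vocab_size, d_model = 1000, 16                                     # toy sizes, not the paper's
E = np.random.default_rng(0).normal(size=(vocab_size, d_model))    # shared weight matrix

def embed(token_ids):
    # input/output embedding: row lookup in the shared matrix, scaled by sqrt(d_model)
    return E[np.asarray(token_ids)] * np.sqrt(d_model)

def output_logits(decoder_states):
    # pre-softmax linear transformation reuses the same matrix, transposed
    return decoder_states @ E.T                                    # (seq_len, vocab_size)

print(output_logits(embed([3, 7, 42])).shape)                      # (3, 1000)
```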

3.5 Positional Encoding

The Transformer uses only attention mechanisms, so by itself it cannot capture the order of the sequence. A separate 'positional encoding' is therefore added to inject this order information.
The positional encoding is computed with sine and cosine functions:

$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$
$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$

pos : position
i : dimension
⭐️ These functions were chosen because, for any fixed offset k, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, which the authors hypothesized would let the model easily learn to attend by relative positions.

