[Review] Attention Is All You Need

YSL · July 10, 2023

๐Ÿ“Attention Is All You Need (์›๋ฌธ)

โ—๏ธ๊ฐœ๋…์„ ์ •๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ž‘์„ฑํ•œ ๊ธ€๋กœ, ๋‚ด์šฉ์ƒ ์ž˜๋ชป๋œ ๋ถ€๋ถ„์ด ์žˆ์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ์  ์ฐธ๊ณ  ๋ฐ”๋ž๋‹ˆ๋‹ค.


The paper is organized as follows:

  • Introduction
  • Background
  • Model Architecture
    • Encoder and Decoder Stacks
    • Attention
      • Scaled Dot-Product Attention
      • Multi-Head Attention
      • Applications of Attention in our Model
    • Position-wise Feed-Forward Networks
    • Embeddings and Softmax
    • Positional Encoding
  • Why Self-Attention
  • Training
    • Training Data and Batching
    • Hardware and Schedule
    • Optimizer
    • Regularization
  • Results
    • Machine Translation
    • Model Variations
  • Conclusion

์ด ๊ธ€์€ ๋…ผ๋ฌธ ์ˆœ์„œ๋ฅผ ๊ทธ๋Œ€๋กœ ๋”ฐ๋ผ๊ฐ€๊ธฐ๋ณด๋‹ค๋Š” ๋‚ด๊ฐ€ ๊ณต๋ถ€ํ•  ๋•Œ ์ดํ•ดํ•˜๊ธฐ ํŽธํ–ˆ๋˜ ํ๋ฆ„๋Œ€๋กœ ์ž‘์„ฑํ•˜๋ ค๊ณ  ํ•œ๋‹ค.

Existing Sequence-to-Sequence Models

+) Sequence modeling: a task that takes an input sequence and generates another sequence as output.

Seq2seq models such as RNN, LSTM, and GRU

  • take the whole input at once and process it together (X)
  • process the input sequentially, one sequence position $t$ at a time (O)

๋”ฐ๋ผ์„œ tt์‹œ์ ์˜ hidden state hth_t๊ฐ€ (t+1)(t+1)์‹œ์ ์˜ hidden state๋ฅผ ๊ณ„์‚ฐํ•  ๋•Œ ์‚ฌ์šฉ๋œ๋‹ค. ์ด๋Š” ๋งค ๋‹จ์–ด๋งˆ๋‹ค ์ƒ์„ฑ๋˜๋Š” hidden state๊ฐ€ ๊ทธ ์ „ ์‹œ์ ์˜ sequence ์ •๋ณด๋ฅผ ํ•จ์ถ•ํ•˜๊ณ  ์žˆ๋‹ค๋Š” ์žฅ์ ์ด ๋œ๋‹ค.

However, because the input is processed sequentially, the following limitations exist.

  • as the sequence gets longer, the memory and computation burden grows
  • information about each word is only propagated forward → it is hard to capture relationships among all words

In particular, the long-term dependency problem that arises when sequences get long is the biggest limitation of seq2seq models.

Transformer

The Transformer was designed to overcome these limitations of existing seq2seq models. It is a model built entirely on the attention mechanism: dependencies between input and output positions are modeled directly, regardless of their distance in the sequence.

+) Attention is not a concept first proposed with the Transformer. Before the Transformer, attention was often combined with RNNs. The reason the Transformer drew so much attention is that it is built from the attention mechanism alone, without any recurrent layers.

Attention
: a way of deciding which words to focus on depending on the context
Attention was introduced to mitigate the information loss that occurs when a single context vector is used in the encoder-decoder structure.
๐Ÿ“ 3. Attention [์ดˆ๋“ฑํ•™์ƒ๋„ ์ดํ•ดํ•˜๋Š” ์ž์—ฐ์–ด์ฒ˜๋ฆฌ]

Transformer Architecture 📚 (based on several blog posts; sources are listed in the References)

Like seq2seq models, the Transformer consists of an Encoder + Decoder structure. However, the inside of the Encoder and Decoder contains no recurrent or convolution layers; it is built only from self-attention layers and fully connected layers.

Encoder

Input Embedding

This step converts each word into a vector of numeric values. To compute dependencies between words, the words must first go through this step and become numeric vectors.
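
A minimal sketch of the lookup, with a hypothetical toy vocabulary and a randomly initialized table (in a real model the embedding matrix is a learned parameter):

```python
import numpy as np

vocab = {"<pad>": 0, "i": 1, "like": 2, "apples": 3}      # hypothetical toy vocabulary
d_model = 8                                               # 512 in the paper

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in practice

token_ids = [vocab[w] for w in ["i", "like", "apples"]]
x = embedding_table[token_ids]                            # (seq_len, d_model): one vector per word
```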

Positional Encoding

Because the Transformer takes the whole input sequence at once, it does not inherently consider order information. However, for tasks such as translation, summarization, and word classification, the result often changes significantly depending on the order of the input words, so order information must be taken into account.

๋”ฐ๋ผ์„œ ์ „์ฒด ์ž…๋ ฅ sequence์—์„œ ๋‹จ์–ด๋ณ„ ์œ„์น˜ ์ •๋ณด๋ฅผ ๊ฐ ๋‹จ์–ด์˜ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์— ์ถ”๊ฐ€ํ•˜๊ณ ์ž positional encoding์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉํ•œ positional encoding ๋ฐฉ์‹์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

  • even-indexed elements: $PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
  • odd-indexed elements: $PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

Why periodic functions are used
📝 The meaning and purpose of Positional Encoding in Transformer and NeRF

  • to keep the positional encoding values within [-1, 1], so that a word does not drift too far from its original meaning
  • by giving the periodic functions a variety of frequencies, to reduce the probability that two different positions end up with overlapping positional encodings
    ex) "My favorite fruit is an Apple" vs. "Apples grown from seed tend to be very different from those of their parents, and the resultant fruit frequently lacks desired characteristics."
    → When positional encoding is applied to "Apple" in these two sentences, it is the same word but its position information is very different. Periodic functions are used to account for this.
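
A minimal NumPy sketch of the encoding above (function and variable names are mine):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)  # a different frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                     # even-indexed elements
    pe[:, 1::2] = np.cos(angle)                     # odd-indexed elements
    return pe                                       # every value stays within [-1, 1]

pe = positional_encoding(seq_len=50, d_model=512)
# the encoding is simply added to the word embeddings: x = embedding + pe
```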

Multi-Head Attention

1. Scaled Dot-Product Attention

Scaled Dot-Product Attention์ด ์ˆ˜ํ–‰๋˜๋Š” ๊ณผ์ •์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

step 1) (the vector of a given word: $Q$) · (the matrix that stacks all word vectors of the sentence: $K$)$^T$
→ this produces a relation vector between the query word and every word.

step 2) scale $QK^T$ by dividing it by $\sqrt{d_k}$
→ this keeps the values near 0 when the softmax is applied afterwards, so the gradients stay large.
(if the softmax values are near 1, the gradients become small and training may not proceed properly)

step 3) softmax
→ this converts how strongly the word ($Q$) is correlated with all the other words into a probability distribution.

step 4) multiply $softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)$ by the matrix $V$.

⇒ As a result, a vector is produced that adds the correlation information between $Q$ and $K$ onto the original vectors, and this is the word's encoding vector.
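
The four steps above in a minimal NumPy sketch (the sizes are made-up toy values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # step 1 + step 2: QK^T scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)     # step 3: correlations as a probability distribution
    return weights @ V                     # step 4: weighted sum of V -> encoding vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))   # 3 words, d_k = d_v = 4
out = scaled_dot_product_attention(Q, K, V)             # shape (3, 4)
```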

"Self" Attention
๐Ÿ“ 4-1. Transformer(Self Attention) [์ดˆ๋“ฑํ•™์ƒ๋„ ์ดํ•ดํ•˜๋Š” ์ž์—ฐ์–ด์ฒ˜๋ฆฌ]

  • the values of Q, K, and V are identical (X)
  • Q, K, and V start from the same value (O)
    ⇒ they receive the same input, but depending on the learned weight matrices $W^Q$, $W^K$, $W^V$, the finally produced Q, K, and V values are all different.

2. Multi-Head Attention

๋™์ผํ•œ ์ž…๋ ฅ์„ ๊ฐ€์ง€๊ณ  ์—ฌ๋Ÿฌ ๊ฐœ์˜ head๋กœ ๋‚˜๋ˆ„์–ด ๋™์‹œ์— ๋ณ‘๋ ฌ์ ์œผ๋กœ Self-Attention์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.
๋”ฐ๋ผ์„œ Q, K, V์˜ ์‹œ์ž‘ ๊ฐ’์€ ๋™์ผํ•˜์ง€๋งŒ, ์ด ์ž…๋ ฅ์„ ์„ ํ˜•๋ณ€ํ™˜ํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ WQ^Q, WK^K, WV^V๊ฐ€ ๊ฐ head๋ณ„๋กœ ๋‹ฌ๋ผ ๊ตฌํ•ด์ง€๋Š” encoding vector๊ฐ€ head๋ณ„๋กœ ๋‹ค๋ฅด๋‹ค.

To reduce the number of parameters, the sizes of the Q, K, V matrices are reduced according to the formula below.
$d_Q = d_K = d_V = \frac{\text{input dimension}}{\text{num of heads}}$


๋”ฐ๋ผ์„œ MHA๋ž€,

  • input ์ž์ฒด๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์ž‘์€ ์ฐจ์›์œผ๋กœ slice (X)
  • input์œผ๋กœ๋ถ€ํ„ฐ Q, K, V๋ฅผ ๋งŒ๋“œ๋Š” ๋ณ€ํ™˜ํ–‰๋ ฌ์ธ WQ^Q, WK^K, WV^V์˜ output ์ฐจ์›์„ ์ค„์—ฌ ๊ทธ Q, K, V์— ๊ด€ํ•ด self-attention์„ ์ˆ˜ํ–‰ (O)

โ‡’ ์–ด๋–ค ํ•œ ๋‹จ์–ด์— ๋Œ€ํ•ด ์„œ๋กœ ๋‹ค๋ฅธ ๊ธฐ์ค€์œผ๋กœ ์—ฌ๋Ÿฌ ๊ด€์ ์—์„œ์˜ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ๊ณ  ์ด ๊ณผ์ •์„ ๋ณ‘๋ ฌ์ ์œผ๋กœ์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค.
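
A minimal sketch of that idea, reusing `scaled_dot_product_attention` from the previous snippet (the final output projection $W^O$ from the paper is omitted for brevity):

```python
import numpy as np

def multi_head_attention(X, num_heads, rng):
    d_model = X.shape[-1]
    d_head = d_model // num_heads                    # reduced output dimension per head
    heads = []
    for _ in range(num_heads):
        # each head has its own W^Q, W^K, W^V acting on the SAME input X
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        heads.append(scaled_dot_product_attention(Q, K, V))
    return np.concatenate(heads, axis=-1)            # concatenate heads: back to (seq_len, d_model)

X = np.random.default_rng(1).normal(size=(3, 8))     # 3 words, d_model = 8
out = multi_head_attention(X, num_heads=2, rng=np.random.default_rng(2))   # (3, 8)
```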

Residual Connection & Layer Normalization

Residual Connection (= Skip Connection)

The original input is added to the output of each sub-layer, which prevents information about the original input from being lost as the network gets deeper. As a result, the gradient vanishing problem during backpropagation is reduced.

Layer Normalization

๊ฐ instance๋ณ„๋กœ feature์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ตฌํ•ด์„œ ๊ทธ feature ์ž์ฒด๋ฅผ ์ •๊ทœํ™”ํ•œ๋‹ค. ์ด ๊ฒฐ๊ณผ, ๋‹ค์ˆ˜์˜ sample์— ๋Œ€ํ•ด ํ‰๊ท  = 0, ๋ถ„์‚ฐ = 1์ธ ๋ถ„ํฌ๋กœ ๋ณ€ํ™˜ํ•  ์ˆ˜ ์žˆ๋‹ค.

Feed Forward

$FFN(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$
Since the preceding layers perform only linear transformations, the ReLU activation is included here to add non-linearity.
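
A minimal sketch of the formula (the weights are random stand-ins; in the paper $d_{model} = 512$ and the inner dimension is 2048):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # position-wise FFN: ReLU(x W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2      # max(0, .) is the ReLU

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                                # toy sizes (512 and 2048 in the paper)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = feed_forward(rng.normal(size=(3, d_model)), W1, b1, W2, b2)   # (3, d_model)
```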

Decoder

Masked Multi-Head Attention

Decoder์—์„œ self-attention์„ ์ˆ˜ํ–‰ํ•  ๋•Œ QKT^T์— masking์„ ์ ์šฉํ•œ๋‹ค. masking : ๋ฏธ๋ž˜ ์‹œ์  (t+ฮฑ)(t + \alpha) ๋‹จ์–ด ๋ถ€๋ถ„์ธ ํ–‰๋ ฌ ์ƒ๋‹จ์— โˆ’โˆž-โˆž๋กœ ์„ค์ •ํ•˜์—ฌ QKT^T์— ๊ณฑํ•ด์ค€๋‹ค.
โ‡’ Decoder์— ์ž…๋ ฅ๋˜๋Š” ๋ฌธ์žฅ์˜ ๊ฐ ๋‹จ์–ด์— ๋Œ€ํ•ด Q, K, V๋ฅผ ๋งŒ๋“ค์–ด์„œ ๋‹จ์–ด๋ณ„ encoding vector๋ฅผ ๋งŒ๋“ค ๋•Œ ์•ž ๋‹จ์–ด์˜ encoding vector๊ฐ€ ๋ฏธ๋ž˜ ์‹œ์ ์˜ ์ •๋ณด๋Š” ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•˜๋„๋ก ํ•œ๋‹ค.
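
A minimal sketch of the masking (the future positions of the score matrix are set to $-\infty$ before the softmax, so their attention weights become 0):

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal: position t must not attend to positions later than t
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_scores(Q, K):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores[causal_mask(len(Q))] = -np.inf    # future time steps get -inf before the softmax
    return scores                            # softmax then assigns them a weight of 0
```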

Multi-Head Attention

Encoder์˜ Q, K, V๋Š” ๋ชจ๋‘ ๋™์ผํ•œ ์ž…๋ ฅ์œผ๋กœ๋ถ€ํ„ฐ ์ƒ์„ฑ๋˜๋Š” ๋ฐ˜๋ฉด, Decoder์˜ Q, K, V๋Š” ์ž…๋ ฅ์ด ์„œ๋กœ ๋‹ค๋ฅด๋‹ค. ์ด ๊ฒฝ์šฐ๋ฅผ Encoder-Decoder attention์ด๋ผ ํ•œ๋‹ค.

Q : the decoder embedding obtained from the previous decoder layer
K, V : the embeddings obtained from the encoder

⇒ This is the process of computing, for each word of the ground-truth sentence, which words of the sentence fed into the encoder deserve more attention, i.e., a weighted average that decides how much of each input word's encoding vector to reflect when encoding each ground-truth word.
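
A minimal sketch of this cross-attention, again reusing `scaled_dot_product_attention` from above (the projection matrices are hypothetical stand-ins):

```python
import numpy as np

def encoder_decoder_attention(dec_x, enc_out, W_q, W_k, W_v):
    Q = dec_x @ W_q      # queries come from the decoder's previous layer
    K = enc_out @ W_k    # keys come from the encoder output
    V = enc_out @ W_v    # values come from the encoder output
    return scaled_dot_product_attention(Q, K, V)   # each target word attends over the source words
```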

Training

Optimizer

The Adam optimizer was used, together with a learning-rate scheduler that changes the learning rate dynamically during training (a sketch of the schedule used in the paper follows the list).

  • early in training : keep $lr$ small → so that the model does not escape too easily once it reaches a minimum
  • as the number of iterations grows : increase $lr$ proportionally → to speed up training
  • after a certain iteration threshold : gradually decrease $lr$ → to keep the model from bouncing out easily, which can happen when $lr$ is too large even though it is close to the global minimum
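
The concrete schedule in the paper is $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$ with $warmup\_steps = 4000$; a minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # linear warm-up for the first warmup_steps, then decay proportional to 1/sqrt(step)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lr(100), transformer_lr(4000), transformer_lr(40000))
```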

Regularization

Regularization prevents overfitting by keeping the model from becoming too complex, e.g., by keeping the weights $W$ from taking excessively large values. This paper uses three regularization techniques in total.

Residual Dropout

  1. apply dropout to the output of each sub-layer (the self-attention layer and the FFN layer)
  2. apply dropout to the sum of the embedding vectors and the positional encoding vectors

Label Smoothing

  1. apply label smoothing during training (a small sketch follows below)

    label smoothing
    📝 Label smoothing, When Does Label Smoothing Help?
    : a method that converts hard labels (labels binary-encoded as 0 or 1) into soft labels (labels with values between 0 and 1)
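
A minimal sketch of one common label-smoothing variant (the paper uses $\epsilon_{ls} = 0.1$; exactly how the leftover mass is spread can differ between implementations):

```python
import numpy as np

def smooth_labels(true_class, num_classes, eps=0.1):
    # hard one-hot label -> soft label: 1 - eps on the true class, eps shared by the others
    soft = np.full(num_classes, eps / (num_classes - 1))
    soft[true_class] = 1.0 - eps
    return soft

print(smooth_labels(true_class=2, num_classes=5))   # [0.025 0.025 0.9 0.025 0.025]
```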


Why Self-Attention

  1. the total computational complexity per layer
    If the sequence length $n$ is smaller than the representation dimensionality $d$, self-attention has lower complexity than an RNN, and most practical cases satisfy $n < d$. (A quick numeric comparison follows this list.)

  2. the amount of computation that can be parallelized (measured by the minimum number of sequential operations required)

    • RNN : receives the input sequentially and passes through $n$ RNN cells → $O(n)$ sequential operations
    • Self-Attention : processes all positions of the input at once → $O(1)$ sequential operations
  3. the path length between long-range dependencies in the network

    • Self-Attention : every token attends to every other token and adds in the resulting correlation information
      → maximum path length = $O(1)$
      Since this is a very small constant, long-range dependencies can be learned easily.

References

๐Ÿ“ Attention is all you need paper ๋ฝ€๊ฐœ๊ธฐ
๐Ÿ“ [๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Transformer ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ (Attention Is All You Need)
๐Ÿ“ [Paper review] Attention ์„ค๋ช… + Attention Is All You Need ๋ฆฌ๋ทฐ
๐Ÿ“ [Paper] Attention is All You Need ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ
๐Ÿ“ [๋…ผ๋ฌธ์ •๋ฆฌ] Attention is all you need
๐Ÿ“ Attention Is All You Need ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ
๐Ÿ“ Self-Attention is not typical Attention model
๐Ÿ“ [๋…ผ๋ฌธ ์Šคํ„ฐ๋”” Week 4-5] Attention is All You Need
