[📖 Paper Review] Attention Is All You Need (2017)

Becky's Study Lab · December 8, 2023

PaperReview


The history of NLP tasks and architectures can fairly be divided into before and after the Transformer, so I think the Transformer and self-attention are topics everyone should know. While writing this post, I studied the Transformer end to end once more and organized the whole process.
The paper notes that this work, too, was carried out at Google Brain and Google Research.

0. Abstract

  • Dominant sequence models are based on RNNs or CNNs that include an encoder and a decoder.
  • The best-performing models (at the time) connect the encoder and decoder through an attention mechanism => the Seq2Seq + Attention model

The figure below shows the Seq2Seq + Attention model!

The paper proposes the Transformer, a new and simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
=> Starting from the Seq2Seq + Attention model, which used an RNN together with attention, the RNN is stripped out and only attention operations are used.

  • Experimental results on machine translation tasks
    : The model is superior in quality while being more parallelizable and requiring significantly less time to train.
    [WMT 2014 English-to-German translation] - achieves 28.4 BLEU (improving over the previous best results, including ensembles, by more than 2 BLEU)
    [WMT 2014 English-to-French translation] - achieves a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on 8 GPUs
    : The Transformer is also shown to generalize well to other tasks, being applied successfully to English constituency parsing with both large and limited training data

🤔 What is Attention?

Because the encoder tries to compress all of the sequence information into a fixed-size representation, information loss occurs. The attention model was proposed to compensate for this problem.

Rather than having the decoder produce the output sequence from the encoder's compressed representation alone, the attention model lets the decoder look back at the entire input sentence in the encoder at every output time step. In doing so, the decoder does not treat all encoder inputs with equal weight; it assigns larger weights to the more important words to reflect their importance. In other words, the encoder's important words are emphasized and passed directly to the decoder.

Attention(Q, K, V) = Attention Value
$= \sum_i \text{similarity}(Q, K_i) \cdot V_i$

"์–ดํ…์…˜ ํ•จ์ˆ˜๋Š” ์ฃผ์–ด์ง„ '์ฟผ๋ฆฌ(Query)'์— ๋Œ€ํ•ด์„œ ๋ชจ๋“  'ํ‚ค(Key)'์™€์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ฐ๊ฐ ๊ตฌํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ตฌํ•ด๋‚ธ ์ด ์œ ์‚ฌ๋„๋ฅผ ํ‚ค์™€ ๋งตํ•‘๋˜์–ด์žˆ๋Š” ๊ฐ๊ฐ์˜ '๊ฐ’(Value)'์— ๋ฐ˜์˜ํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์œ ์‚ฌ๋„๊ฐ€ ๋ฐ˜์˜๋œ '๊ฐ’(Value)'์„ ๋ชจ๋‘ ๋”ํ•ด์„œ ๋ฆฌํ„ดํ•œ๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” ์ด๋ฅผ ์–ดํ…์…˜ ๊ฐ’(Attention Value)์ด๋ผ๊ณ  ํ•œ๋‹ค."

Qeury์— ๋Œ€ํ•˜์—ฌ ๋ชจ๋“  Key์™€์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ตฌํ•˜๊ณ , ํ•ด๋‹น ์œ ์‚ฌ๋„๋ฅผ Key์— ๋งคํ•‘๋œ Value์— ๋ฐ˜์˜ํ•œ๋‹ค.
๋ชจ๋“  Value ๊ฐ’๋“ค์„ ๋”ํ•˜์—ฌ attention value๊ฐ’์„ ์–ป๊ฒŒ ๋œ๋‹ค.
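The idea fits in a few lines. The sketch below is a minimal NumPy illustration with made-up toy vectors (not from the paper): a dot product as the similarity, a softmax over the scores, and a weighted sum of the values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

q = np.array([1.0, 0.0, 1.0])            # query vector
K = np.array([[1.0, 0.0, 1.0],           # keys
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])
V = K.copy()                              # values (here identical to the keys)

scores = K @ q                            # similarity(Q, K_i) via dot product
weights = softmax(scores)                 # attention distribution (sums to 1)
attention_value = weights @ V             # weighted sum of the values
print(weights, attention_value)
```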


➕ (ex) The Seq2Seq + Attention process

There are many attention variants; to get the most basic attention concept across, this section walks through a model that applies dot-product attention.

Q (Query): the decoder's hidden state at time step t ($s_t$)
K (Key): the encoder's hidden states at every time step ($h_i$)
V (Value): the encoder's hidden-state values at every time step

(The overall process is as follows.)

Step 1) Compute the attention scores

In dot-product attention, the score is obtained by transposing $s_t$ and taking the dot product with each encoder hidden state. In other words, every attention score is a scalar.

Step 2) Apply the softmax function to obtain the attention distribution

Applying the softmax function to $e^t$ yields a probability distribution whose values sum to 1. This is called the attention distribution, and each individual value is an attention weight. For example, suppose softmax produces attention weights of 0.1, 0.4, 0.1, and 0.4 for the outputs I, am, a, student; their sum is 1. The figure above visualizes the attention weight at each encoder hidden state by the size of a rectangle: the larger the attention weight, the larger the rectangle.

Step 3) Compute the attention value as the weighted sum of the encoder hidden states and their attention weights

Step 4) Concatenate the attention value with the decoder's hidden state at time step t

Step 5) Compute $\tilde{s}_t$, which becomes the input to the output-layer computation

Step 6) Use $\tilde{s}_t$ as the input to the output layer
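Putting Steps 1-6 together, one decoder time step can be sketched in NumPy with toy dimensions, as below. The tanh projection in Step 5 follows the common Luong-style formulation and is an assumption of this sketch, not something spelled out above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4                                     # hidden size (toy value)
H = np.random.randn(5, d)                 # encoder hidden states h_1..h_5 (keys = values)
s_t = np.random.randn(d)                  # decoder hidden state at step t (query)

scores = H @ s_t                          # Step 1: dot-product attention scores e^t
alpha = softmax(scores)                   # Step 2: attention distribution
a_t = alpha @ H                           # Step 3: attention value (weighted sum)
concat = np.concatenate([a_t, s_t])       # Step 4: concatenate with s_t
W_c = np.random.randn(d, 2 * d)           # learned projection (random stand-in)
s_t_tilde = np.tanh(W_c @ concat)         # Step 5: compute s_t~
# Step 6: s_t_tilde feeds the output (softmax) layer that predicts the next word
```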

Various kinds of attention

Several kinds of attention can be used in the Seq2Seq + Attention model, but the difference between dot-product attention and the others lies only in an intermediate formula, namely the attention score function. The attention above is called dot-product attention because its attention score is computed with a dot product.

Several ways of computing the attention score have been proposed; the commonly used attention score functions are as follows.


1. Introduction

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$ as a function of the previous hidden state $h_{t-1}$ and the input at time step t.

This inherently sequential nature precludes parallelization within training, which becomes a constraint at longer sequence lengths, since memory limits restrict batching across examples.

โ— Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ ๋‹ค์–‘ํ•œ ์ž‘์—…์—์„œ ๊ฐ•๋ ฅํ•œ ์‹œํ€€์Šค ๋ชจ๋ธ๋ง ๋ฐ ๋ณ€ํ™˜ ๋ชจ๋ธ์˜ ํ•„์ˆ˜์ ์ธ ๋ถ€๋ถ„์ด ๋˜์—ˆ์œผ๋ฉฐ, ์ž…๋ ฅ ๋˜๋Š” ์ถœ๋ ฅ ์‹œํ€€์Šค์˜ ๊ฑฐ๋ฆฌ์— ๊ด€๊ณ„์—†์ด timestamp์— ๋Œ€ํ•œ ์ข…์†์„ฑ์„ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ RNN๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉ๋œ๋‹ค.

โ— ์ด ์—ฐ๊ตฌ์—์„œ ์šฐ๋ฆฌ๋Š” ๋ฐ˜๋ณต์„ ํ”ผํ•˜๊ณ  ๋Œ€์‹  Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜์— ์ „์ ์œผ๋กœ ์˜์กดํ•˜์—ฌ ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ ์‚ฌ์ด์˜ global dependencies(์ „์—ญ ์ข…์†์„ฑ)์„ ๊ทธ๋ฆฌ๋Š” ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜์ธ Transformer๋ฅผ ์ œ์•ˆํ•œ๋‹ค.
โ— Transformer๋Š” ํ›จ์”ฌ ๋” ๋งŽ์€ ๋ณ‘๋ ฌํ™”๋ฅผ ํ—ˆ์šฉํ•˜๋ฉฐ 8๊ฐœ์˜ P100 GPU์—์„œ ๋‹จ 12์‹œ๊ฐ„ ๋™์•ˆ ๊ต์œก์„ ๋ฐ›์€ ํ›„ ๋ฒˆ์—ญ ํ’ˆ์งˆ์—์„œ ์ƒˆ๋กœ์šด ์ตœ์ฒจ๋‹จ ๊ธฐ์ˆ ์— ๋„๋‹ฌํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

2. Background

  • To reduce sequential computation, the Extended Neural GPU, ByteNet, and ConvS2S were proposed; all of them use convolutional neural networks as their basic building block, computing hidden representations in parallel for all input and output positions
    => However, they have difficulty learning dependencies between distant positions
    => The number of operations required to relate signals from two arbitrary input or output positions grows with the distance between the positions: linearly for ConvS2S and logarithmically for ByteNet

  • In the Transformer this is reduced to a constant number of operations, at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect that is counteracted with Multi-Head Attention

3. Model Architecture

The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder, shown on the left and right halves of the figure, respectively.

3.1. Encoder and Decoder Stacks

✅ Encoder structure

  • A stack of N = 6 identical layers
  • Each layer has two sub-layers:
    🔸 The first is a Multi-Head self-attention mechanism
    (Multi-Head self-attention means self-attention performed several times in parallel)
    🔸 The second is a position-wise fully connected feed-forward network
  • Each of the two sub-layers is wrapped as LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself (a residual connection followed by layer normalization)
  • To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension 512

✅ Decoder structure

  • A stack of N = 6 identical layers
  • Contains an additional layer that performs Multi-Head attention over the encoder output
  • As in the encoder, residual connections are used around each sub-layer, followed by LayerNorm
  • Contains a masked Multi-Head attention layer, so that later positions are not used when predicting earlier ones: the predictions for position $i$ can depend only on the known outputs at positions less than $i$

3.2. Attention 💡

An attention function can be described as mapping a query and a set of key-value pairs to an output.
❗ The query, keys, values, and output are all vectors!
❗ The output is computed as a weighted sum of the values!
❗ The weight assigned to each value is computed by a compatibility function of the query with the corresponding key!

3.2.1. Scaled Dot-Product Attention

  • With queries and keys of dimension $d_k$, the dot-product attention scores are scaled by dividing by $\sqrt{d_k}$.
  • This scaling keeps the dot products from growing too large in magnitude (for large $d_k$, unscaled dot products push the softmax into regions with very small gradients).
  • As mentioned earlier, the paper sets $d_k = d_{model}/\text{num\_heads} = 64$, so $\sqrt{d_k} = 8$.
  • With this scaling, the variance of the scores stays the same regardless of $d_k$.

    ์ข€ ๋” ๊ตฌ์ฒด์ ์œผ๋กœ ์„ค๋ช…ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

3.2.2. Multi-Head Attention

  • Multi-head attention consists of h Scaled Dot-Product Attention operations.

  • In other words, the Transformer performs attention several times => the effect of interpreting the input from several perspectives.

  • Self-attention was described above as operating on the word vectors of the input sentence, but strictly speaking it does not run directly on the $d_{model}$-dimensional word vectors that form the encoder's initial input; each word vector is first projected into a Q vector, a K vector, and a V vector.

  • These Q, K, and V vectors have a smaller dimension than the initial $d_{model}$-dimensional word vectors: in the paper, each word vector of dimension $d_{model}$ = 512 is converted into Q, K, and V vectors of dimension 64.

  • The value 64 is determined by another Transformer hyperparameter, num_heads: each Q, K, and V vector has dimension $d_{model}$ divided by num_heads, and the paper sets num_heads to 8.


The Transformer authors judged that performing several attention operations in parallel is more effective than performing attention once.

So the $d_{model}$ dimensions are split across num_heads heads, and num_heads parallel attention operations are performed on Q, K, and V of dimension $d_{model}$/num_heads.
The paper sets the hyperparameter num_heads to 8, so 8 parallel attention operations take place. In other words, the attention described above is carried out 8 times in parallel, and each resulting attention output matrix is called an attention head.
The weight matrices $W^Q$, $W^K$, $W^V$ are different for every one of the 8 attention heads.


Once all the parallel attention operations have been performed, the attention heads are concatenated. The concatenated attention head matrix has size (seq_len, $d_{model}$).

Because the dimension of each head is reduced, the total computational cost is similar to that of single-head attention with full dimensionality.

The concatenated attention heads are then multiplied by one more weight matrix, $W^O$, and the resulting matrix is the final output of multi-head attention. The figure above shows the concatenated attention head matrix being multiplied by $W^O$. The resulting multi-head attention matrix has the same size, (seq_len, $d_{model}$), as the sentence matrix that was the encoder's input.

3.2.3. Applications of Attention in our Model

The Transformer uses Multi-Head Attention in three different places, each with its own characteristics.

1) encoder-decoder attention layer

The queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
👉🏻 This allows every position in the decoder to attend over all positions in the input sequence.

2) self-attention layer in Encoder

In the encoder's self-attention layer, all of the keys, values, and queries come from the same place, in this case the output of the previous layer in the encoder.
👉🏻 Each position in the encoder can attend to all positions in the previous layer of the encoder.

3) self-attention layer in Decoder

The decoder's self-attention layer allows each position in the decoder to attend to all positions in the decoder up to and including that position.
👉🏻 However, to preserve the auto-regressive property, each position may attend only to positions up to and including itself (implemented with masking).
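The masking itself is simple: the scores of future positions are set to -inf so that they receive zero weight after the softmax. A small NumPy sketch with random toy scores:

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)             # raw scores QK^T / sqrt(d_k)
causal_mask = np.tril(np.ones((seq_len, seq_len)))     # lower triangle = allowed positions
masked = np.where(causal_mask == 1, scores, -np.inf)

weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # row i has zero weight on every position j > i
```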

3.3. Position-wise Feed-Forward Networks

A fully connected feed-forward network is used alongside the attention sub-layers; it is present in each layer of the encoder and the decoder (the position-wise FFNN is a sub-layer that the encoder and decoder have in common).



The network computes $\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$. Here $x$ is the matrix of size (seq_len, $d_{model}$) produced by the multi-head attention sub-layer. The weight matrix $W_1$ has size ($d_{model}$, $d_{ff}$) and the weight matrix $W_2$ has size ($d_{ff}$, $d_{model}$). As mentioned when the hyperparameters were introduced, the hidden-layer size $d_{ff}$ is 2,048 in the paper.
The parameters $W_1$, $b_1$, $W_2$, $b_2$ are exactly the same across different sentences and different words within one encoder layer, but each encoder layer has its own values.

  • ์‚ฌ์ด์— ReLU ํ™œ์„ฑํ™”๊ฐ€ ์žˆ๋Š” ๋‘ ๊ฐœ์˜ ์„ ํ˜• ๋ณ€ํ™˜์œผ๋กœ ๊ตฌ์„ฑ
  • input๊ณผ output์˜ ์ฐจ์›์€ 512, ์€๋‹‰์ธต์˜ ์ฐจ์›์€ 2048
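As a sketch, the position-wise FFN with the paper's dimensions looks like this in NumPy (random weights standing in for the learned $W_1$, $b_1$, $W_2$, $b_2$):

```python
import numpy as np

d_model, d_ff, seq_len = 512, 2048, 10
x = np.random.randn(seq_len, d_model)       # output of the multi-head attention sub-layer

W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

hidden = np.maximum(0, x @ W1 + b1)         # first linear transform + ReLU, per position
ffn_out = hidden @ W2 + b2                  # back to (seq_len, d_model)
```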

Pictured as a diagram,
the encoder has this structure.

โ—Add & Norm

In addition, residual connections and layer normalization are added to complete the encoder, as follows.

1) Residual connection


A residual connection adds a sub-layer's input to its output. As mentioned earlier, in the Transformer the input and output of a sub-layer have the same dimension, so the two can simply be added. This is why, in the encoder figure above, the arrows are drawn from each sub-layer's input to its output. Residual connections are a technique, widely used in computer vision, that helps models train.

If the sub-layer is multi-head attention, the residual connection is:

$H(x) = x + \text{Multi-head Attention}(x)$

2) Layer normalization

The result of the residual connection then goes through layer normalization.

Layer normalization computes the mean and variance over the last dimension of the tensor and uses them in a normalization formula to help training. In the Transformer, the last dimension means the $d_{model}$ dimension. The figure below marks the direction of the $d_{model}$ dimension with an arrow.
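Together, the Add & Norm step is LayerNorm(x + Sublayer(x)). A NumPy sketch over the $d_{model}$ dimension, with the LayerNorm scale and shift parameters (gamma, beta) initialized to 1 and 0:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)           # statistics over the d_model axis
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

d_model, seq_len = 512, 10
x = np.random.randn(seq_len, d_model)               # sub-layer input
sublayer_out = np.random.randn(seq_len, d_model)    # e.g. multi-head attention output

gamma, beta = np.ones(d_model), np.zeros(d_model)
out = layer_norm(x + sublayer_out, gamma, beta)     # LayerNorm(x + Sublayer(x))
```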


3.4. Embeddings and Softmax

  • Similarly to other sequence transduction models, learned embeddings are used to convert the input tokens and output tokens to vectors of dimension $d_{model}$.
  • The usual learned linear transformation and softmax function convert the decoder output into predicted next-token probabilities.
  • The same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation (sketched below).
  • In the embedding layers, the weights are multiplied by $\sqrt{d_{model}}$.
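A toy NumPy sketch of the shared weight matrix (used by both embedding layers and the pre-softmax linear transformation) and of the $\sqrt{d_{model}}$ scaling; the vocabulary size and all values here are made up for illustration.

```python
import numpy as np

vocab_size, d_model = 1000, 512
E = np.random.randn(vocab_size, d_model) * 0.01     # one weight matrix, shared

token_ids = np.array([5, 42, 7])
embedded = E[token_ids] * np.sqrt(d_model)          # embedding lookup, scaled by sqrt(d_model)

decoder_out = np.random.randn(3, d_model)           # decoder output for 3 positions
logits = decoder_out @ E.T                          # pre-softmax linear layer reuses E
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = probs / probs.sum(axis=-1, keepdims=True)   # next-token probabilities
```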

3.5. Positional Encoding

  • Since the model contains no recurrence and no convolution, some information about the relative or absolute position of the tokens must be injected for the model to make use of the order of the sequence. (The word positions have to be communicated in some other way.)
  • To this end, a positional encoding is added to the input embeddings at the bottoms of the encoder and decoder stacks.

  • Word-order information is added by summing sine and cosine values into the embedding vectors:
    $PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/d_{model}})$
    $PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$
  • $pos$ is the position of the embedding vector within the input sentence, and $i$ is the index of a dimension inside the embedding vector. According to these formulas, the sine function is used when the dimension index is even and the cosine function when it is odd; that is, sine for $(pos, 2i)$ and cosine for $(pos, 2i+1)$ (see the sketch below).
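The sinusoidal encoding is easy to generate directly from the formulas above; a NumPy sketch:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000, (2 * i) / d_model)   # angle for each (pos, i) pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions: cosine
    return pe

d_model, seq_len = 512, 50
embeddings = np.random.randn(seq_len, d_model)          # toy input embeddings
x = embeddings + positional_encoding(seq_len, d_model)  # added to the input embeddings
```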

4. Why Self-Attention

$n$ is the sequence length, $d$ is the representation dimension, $k$ is the kernel size of convolutions, and $r$ is the size of the neighborhood in restricted self-attention.

(1) Self-Attention vs. Recurrent
A recurrent layer computes step by step, so compared with self-attention, which computes attention with a single matrix operation, it requires O(n) sequential operations. => The self-attention layer is faster than the recurrent layer in terms of sequential operations.
Also, when n < d, the self-attention layer beats the recurrent layer in per-layer computational complexity as well.

(2) Self-Attention vs. Convolutional
Convolutional layers are generally more expensive than recurrent layers, by a factor of $k$.
Their maximum path length is also longer (worse) than that of self-attention.

(3) Self-attention can yield more interpretable models
Individual attention heads not only clearly learn to perform different tasks; many also appear to exhibit behavior related to the syntactic and semantic structure of sentences.

5. Training

5.1. Training Data and Batching

  • Training data: the standard WMT 2014 English-German dataset, consisting of about 4.5 million sentence pairs (sentences encoded using byte-pair encoding, with a shared source-target vocabulary of about 37,000 tokens)
  • Training data: the larger WMT 2014 English-French dataset (tokens split into a 32,000 word-piece vocabulary)
  • Batching: each training batch contains a set of sentence pairs with approximately 25,000 source tokens and 25,000 target tokens

5.2. Hardware and Schedule

  • Models were trained on one machine with 8 NVIDIA P100 GPUs
  • For the base model, which uses the hyperparameters described throughout the paper, each training step took about 0.4 seconds
  • The base model was trained for a total of 100,000 steps, or 12 hours

5.3. Optimizer

  • Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$)
  • Learning rate formula: $lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})$
  • With warmup_steps = 4000, the learning rate increases linearly at the start and then gradually decays (see the sketch below)
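The schedule can be sketched in a few lines of Python; the formula is the one quoted above, and the printed values are only illustrative.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)                        # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 10000, 100000):
    print(s, round(transformer_lr(s), 6))      # rises linearly, then decays ~ 1/sqrt(step)
```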

5.4. Regularization

Three regularization techniques are used:

  • Residual dropout 1: dropout is applied to the output of each sub-layer before it is added to the sub-layer input ($P_{drop}$ = 0.1)
  • Residual dropout 2: dropout is applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks ($P_{drop}$ = 0.1)
  • Label smoothing: $\epsilon_{ls}$ = 0.1 (this hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score; sketched below)
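Label smoothing with $\epsilon_{ls}$ = 0.1 replaces the one-hot target with a softened distribution. The NumPy sketch below assumes the smoothing mass is spread uniformly over the incorrect classes, which is one common formulation.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    smoothed = np.full((len(target_ids), vocab_size), eps / (vocab_size - 1))
    smoothed[np.arange(len(target_ids)), target_ids] = 1.0 - eps   # correct class keeps 0.9
    return smoothed

targets = np.array([2, 0, 3])
print(smooth_labels(targets, vocab_size=5))    # each row sums to 1
```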

6. Results

6.1. Machine Translation

With a fraction of the training cost, the Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests.

6.2. Model Variations

To evaluate the importance of the different components of the Transformer, the base model was varied in different ways while measuring the change in English-to-German translation performance on the development set, newstest2013.

  • (A): single-head attention is 0.9 BLEU worse than the best setting, but quality also drops off with too many heads (an appropriate number of heads here was 16)
  • (B): reducing the attention key size $d_k$ hurts model quality (suggesting that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial)
  • (C, D): bigger models are better, and dropout is very helpful in avoiding over-fitting
  • (E): replacing the sinusoidal positional encoding with learned positional embeddings makes little difference

6.3. English Constituency Parsing

To evaluate whether the Transformer can generalize to other tasks, experiments on English constituency parsing were performed.


In contrast to RNN sequence-to-sequence models, the Transformer outperformed the BerkeleyParser, even though it was trained only on the Wall Street Journal (WSJ) training set of 40K sentences.

7. Conclusion

✅ Presents the Transformer, the first sequence transduction model based entirely on attention
: In this work, the recurrent layers most commonly used in encoder-decoder architectures are replaced with multi-headed self-attention, yielding a model that relies entirely on attention
✅ For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers
✅ Achieves a new state of the art (SOTA) on both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks
: The best Transformer model outperforms all previously reported ensemble models

💡 Plans to apply the model to other tasks
💡 Plans to extend the Transformer to problems involving input and output modalities other than text, and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio, and video
💡 Making generation less sequential is another stated research goal

