Transformers

이원준 · April 2, 2026

NLP


1. NLP progress at a glance

Year | Model | Significance
2014 | RNN Seq2Seq | First neural machine translation with sequence-to-sequence models
2017 | Transformer | "Attention Is All You Need": a paradigm shift
2018.06 | GPT-1 | Pretrained language models arrive
2018.10 | BERT | Bidirectional pretrained model
2019.02 | GPT-2 | Large-scale generative model
2020.05 | GPT-3 | 175 billion parameters
2022.11 | ChatGPT | Conversational AI based on GPT-3.5
2023 | GPT-4, Bard | Multimodal large language models
2024 | GPT-4o, Gemini | Bigger and more capable

💡 NLP Mountain: once you understand the Transformer (2017), the models that followed (BERT, GPT, ChatGPT) become much easier to follow. The Transformer is the starting point of modern NLP.


2. Limitations of LSTM: why did the Transformer appear?

Before the Transformer, LSTM was the standard architecture in NLP. It had three fundamental limitations.

① Vanishing gradient

Processing a 100-word sentence with an LSTM
→ is like passing through a neural network 100 layers deep
→ during backpropagation the gradient keeps shrinking, so the model learns little from the early words

Example:
  "I grew up in Busan from childhood, speak the Busan dialect,
   love Busan food, and even went to university in Busan,
   so my native language is of course [   ]"

  → the LSTM struggles to remember the distant "Busan"
  → it has trouble predicting the answer, "Korean"

② Transfer learning is difficult

LSTM (as typically used):
  Sentiment analysis → needs labeled sentiment data
  Translation        → needs labeled translation data
  Summarization      → needs labeled summarization data
  → a separate labeled dataset for every task

Transformer:
  Pretrain once on large-scale text
  → reuse on many tasks through fine-tuning

③ Sequential processing → inefficient on GPUs

LSTM:
  word₁ → word₂ → word₃ → ... → word₁₀₀
  Processed in order → time steps cannot run in parallel → the GPU sits underused

Transformer:
  word₁, word₂, ..., word₁₀₀ processed at the same time
  → all positions computed in parallel during training → much higher GPU utilization

3. Key characteristics of the Transformer

💡 Paper title: "Attention Is All You Need" (Google, 2017)
The first model to implement seq2seq using only the attention mechanism, with no RNN and no CNN

Key results

Training data: WMT 2014 English-German translation (1.6GB)
Hardware:      8 P100 GPUs
Training time: 3.5 days (the paper's big model)
Result:        state-of-the-art BLEU score at the time

Transformer vs LSTM

Item | LSTM | Transformer
Processing | Sequential | Parallel
GPU utilization | Low | Very high
Long-range dependencies | Weak (vanishing gradient) | Strong (attention)
Transfer learning | Difficult | Easy
Architecture | RNN-based | Attention-based

🏗️ Transformer architecture

  1. It keeps the existing encoder-decoder structure.
  • encoder: useful for understanding contextual meaning, e.g. BERT
  • decoder: text generation, i.e. good at writing, e.g. OpenAI's GPT models
  2. Encoder and decoder layers can be stacked repeatedly, so as long as enough data is available the model can keep being scaled up.

image (31).png

Step | What happens
Step 1 | Words → 512-dimensional numeric vectors (Embedding)
Step 2 | With no RNN, words are not processed in order → order information is added with sin/cos (Positional Encoding)
Step 3 | Compute how related the words in a sentence are to each other (Self-Attention)
Step 4 | Analyze from 8 viewpoints at once (Multi-Head Attention)
Step 5 | Connect Encoder ↔ Decoder and refer back to the input (Cross-Attention)
Step 6 | Generate sequentially while blocking future words (Masked Attention)
Step 7 | Convert to probabilities and pick the next word (Linear + Softmax)

4. Self-Attention: capturing relations between the words in a sentence

🧠 What is Self-Attention?

Attention applied to the sequence itself → a mechanism that computes "how strongly each word in the sentence is related to the other words"

Sentence: "The coach explained the tactics to the players"

When understanding the word "tactics":
  coach ↔ tactics:     high relevance (the coach designs the tactics)
  players ↔ tactics:   high relevance (the audience for the tactics)
  explained ↔ tactics: high relevance (the tactics are what is explained)

→ the meaning of "tactics" is understood in the context of the whole sentence

🔑 Query, Key, Value: a search-engine analogy

Self-Attention is similar to a database lookup.

YouTube search example:
  Query:  "football highlights"       ← what I am looking for
  Key:    each video's title/tags     ← the search index
  Value:  the actual video content    ← the result that is returned

  The Query is compared against every Key to compute a similarity
  → the Values are returned weighted by that similarity, so the best matches contribute most

In Self-Attention:
  Query:  the word currently being processed  ← "what is this word related to?"
  Key:    every word in the sentence          ← "this is the kind of word I am"
  Value:  each word's actual information      ← "if I am relevant, take this information"

📐 Self-Attention in 4 steps

Step 1: Create the Q, K, V vectors

Each word's embedding vector (512 dimensions) is multiplied
by the weight matrices Wq, Wk, Wv (each [512, 64] per head)
to produce 64-dimensional Q (query), K (index), and V (information) vectors

Why 512 dimensions → 64 dimensions:
  512 / num_heads(8) = 64
  each of the 8 heads works with 64 dimensions

Wq, Wk, Wv are learned automatically during training:
  they start out initialized with random numbers

  Training data:
    Input:  "나는 커피를 마셨다"
    Target: "I drank coffee"

  When the prediction is wrong → compute the loss
                               → backpropagation
                               → update Wq, Wk, Wv slightly

  After millions of repetitions
    → Wq, Wk, Wv converge to matrices that produce
      Q, K, V vectors that make translation work well

Step 2: Compute attention scores (scaled dot product)

The query is scored against every K vector in the sentence.

Sentence: "나는 커피를 마셨다" ("I drank coffee")

Attention scores for "커피" (coffee):
  나는 (I)             : Q_커피 · K_나는    = 3.2
  커피 (coffee)        : Q_커피 · K_커피    = 8.7  ← itself
  를 (object particle) : Q_커피 · K_를      = 1.1
  마셨다 (drank)       : Q_커피 · K_마셨다  = 5.4

Scaling: divide each score by √64 = 8
→ keeps softmax out of its saturated region (when the scores are too large, the gradient after softmax is ≈ 0)
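To see what the scaling buys, here is a minimal sketch that runs softmax on the example scores with and without the division by √dₖ. The `softmax` helper is written for this illustration; the four scores are the illustrative numbers from the example, not values from a trained model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [3.2, 8.7, 1.1, 5.4]   # illustrative Q·K scores from the example
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

raw_weights = softmax(raw_scores)        # almost all weight on one word
scaled_weights = softmax(scaled_scores)  # weight spread across the words

print([round(w, 3) for w in raw_weights])
print([round(w, 3) for w in scaled_weights])
```

Without scaling, about 96% of the weight lands on a single word and the others receive almost no gradient. With scaling, the largest weight drops to roughly 0.39.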

Step 3: Softmax turns the scores into an attention distribution

Scaled scores (3.2, 8.7, 1.1, 5.4 divided by 8) ≈ 0.40, 1.09, 0.14, 0.68
Softmax converts them into a probability distribution

Attention distribution for "커피" (coffee):
  나는   : 0.20
  커피   : 0.39  ← highest
  를     : 0.15
  마셨다 : 0.26
  Total  : 1.00

Step 4: Weighted sum of the value vectors → attention value

Final context vector for "커피" =
  0.20 × V_나는
+ 0.39 × V_커피
+ 0.15 × V_를
+ 0.26 × V_마셨다

→ a new representation of "커피" that reflects the context of the whole sentence (the attention value of "커피", i.e. the context vector for that word)
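The four steps for a single query can be sketched in a few lines of plain Python. The key, value, and query vectors below are hypothetical 2-dimensional toy values chosen for readability, not the 64-dimensional vectors of a real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # Steps 2 and 3: scaled dot products, then softmax.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Step 4: weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Hypothetical toy vectors (d_k = d_v = 2).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [0.0, 2.0]

weights, context = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The query points along the second axis, so the second and third keys get equal, larger weights and the context vector leans toward their values.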

Summary

image.png

📊 Formula summary

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V

Q·Kᵀ   : similarity (dot product) between each Query and every Key
√dₖ    : scaling (dₖ = 64, √64 = 8)
softmax: converts the scores to a probability distribution
·V     : weighted sum of the Value vectors

🔢 Processing the whole sentence at once with matrix operations

Input: 4 words, 512 dimensions each
       X = [4, 512]

Q = X · Wq  →  [4, 64]    (Wq: [512, 64])
K = X · Wk  →  [4, 64]    (Wk: [512, 64])
V = X · Wv  →  [4, 64]    (Wv: [512, 64])

Attention = softmax(Q·Kᵀ/8) · V  →  [4, 64]
            (Q·Kᵀ is [4, 4]: one score for every pair of words)

→ all 4 words are processed simultaneously (in parallel)!
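The matrix form can be checked with a small pure-Python sketch. The sizes are shrunk (d_model = 8, d_k = 2) so it runs instantly, and the weights are random placeholders rather than trained values. The helper names `matmul`, `transpose`, `softmax_rows`, and `rand_matrix` are made up for this example.

```python
import math
import random

def matmul(a, b):
    """Multiply an [n, m] matrix by an [m, p] matrix (lists of lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, wq, wk, wv):
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(q[0])
    scores = matmul(q, transpose(k))                         # [n, n]
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)                           # each row sums to 1
    return matmul(weights, v), weights                       # [n, d_k], [n, n]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d_model, d_k = 4, 8, 2   # toy sizes standing in for 4, 512, 64
x = rand_matrix(n, d_model)
wq, wk, wv = (rand_matrix(d_model, d_k) for _ in range(3))
out, weights = self_attention(x, wq, wk, wv)
print(len(out), len(out[0]))          # 4 2
print(len(weights), len(weights[0]))  # 4 4
```

Nothing in `self_attention` loops over time steps: every row of the output comes from the same few matrix products, which is what makes the computation easy to parallelize.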

5. Multi-Head Attention: looking from several viewpoints

Several attention operations run in parallel, and the outputs of the attention heads are concatenated so that relations between words are captured from different viewpoints.

🎯 Why multi-head?

With a single attention head:
  "drank coffee" → captures only, say, the verb-object relation

With several attention heads (illustrative roles; what each head attends to is learned, not assigned):
  Head 0: grammatical relations  (subject-verb)
  Head 1: semantic relations     (coffee-caffeine)
  Head 2: reference relations    (pronoun-noun)
  ...
  Head 7: tense relations        (past-present)

→ relations between words are analyzed from 8 different viewpoints at once

⚙️ How Multi-Head Attention works

Input (512 dimensions)
    ↓
Projected into 8 heads (64 dimensions each)
    ↓
Attention computed independently in each head
    ↓
Concatenate the 8 head outputs → [seq_len, 512]
    ↓
Multiply by the output weight matrix W₀ → [seq_len, 512]

Why the output size equals the input size

The Transformer stacks 6 encoder layers
→ the output of each encoder layer is the input of the next
→ input (512) = output (512) must be preserved

📐 Full computation flow

Input sentence: "Startup investment increased" (3-word example)

[3, 512]   input embeddings
    ↓
Projected into per-head Q/K/V matrices
    ↓
Head#0 [3, 64]: subject-verb relation
Head#1 [3, 64]: tense information
...
Head#7 [3, 64]: semantic relation
    ↓
Concatenate → [3, 512]
    ↓
× W₀ [512, 512]: dense layer
    ↓
[3, 512]   final Multi-Head Attention output

💡 Difference from BERT: BERT (base) uses d_model=768, num_heads=12 (12×64=768)
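The split and concatenate bookkeeping is easy to get wrong, so here is a minimal shape-only sketch. It assumes a common implementation choice: one [n, 512] projection is sliced into 8 heads of width 64, which is equivalent to using 8 separate [512, 64] projection matrices. `split_heads` and `concat_heads` are hypothetical helper names, and the matrix holds placeholder numbers.

```python
def split_heads(matrix, num_heads):
    """Split an [n, d_model] matrix into num_heads matrices of shape [n, d_model // num_heads]."""
    d_model = len(matrix[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_k = d_model // num_heads
    return [[row[h * d_k:(h + 1) * d_k] for row in matrix] for h in range(num_heads)]

def concat_heads(heads):
    """Concatenate per-head [n, d_k] matrices back into one [n, d_model] matrix."""
    n = len(heads[0])
    return [[value for head in heads for value in head[i]] for i in range(n)]

n, d_model, num_heads = 3, 512, 8   # n is an arbitrary token count
q = [[float(i * d_model + j) for j in range(d_model)] for i in range(n)]

heads = split_heads(q, num_heads)
restored = concat_heads(heads)

print(len(heads), len(heads[0]), len(heads[0][0]))   # 8 3 64
print(len(restored), len(restored[0]))               # 3 512
```

In a full layer, attention would run on each element of `heads` before `concat_heads`, and the concatenated result would then be multiplied by the [512, 512] output matrix.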


6. Position-wise Feed-Forward Network (dense layers)

Every encoder layer and every decoder layer contains one.

🏗️ Structure

Multi-Head Attention output (seq_len, 512)
    ↓
Linear (512 → 2048)  +  ReLU
    ↓
Linear (2048 → 512)
    ↓
Output (seq_len, 512)

💡 Why is it needed?

Self-Attention: learns relations (interactions) between words
Feed Forward:   transforms each word's representation into a richer one

Analogy:
  Self-Attention = a meeting (team members share information with each other)
  Feed Forward   = individual study (each member internalizes what they received)

What "position-wise" means

"Position-wise" = applied to each word independently

Input: [word1, word2, word3, word4]
→ the same FFN (same weights) is applied to each word independently
→ no information is exchanged between words (that is attention's job)
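A small sketch makes the "independent per position" property concrete: if the FFN really treats positions independently, reversing the input order must simply reverse the output order. The sizes are toy values (4 and 16 instead of 512 and 2048) and the weights are random placeholders.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def linear(v, w, b):
    """y = v · W + b for one vector v; W has shape [len(v), len(b)]."""
    return [sum(v[i] * w[i][j] for i in range(len(v))) + b[j] for j in range(len(b))]

def ffn(v, w1, b1, w2, b2):
    """Two-layer feed-forward network applied to a single position."""
    return linear(relu(linear(v, w1, b1)), w2, b2)

def position_wise_ffn(x, w1, b1, w2, b2):
    """Apply the same FFN (shared weights) to every position independently."""
    return [ffn(v, w1, b1, w2, b2) for v in x]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, d_ff = 4, 16   # toy sizes standing in for 512 and 2048
w1, b1 = rand_matrix(d_model, d_ff), [0.0] * d_ff
w2, b2 = rand_matrix(d_ff, d_model), [0.0] * d_model

x = rand_matrix(3, d_model)   # 3 positions
y = position_wise_ffn(x, w1, b1, w2, b2)
y_reversed = position_wise_ffn(x[::-1], w1, b1, w2, b2)

print(y_reversed == y[::-1])   # True: positions do not influence each other
```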

7. Residual Connection & Layer Normalization

Every sublayer in the Transformer keeps its input and output at the same dimension, which makes residual connections possible and in turn mitigates the vanishing gradient problem.

🔗 Residual connection

Plain network:
  output = F(input)

Residual connection (ResNet style):
  output = F(input) + input
             ↑          ↑
     transformed value  original value added as-is

→ even if F learns poorly, the original value (the input) is preserved
→ the gradient has a direct path back through the addition, which mitigates vanishing gradients

Intuition

"Learn new things without forgetting what you already know"

Example:
  Prior knowledge: "coffee is a beverage"
  New learning:    "coffee contains caffeine"

  Without a residual connection: the new learning overwrites the prior knowledge, and gradients must pass through every transformation → higher risk of vanishing gradients
  With a residual connection:    prior knowledge + new learning → stable training

📐 Layer normalization

Normalizes each token's vector across its feature dimension to mean 0 and standard deviation 1 (followed by a learned scale and shift)

Effects:
  More stable training
  Faster convergence
  Less sensitivity to shifts in the distribution of layer inputs (internal covariate shift)

Structure of each encoder/decoder sublayer

Input
 ↓
[Multi-Head Attention or FFN]
 ↓
Add (residual connection: + input)
 ↓
LayerNorm
 ↓
Output
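The Add and LayerNorm wrapper can be sketched as below. This is a simplified version under stated assumptions: the learned scale and shift parameters of layer normalization are omitted, and `toy_sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance across its features."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_norm(x, sublayer):
    """Post-norm sublayer: LayerNorm(x + Sublayer(x)), applied per token."""
    return [layer_norm([a + b for a, b in zip(row, sub_row)])
            for row, sub_row in zip(x, sublayer(x))]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the FFN; it only has to keep the shape.
    return [[2.0 * a for a in row] for row in x]

x = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 5.0, 2.0]]
out = add_and_norm(x, toy_sublayer)

for row in out:
    print([round(a, 3) for a in row])
```

The residual addition only works because `toy_sublayer` returns the same shape as its input, which is the reason every sublayer keeps the 512-dimensional width.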

Reference: ways to address vanishing gradients

Method | Description
ReLU activation | max(0, z): the gradient stays at 1 over the positive range
Careful weight initialization | Xavier (for tanh), He (for ReLU)
Batch Normalization | Normalizes the input to each layer
Residual Connection | Connects the input directly to the output (ResNet)

What is Batch Normalization?

Problem (internal covariate shift):
  The distribution of the inputs keeps changing as they pass through the layers
  → the first input layer is normalized toward a standard normal distribution, but that effect is diluted as the data passes through the hidden layers.

Solution:
  Normalize the input of each layer (the previous layer's output) to mean 0 and standard deviation 1, using statistics computed over the mini-batch
  → keeps the distributions stable
  → enables faster training

8. Positional Encoding: adding position information

A positional encoding value is added to each embedding vector to inject word-order information.

🗺️ Why is it needed?

LSTM: processes words in order → position information is included automatically
Transformer: processes all words at once → no position information!

Sentences: "The cat caught the mouse"
           "The mouse caught the cat"

→ without position information, self-attention sees the same set of words and risks treating the two sentences as identical
→ position information has to be added separately

📐 Positional Encoding formula

Even dimensions: PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
Odd dimensions:  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

pos: position of the word in the sentence (0, 1, 2, ...)
i:   index of the sin/cos dimension pair within the embedding vector (0, 1, ..., d_model/2 - 1)
d_model: embedding dimension (512)
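The formula translates almost line by line into code. This minimal sketch assumes an even `d_model` and computes the encoding for one position at a time; `positional_encoding` is a name chosen for this example.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dimensions, cos on odd ones."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))   # dimension 2i
        pe.append(math.cos(angle))   # dimension 2i + 1
    return pe

pe0 = positional_encoding(0, 512)
pe1 = positional_encoding(1, 512)

print(pe0[:4])                          # [0.0, 1.0, 0.0, 1.0]
print([round(v, 3) for v in pe1[:4]])
```

In the model this vector is added element-wise to the word's 512-dimensional embedding, so the same word at two positions produces two different inputs.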

Intuition

Each position gets its own unique "position fingerprint"

Position 0: [sin(0), cos(0), sin(0), cos(0), ...]
Position 1: [sin(1), cos(1), sin(0.965), cos(0.965), ...]
Position 2: [sin(2), cos(2), sin(1.929), cos(1.929), ...]
(for d_model = 512 the second pair uses pos / 10000^(2/512) ≈ pos × 0.965)

→ the same word gets a different input value depending on its position (because the positional encoding is added)
→ the network can learn relative position relations easily

Why sin/cos?

Properties of sinusoids (sin/cos):
  Periodic functions that repeat regularly
  → PE(pos + k) is a linear function of PE(pos), so relative-position patterns are easy to learn
  → may extrapolate to sentences longer than those seen in training

Example:
  Like "the beat in music", periodic patterns
  give each word position a unique representation

💡 Difference from BERT: instead of a formula-based positional encoding, BERT uses learnable positional embeddings


9. Decoder structure

The decoder generates one word at a time using its previous outputs plus the encoder's information (context).

1. The encoder understands the sentence (output of the top encoder layer) → produces Key and Value
2. The decoder builds the Query from the previous words
   - Training: uses the ground-truth target sentence (teacher forcing)
   - Inference: uses the words it has generated so far (auto-regressive)
3. Encoder-Decoder Attention looks up the important parts of the input
4. The next word is generated
5. The result is fed back in as input (repeat)

🏗️ The decoder's 3 sublayers

Decoder layer = 3 sublayers
  ① Masked Multi-Head Self-Attention
  ② Encoder-Decoder Attention (Cross-Attention)
  ③ Position-wise Feed-Forward Network

① Masked Multi-Head Self-Attention (the part that differs from the encoder)

Why is masking needed?

Translation training example:
  Input:  "나는 커피를 마셨다"
  Target: "I drank coffee"

When the decoder predicts "drank":
  It should be able to see: "I" (the earlier word)
  It must not see:          "coffee" (a future word) ← cheating!

→ the scores at future positions are masked to -∞
→ after softmax the probability at those positions is 0
→ only the current and earlier positions can be attended to

Masking matrix example (4-token output):

        I    drank  coffee  <EOS>
I       0    -inf   -inf    -inf
drank   0      0    -inf    -inf
coffee  0      0      0     -inf
<EOS>   0      0      0       0

→ each word can see only itself and earlier words
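The masking step can be reproduced with a short sketch. The scores are hypothetical (all equal to 1.0) so that the effect of the mask is easy to read off the result; `causal_mask` and `masked_softmax` are names chosen for this example.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """mask[i][j] is 0 where position i may look at position j (j <= i), else -inf."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Add the mask to the scores, then apply softmax row by row."""
    out = []
    for row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(row, mask_row)]
        mx = max(masked)
        exps = [math.exp(v - mx) for v in masked]   # exp(-inf) is exactly 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[1.0] * 4 for _ in range(4)]   # hypothetical uniform scores
weights = masked_softmax(scores, causal_mask(4))

for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Each row still sums to 1, but all of the probability mass sits on the current and earlier positions.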

② Encoder-Decoder Attention (Cross-Attention)

Query:      produced from the decoder sublayer below (this is the only difference; otherwise it works like the encoder's multi-head attention)
Key, Value: produced from the output of the top encoder layer

→ lets the decoder decide which part of the input sentence to focus on

Example:
  When generating "coffee" during translation
  → Query: the decoder state after "I drank"
  → Key/Value: "나는", "커피를", "마셨다" (encoder)
  → the highest attention goes to "커피를" (coffee) → "coffee" is generated

Training vs. inference

During training (teacher forcing):
  Decoder input = the ground-truth target, shifted right by one position
  → all time steps can be processed in parallel

During inference (auto-regressive):
  Decoder input = the predictions from the previous steps
  → words are generated sequentially, one at a time
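The inference-time loop can be sketched without a real model. Here `toy_model` is a hypothetical lookup table standing in for the decoder, and the probabilities are invented for illustration. Only the control flow (predict, append, feed back, stop at `<EOS>`) reflects the description above.

```python
def greedy_decode(next_token_probs, bos="<BOS>", eos="<EOS>", max_len=10):
    """Generate tokens one at a time, feeding each prediction back as input."""
    tokens = [bos]
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)     # dict: token -> probability
        best = max(probs, key=probs.get)     # argmax (greedy choice)
        tokens.append(best)
        if best == eos:
            break
    return tokens

def toy_model(tokens):
    # Hypothetical stand-in for the decoder: a lookup keyed on the last token.
    table = {
        "<BOS>": {"I": 0.9, "coffee": 0.1},
        "I": {"drank": 0.8, "coffee": 0.2},
        "drank": {"coffee": 0.7, "I": 0.3},
        "coffee": {"<EOS>": 0.95, "drank": 0.05},
    }
    return table[tokens[-1]]

print(greedy_decode(toy_model))   # ['<BOS>', 'I', 'drank', 'coffee', '<EOS>']
```

During training this loop is unnecessary: the whole shifted target sequence is fed in at once and the causal mask prevents each position from seeing later ones.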

10. The Transformer's 3 kinds of attention

Type | Location | Query | Key/Value | Role
Encoder Self-Attention | Encoder | Encoder input | Encoder input | Captures relations among words in the input sentence
Masked Decoder Self-Attention | Decoder | Decoder input | Decoder input | Attends only to earlier words in the output sentence
Encoder-Decoder Attention | Decoder | Decoder | Encoder output | Decides which part of the input sentence to focus on

11. Key hyperparameters

These values can be changed; they are the settings the authors found to work well (the base model in the paper).

Parameter | Value | Meaning
d_model | 512 | Embedding vector size, shared by all layers
num_layers | 6 | Number of encoder/decoder layers
num_heads | 8 | Number of parallel heads in Multi-Head Attention
d_ff | 2048 | FFN hidden layer size (d_model × 4)
d_k = d_v | 64 | Q/K/V dimension per head (512/8)

💡 BERT (base): d_model=768, num_heads=12, d_ff=3072 (768×4)
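The relationships between these numbers can be captured in a small config object. This is a sketch: `TransformerConfig` is a hypothetical class for this post, not an API from any library, and `num_layers=12` for BERT-base is added here from general knowledge rather than from the table above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformerConfig:
    d_model: int = 512
    num_layers: int = 6
    num_heads: int = 8
    d_ff: int = 2048

    def __post_init__(self):
        # Each head gets an equal slice of d_model, so the division must be exact.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

    @property
    def d_k(self) -> int:
        """Per-head dimension of Q, K and V."""
        return self.d_model // self.num_heads

base = TransformerConfig()
bert_base = TransformerConfig(d_model=768, num_layers=12, num_heads=12, d_ff=3072)

print(base.d_k, bert_base.d_k)   # 64 64
```

Both configurations end up with 64 dimensions per head, and in both cases d_ff is four times d_model.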


12. End-to-end summary

🗺️ The full Transformer pipeline

Input sentence: "오늘 날씨가 맑다" ("The weather is clear today")

① Words → embedding vectors (512 dimensions)
② + Positional Encoding (adds position information)
③ Encoder ×6:
    → Multi-Head Self-Attention (captures relations between words)
    → Add & LayerNorm
    → Position-wise FFN (enriches each word's representation)
    → Add & LayerNorm
④ Final encoder output → passed to the decoder as K, V

⑤ Output start token <BOS>
⑥ + Positional Encoding
⑦ Decoder ×6:
    → Masked Multi-Head Self-Attention (attends only to earlier outputs)
    → Add & LayerNorm
    → Encoder-Decoder Attention (refers to the input sentence)
    → Add & LayerNorm
    → Position-wise FFN
    → Add & LayerNorm
⑧ Linear + Softmax → probability of the next word
⑨ Pick the word with the highest probability (argmax, i.e. greedy decoding; the paper itself used beam search)
⑩ Repeat until <EOS> is produced

⚖️ Key concepts compared

Concept | Core idea | Problem solved
Self-Attention | Weighted sum of V using Q·K similarity | Long-range dependencies
Multi-Head | Parallel attention from several viewpoints | Capturing diverse relations at once
Positional Encoding | Adds position information with sin/cos | Loss of order information
Residual Connection | output = F(x) + x | Vanishing gradient
Layer Normalization | Normalizes the activations at each sublayer | Training instability
Masked Attention | Sets future positions to -∞ | Prevents the decoder from cheating

🎯 Wrap-up quiz

Q1. Why is the Transformer faster than an LSTM?

Answer: An LSTM processes words sequentially, while the Transformer processes all words in parallel at the same time. The attention computation is a set of matrix multiplications that a GPU can execute in one pass.

Q2. Why does Self-Attention apply scaling (dividing by √dₖ)?

Answer: The larger the Q·K dot products become, the closer the gradient of softmax gets to 0 (vanishing gradient). Dividing by √dₖ keeps the values small, so softmax produces a less peaked distribution and training is more stable.

Q3. Why does the decoder need masking?

Answer: During training the decoder receives the entire target sentence as input. Without masking it could cheat by looking at future words when making a prediction. Masking forces each position to attend only to itself and earlier words.

Q4. Why is Positional Encoding needed?

Answer: The Transformer processes all words at once, so it has no order information. sin/cos functions generate a unique vector for each position, which is added to the embedding.

