📜 Transformer: Attention is All You Need (2017)

hh_mon__a · January 30, 2025

Transformer: Introduction

  • ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ(NLP)๋Š” ์˜ค๋žœ ์‹œ๊ฐ„๋™์•ˆ ์ˆœํ™˜์‹ ๊ฒฝ๋ง(RNN, LSTM)์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฐœ์ „ํ•จ
  • ์ด๋Ÿฌํ•œ ๋ชจ๋ธ๋“ค์€ (1) ๊ธด ๋ฌธ์žฅ์—์„œ ์ •๋ณด๋ฅผ ์žƒ์–ด๋ฒ„๋ฆฌ๊ณ (Long-Term Dependency Problem), (2) ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์ด ์–ด๋ ต๋‹ค๋Š” ๋‹จ์  ์กด์žฌ
  • 2017๋…„, Google์—์„œ "Attention is All You Need"๊ฐ€ ๋‚˜์˜ด
  • ์ด ๋…ผ๋ฌธ์€ Self-Attention์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ Transformer ๋ชจ๋ธ์„ ์ œ์•ˆํ•˜๋ฉด์„œ, RNN ์—†์ด๋„ ์ตœ๊ณ ์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Œ์„ ์ฆ๋ช…ํ•จ

๐Ÿ’ก๋…ผ๋ฌธ์˜ ํ•ต์‹ฌ ๊ธฐ์—ฌ:

  • Trains much faster than previous RNN-based models
  • Parallelizable, so it scales efficiently to large datasets
  • Surpasses previous models in translation quality
  • In short, the paper's message: "Attention alone is enough"

Transformer: Model Architecture

[Figure: Transformer model architecture]

๐Ÿ—๏ธํ•ต์‹ฌ ๊ฐœ๋…

1. Self-Attention

  • Transformer์˜ ํ•ต์‹ฌ์€ Self-Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜
  • ๊ธฐ์กด RNN ๋ชจ๋ธ์€ ๋‹จ์–ด๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜์ง€๋งŒ, Self-Attention์€ ๋ฌธ์žฅ ์ „์ฒด๋ฅผ ํ•œ๋ฒˆ์— ๋ณด๊ณ  ๋‹จ์–ด๋“ค ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Œ
  • ์ฆ‰, ๋ฌธ์žฅ์˜ ๋ชจ๋“  ๋‹จ์–ด๊ฐ€ ์„œ๋กœ๋ฅผ ์ฐธ์กฐํ•  ์ˆ˜ ์žˆ๋„๋ก ๋งŒ๋“œ๋Š” ๊ฒƒ

✅ How Self-Attention works:

  • ๊ฐ ๋‹จ์–ด๋Š” ๋ฌธ์žฅ ๋‚ด ๋‹จ์–ด๋“ค๊ณผ ์–ผ๋งˆ๋‚˜ ๊ด€๋ จ์ด ์žˆ๋Š”์ง€(๊ฐ€์ค‘์น˜)๋ฅผ ๊ณ„์‚ฐ
  • ์ด๋ฅผ ์œ„ํ•ด Query(Q), Key(K), Value(V) ์„ธ๊ฐ€์ง€ ๋ฒกํ„ฐ๋ฅผ ์‚ฌ์šฉ
  • Q, K, V๋Š” ๊ฐ™์€ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์—์„œ ์ƒ์„ฑํ•˜์ง€๋งŒ, ๊ฐ๊ธฐ ๋‹ค๋ฅธ ์—ญํ• ์„ ํ•จ
  • ๊ฐ™์€ ์ž…๋ ฅ์—์„œ๋งŒ ๋‚˜์˜ค์ง€๋งŒ, ๋‹ค๋ฅธ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์„ ํ†ตํ•ด ๋ณ€ํ™˜๋˜๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ๊ธฐ ๋‹ค๋ฅธ ์—ญํ• ์„ ํ•จ
    • Query(Q): ํ˜„์žฌ ๋‹จ์–ด๊ฐ€ ๋‹ค๋ฅธ ๋‹จ์–ด๋ฅผ ์ฐพ์„ ๋•Œ ์‚ฌ์šฉํ•˜๋Š” ๋ฒกํ„ฐ
    • Key(K): ๋‹ค๋ฅธ ๋‹จ์–ด๋“ค์ด ํ˜„์žฌ ๋‹จ์–ด๋ฅผ ์ฐพ์„ ๋•Œ ์‚ฌ์šฉํ•˜๋Š” ๋ฒกํ„ฐ
    • Value(V): ๋‹จ์–ด๊ฐ€ ์‹ค์ œ๋กœ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ์˜๋ฏธ ์ •๋ณด๋ฅผ ๋‹ด๋Š” ๋ฒกํ„ฐ
  • ์ˆ˜์‹
    • QK^T: Query์™€ Key ๋‚ด์ ์„ ๊ตฌํ•ด ๊ฐ ๋‹จ์–ด๊ฐ€ ๋‹ค๋ฅธ ๋‹จ์–ด์™€ ์–ผ๋งˆ๋‚˜ ๊ด€๋ จ ์žˆ๋Š”์ง€ ๊ณ„์‚ฐ
    • softmax: ๊ฐ€์ค‘์น˜๋ฅผ ํ™•๋ฅ  ๊ฐ’์œผ๋กœ ๋ณ€ํ™˜
    • V: ๊ฐ€์ค‘์น˜๋ฅผ ์ ์šฉํ•œ ์ตœ์ข… ๊ฐ’
  • RNN ์—†์ด๋„ ๋ฌธ๋งฅ์„ ๋ฐ˜์˜ํ•œ ๋‹จ์–ดํ‘œํ˜„์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Œ

2. Multi-Head Attention

  • A single attention pattern captures only one kind of relationship between words, so Multi-Head Attention is introduced
  • Several Self-Attention heads are applied in parallel, each learning word relationships from a different perspective
  • This allows a word's meaning to be represented more richly
  • The same sentence can thus be interpreted along several different dimensions at once (a sketch follows below)
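
Continuing the sketch above (it reuses self_attention, X, and rng), multi-head attention runs several independently parameterized heads and concatenates their outputs; the paper's final output projection W^O is omitted here for brevity:

```python
def multi_head_attention(X, heads):
    """Run several independent self-attention heads and concatenate the results.

    heads: list of (W_q, W_k, W_v) tuples -- each head has its own projections,
           so each head can focus on a different kind of relationship.
    """
    outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    return np.concatenate(outputs, axis=-1)      # (seq_len, num_heads * d_k)

# two heads over the same toy input
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
print(multi_head_attention(X, heads).shape)      # (4, 8)
```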

3. Positional Encoding

  • Self-Attention์€ ์ˆœ์„œ๋ฅผ ๊ณ ๋ คํ•˜์ง€ ๋ชปํ•จ
  • RNN์€ ์ˆœ์ฐจ์ ์œผ๋กœ ํ•™์Šตํ•˜์—ฌ ์ˆœ์„œ๋ฅผ ๋ฐ˜์˜ํ•˜์ง€๋งŒ, Transformer๋Š” ์ด๋ฅผ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด Positional Encoding์„ ์‚ฌ์šฉ
    Positional Encoding ์ˆ˜์‹
  • ํ™€์ˆ˜๋Š” ์‚ฌ์ธ(sin), ์ง์ˆ˜๋Š” ์ฝ”์‚ฌ์ธ(cos)ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ๊ฐ ๋‹จ์–ด์— ๊ณ ์œ ํ•œ ์œ„์น˜ ์ •๋ณด๋ฅผ ๋ถ€์—ฌ
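
A small self-contained sketch of the sinusoidal encoding above (it assumes an even d_model; the sizes are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions: cosine
    return pe

# the encoding is simply added to the word embeddings before the first layer
print(positional_encoding(seq_len=4, d_model=8)[0])       # position 0: [0, 1, 0, 1, ...]
```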

🔎 Transformer Structure

Encoder

1๏ธโƒฃ Input Embedding: ์ž…๋ ฅ ๋‹จ์–ด๋ฅผ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๋ชจ๋ธ์ด ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜

2๏ธโƒฃ Positional Encoding:

  • Self-Attention์€ ์ˆœ์„œ๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ, ์‚ฌ์ธ(sin), ์ฝ”์‚ฌ์ธ(cos) ํ•จ์ˆ˜๋ฅผ ํ™œ์šฉํ•ด ๋‹จ์–ด ์ˆœ์„œ ์ •๋ณด๋ฅผ ์ถ”๊ฐ€

3๏ธโƒฃ Multi-Head Attention:

  • Self-Attention์„ ํ™œ์šฉํ•˜์—ฌ ์ž…๋ ฅ ๋ฌธ์žฅ์˜ ๋ชจ๋“  ๋‹จ์–ด ๊ฐ„ ๊ด€๊ณ„๋ฅผ ํ•™์Šต
  • ์—ฌ๋Ÿฌ ๊ฐœ์˜ Attention Head๋ฅผ ๋ณ‘๋ ฌ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์–‘ํ•œ ๋ฌธ๋งฅ์„ ๋ฐ˜์˜

4๏ธโƒฃ Add & Norm (์ž”์—ฌ ์—ฐ๊ฒฐ ๋ฐ ์ •๊ทœํ™”)

  • Residual Connection: prevents information loss in deep networks and keeps gradients flowing → mitigates the vanishing-gradient problem
  • Layer Normalization: keeps the distribution of each layer's activations stable, which speeds up and stabilizes training (a sketch of Add & Norm follows below)
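
A minimal sketch of the Add & Norm step, continuing the NumPy sketches above (the learnable gain and bias of full Layer Normalization are omitted):

```python
def layer_norm(x, eps=1e-6):
    """Normalize each word vector to zero mean and unit variance across its features."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer_out)

# e.g. wrapping the multi-head attention sub-layer from the earlier sketch
print(add_and_norm(X, multi_head_attention(X, heads)).shape)   # (4, 8)
```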

5๏ธโƒฃ Feed-Forward Network (FFN)

  • Transforms each word's representation non-linearly (two linear layers with a ReLU activation in between), enriching its meaning
  • Each word is processed independently, so this step parallelizes easily (see the sketch below)

6๏ธโƒฃ Layer Stacking:

  • ์—ฌ๋Ÿฌ ๊ฐœ์˜ Encoder ๋ธ”๋ก์„ ์Œ“์•„ ๊ณ ์ฐจ์›์ ์ธ ํŒจํ„ด ํ•™์Šต
  • ๋ฌธ์žฅ ๋‚ด ๋‹จ์–ด ๊ฐ„์˜ ๋ณต์žกํ•œ ๊ด€๊ณ„๋ฅผ ๊นŠ์ด ์žˆ๊ฒŒ ํ•™์Šต

Decoder

1๏ธโƒฃ Input Embedding: ๋””์ฝ”๋”์˜ ์ž…๋ ฅ์„ ๋ฒกํ„ฐํ™”ํ•˜์—ฌ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅํ•˜๋„๋ก ๋ณ€ํ™˜
2๏ธโƒฃ Positional Encoding: ๋‹จ์–ด ์ˆœ์„œ๋ฅผ ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ๋„๋ก ์ถ”๊ฐ€ ์ •๋ณด ์ œ๊ณต
3๏ธโƒฃ Masked Multi-Head Attention:

  • A look-ahead mask prevents each position from attending to future words → sentences are generated correctly, left to right
  • This forces the Decoder to predict the next word using only the words it has already produced (see the sketch below)
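
A sketch of the look-ahead (causal) mask, reusing softmax from the self-attention sketch: scores for future positions are set to -inf so their attention weights become zero after softmax:

```python
def causal_mask(seq_len):
    """Position i may attend only to positions <= i; future positions get -inf."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s strictly above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)

def masked_self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + causal_mask(X.shape[0])
    return softmax(scores, axis=-1) @ V                 # row i ignores words after position i
```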

4๏ธโƒฃ Encoder-Decoder Attention:

  • Encoder์—์„œ ์ƒ์„ฑ๋œ ์ •๋ณด์™€ Decoder์˜ ์ •๋ณด๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ์ ์ ˆํ•œ ์ถœ๋ ฅ์„ ์ƒ์„ฑ
  • ์ž…๋ ฅ ๋ฌธ์žฅ๊ณผ ์ถœ๋ ฅ ๋ฌธ์žฅ์ด ์–ด๋–ป๊ฒŒ ์—ฐ๊ฒฐ๋˜๋Š”์ง€ ํ•™์Šต
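
A sketch of this cross-attention step, continuing the code above: the only change from self-attention is that the Queries come from the decoder while the Keys and Values come from the encoder output (names and shapes are assumptions):

```python
def encoder_decoder_attention(dec_x, enc_out, W_q, W_k, W_v):
    """Each target position (Query) attends over the whole source sentence (Keys/Values)."""
    Q = dec_x @ W_q                        # queries from the decoder state
    K, V = enc_out @ W_k, enc_out @ W_v    # keys and values from the encoder output
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V    # (target_len, d_k)
```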

5๏ธโƒฃ Add & Norm (์ž”์—ฌ ์—ฐ๊ฒฐ ๋ฐ ์ •๊ทœํ™”)

  • Residual Connection์„ ํ†ตํ•ด ์ •๋ณด ์†์‹ค์„ ๋ฐฉ์ง€ํ•˜๊ณ , Layer Normalization์œผ๋กœ ์•ˆ์ •์ ์ธ ํ•™์Šต ์œ ๋„

6๏ธโƒฃ Feed-Forward Network (FFN)

  • ๋‹จ์–ด์˜ ํŠน์ง•์„ ๊ฐ•ํ™”ํ•˜๊ณ , ๋ณต์žกํ•œ ๋ณ€ํ™˜์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๋” ๋‚˜์€ ํ‘œํ˜„ ํ•™์Šต

7๏ธโƒฃ Output Layer (Softmax & Linear Projection)

  • A final linear projection maps each position to the vocabulary, and Softmax turns the result into a probability distribution over the next word (see the sketch below)
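
A sketch of the output layer, reusing softmax and the toy tensors from above (the vocabulary size is an assumed toy value):

```python
def output_layer(dec_out, W_vocab, b_vocab):
    """Linear projection to vocabulary logits, then softmax: one distribution per position."""
    logits = dec_out @ W_vocab + b_vocab          # (target_len, vocab_size)
    return softmax(logits, axis=-1)

vocab_size = 10                                   # toy vocabulary
W_vocab, b_vocab = rng.normal(size=(8, vocab_size)), np.zeros(vocab_size)
probs = output_layer(X, W_vocab, b_vocab)         # X stands in for the decoder output here
print(probs.sum(axis=-1))                         # each row sums to 1
```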

8๏ธโƒฃ Layer Stacking:

  • As in the Encoder, several Decoder blocks are stacked, enabling more refined sentence generation

💡 In short, the Encoder understands the source sentence and the Decoder generates the translation from that understanding

Transformer: Limitations and Current Status

Limitations

  • Computational cost: Self-Attention has O(n^2) complexity, so processing long sequences is expensive (a rough estimate follows below)
  • Data dependence: large amounts of training data are required, and performance can degrade on small datasets

Current status (as of January 2025)

  • A single 2017 paper completely changed the NLP paradigm
  • It overcame the limitations of RNNs/LSTMs, and Self-Attention-based architectures became the standard
  • Today's large language models (LLMs), such as OpenAI's GPT, Google's BERT, and Meta's LLaMA, are all Transformer-based
