Attention is All You Need (2017)

J. · November 30, 2025

Text & Speech Papers


✔ Basic Info

📌 Attention Is All You Need (2017)
🔗 https://arxiv.org/pdf/1706.03762

This model is so foundational to modern deep learning that this review focuses on the model architecture rather than on each individual experiment.

☑️ Transformer Model Architecture

  • The Transformer is built from an Encoder and a Decoder, each a stack of 6 identical layers.

☑️ Positional Encoding (PE)

The Transformer has no built-in notion of word order, so sin/cos-based positional encoding vectors are added to the input and output embedding vectors to inject position information.

  • Along the embedding dimension (d_model), even indices hold sin values and odd indices hold cos values.

  • pos: position of the word in the sentence (0, 1, 2, 3, …)
  • i: index of the sin/cos pair along the embedding vector. Without any normalization, the denominator 10000^(2i) would grow so fast that the angle would collapse to 0 for most dimensions.
    To prevent this, the exponent is divided by d_model, the embedding vector length, so that 2i/d_model stays in [0, 1).
  • Why alternate sin and cos: so that positions get distinct encodings, even in long sentences
  • d_model: the model's embedding size
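The definitions above correspond to the paper's formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Below is a minimal sketch in plain Python; the function name `positional_encoding` and the parameter `max_len` are illustrative choices, not names from the paper.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for k in range(d_model):
            i = k // 2  # dimensions 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            # even index -> sin, odd index -> cos
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each row is the vector for one pos, and each column pair (2i, 2i+1) uses one wavelength, so pos selects the row while i selects the frequency.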

Example: handwritten notes showing the difference between i and pos

☑️ Multi-Head Attention

Rather than a single head, several heads each perform Q, K, V attention -> the model learns different representations of the sentence from different perspectives.

| Type | Where it is used | Self-Attention? | Source of Q/K/V |
|---|---|---|---|
| ① Encoder Self-Attention | Encoder | ✅ | Q=K=V=Encoder input |
| ② Decoder Self-Attention (Masked) | Decoder | ✅ | Q=K=V=Decoder input |
| ③ Encoder–Decoder Cross-Attention | Decoder | ❌ Not self | Q=Decoder, K=V=Encoder |

Q,K,V

  • Q (query): what we want to find out
  • K (key): the label each word holds up saying "this is what I'm about!" (a hint about the actual value V)
  • V (value): the content vector (the actual information)

Q = XW_Q
K = XW_K
V = XW_V

Q, K, and V all start from the same embedding, even for a single token, and are transformed by different weight matrices (W_Q, W_K, W_V) so that each takes on its own role.

Self-Attention

Captures the relationships between words within one sequence. Here Q, K, and V all come from the same input X.

Cross-Attention

Q: decoder side (the sentence being generated)
K,V: encoder side (the input sentence)
For the word currently being generated (Q), it decides which parts of the input sentence (K,V) to attend to.

Attention

The dot product of Q and K measures how similar they are.
Softmax turns these scores into weights that decide how much of the actual information V to pull in.

ex.
output(love) = 0.1V(I) + 0.6V(love) + 0.3V(cats): the weights come from the q,k similarity and are multiplied with the actual information V.

  • Why divide by √dk: the scores are divided by the square root of dk, the dimension of the key vectors (not the embedding size itself). As dk grows, the dot products grow in magnitude, which pushes softmax into regions with extremely small gradients, so the scores are scaled down before softmax.
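The steps above (dot product, scaling by √dk, softmax, weighted sum of V) can be sketched in plain Python as follows. The vectors for "I", "love", "cats" are made-up toy numbers, so the resulting weights will not reproduce the 0.1 / 0.6 / 0.3 in the example; the helper names `softmax` and `attention` are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy (assumed) vectors for the tokens I, love, cats
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

out, weights = attention(Q, K, V)
for token, w in zip(["I", "love", "cats"], weights):
    print(token, [round(x, 3) for x in w])
```

Because V is chosen as one-hot rows here, each output vector is exactly the attention weight distribution for that token, which makes the "how much of each V" reading easy to inspect.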

Multi-Head Attention

  • Q, K, V are each passed through different Linear Layers to create different perspectives (Heads); attention is computed in each head and the results are concatenated.

  • It looks at Q, K, V from several viewpoints at once and combines them.

  • The same input vector is projected by a different linear transformation for each head, producing several lower-dimensional vectors.

  • For example, if the Q, K, V vectors are 512-dimensional, each head projects the full 512-dimensional vector down to a 64-dimensional one (8 heads × 64 = 512 in the paper).

  • The goal is feature extraction, not preserving the original as-is, which is why this approach works well.

  • More heads are not always better. Still, having several heads helps pull out the key features, which helps training.
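To make the project / attend / concat flow concrete, here is a small sketch with d_model = 8 and 2 heads of size 4 (the paper uses 512 and 8 heads of 64). The weights are random placeholders rather than trained values, and `W_O` is the output projection the paper applies after the concat; all helper names are illustrative.

```python
import math
import random

def matmul(A, B):
    """Multiply matrix A (n x m) by matrix B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    # each head has its own W_Q, W_K, W_V projecting d_model -> d_k
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # concatenate the head outputs per token, then apply the output projection
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
d_model, num_heads = 8, 2
d_k = d_model // num_heads
X = rand_matrix(3, d_model, rng)  # 3 tokens
heads = [tuple(rand_matrix(d_model, d_k, rng) for _ in range(3))
         for _ in range(num_heads)]
W_O = rand_matrix(num_heads * d_k, d_model, rng)

Y = multi_head_attention(X, heads, W_O)
print(len(Y), len(Y[0]))  # 3 8
```

The output has the same shape as the input (3 tokens × d_model), which is what lets the layer be stacked and wrapped in a residual connection.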

Masked Multi-Head Attention

  • Multi-head attention = self-attention with several heads, so that diverse relationships and information can be represented.

  • Having each head look for a different pattern is the same as before, but in the decoder the words that come later are hidden, so the model has to predict what follows.

  • During training the model is given input-output pairs with the correct answer, so masking keeps it from learning by peeking at the answer.

  • At inference time the same rule holds: each token is generated while seeing only the previous tokens.

ex. [I] (cannot see "today", "meal")
[I], [today] (cannot see "meal")
[I], [today], [meal] (cannot see anything after this)

  • When masking, masked_scores = scores + mask
    Positions that stay visible get mask = 0, and masked positions get -∞, which is then added to the scores.
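A minimal sketch of masked_scores = scores + mask for a 3-token sequence, in plain Python. The score values are arbitrary toy numbers and `causal_mask` is an illustrative helper name; the point is only that -∞ entries turn into weight 0 after softmax.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """0 where position j is visible from position i (j <= i), -inf otherwise."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

# Toy (assumed) attention scores for 3 tokens
scores = [[0.5, 1.2, -0.3],
          [0.1, 0.4, 2.0],
          [1.0, 0.0, 0.7]]

mask = causal_mask(3)
masked_scores = [[s + m for s, m in zip(srow, mrow)]
                 for srow, mrow in zip(scores, mask)]
weights = [softmax(row) for row in masked_scores]

for row in weights:
    print([round(w, 3) for w in row])
```

The first row puts all of its weight on the first token, and every entry above the diagonal is exactly 0, matching the [I] / [I], [today] / [I], [today], [meal] pattern above.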

☑️ Residual Connection (Add in the figure)

After the attention computation and after the FFNN, the original input x is added back.

output = LayerNorm(x + Sublayer(x))
That is, the input x and the value that passed through the layer are summed.
Why: to mitigate the vanishing gradient problem and to improve training speed and convergence
F(x) + x

☑️ Layer Normalization (Norm in the figure)

x: the whole hidden vector of one token (e.g. 512 dimensions)
E[x]: mean of that vector
Var(x): variance of that vector
γ, β: learnable scale and shift parameters
ε: a small value that keeps the denominator from being 0

In other words, Layer Normalization normalizes each token's vector on its own.

Repeating Attention - FFN - Residual Connection makes the values prone to blowing up.
Layer Normalization stabilizes the scale of the values at each step, which makes training feasible.
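With the symbols defined above, the standard layer normalization formula is LN(x) = γ · (x − E[x]) / sqrt(Var(x) + ε) + β, applied to one token vector at a time. A minimal plain-Python sketch follows; the default `eps=1e-5` is an assumed value, and γ and β are fixed to 1 and 0 here instead of being learned.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize a single token vector x, then scale by gamma and shift by beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]  # hidden vector of one token (toy size)
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 3) for v in y])
```

After normalization the vector has mean close to 0 and variance close to 1, regardless of how large the incoming values were.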

☑️ Final Thoughts

Revisiting the Transformer architecture, I am more convinced than ever that it is the bible of modern deep learning architectures. Now that I have studied it a bit, I can see that not just attention but the Transformer structure as a whole shows up as an idea in just about any deep learning project.

It is such an important architecture that I need to come back and chew through it again whenever I start to forget it. Every time I look, I notice something new.

When I first joined the club this year without knowing anything and worked on a project, this architecture felt impossibly hard, so it is a little amazing that it now makes sense. In many ways this paper makes me feel how much has changed.
I always thought a year was short, but it turns out to be much longer than I thought.

Follow-up questions resolved (2026/03/26)

  • The mask is additive and takes the same form as a padding mask (0 or -∞).
  • Since masked score = score + mask, unmasked positions get 0 and masked positions get -∞.
  • After softmax, a -∞ score becomes a weight of 0.

  • The Attention module combines tokens using nonlinearly computed weights; the FFN transforms the representation itself nonlinearly.

  • Role of the FFN:
    - Attention is needed to work out how pieces of information relate to each other; the FFN decides how to transform the representation obtained from that lookup.

    • It can build a richer vector representation (feature representation).
    • The Transformer's FFN is a 2-layer MLP applied independently at each token position.
  • Add & Norm

Add is the Residual Connection (F(x) + x), i.e. the attention output + the original information.

Norm is Layer Norm.

Why: if one value is extremely large and computation continues without normalization, the excessive spread between values degrades the overall quality of the representation.

Every Transformer layer includes LayerNorm, and that LayerNorm normalizes each token vector independently.
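The FFN and Add & Norm notes can be put together in one small sketch. It assumes the paper's FFN(x) = max(0, xW1 + b1)W2 + b2 with toy sizes (d_model = 4, d_ff = 8 instead of 512 and 2048), random placeholder weights, and a LayerNorm with γ = 1, β = 0; the names `ffn` and `add_and_norm` are illustrative.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """LayerNorm over one token vector (gamma = 1, beta = 0 for brevity)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ffn(x, W1, b1, W2, b2):
    """2-layer MLP with ReLU, applied to a single token vector."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def add_and_norm(x, sublayer_out):
    """Residual connection (Add) followed by LayerNorm (Norm)."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

rng = random.Random(0)
d_model, d_ff = 4, 8
W1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

tokens = [[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 0.5, 0.0]]
# the same FFN weights are applied to every token position independently
out = [add_and_norm(x, ffn(x, W1, b1, W2, b2)) for x in tokens]
print(len(out), len(out[0]))  # 2 4
```

Since each token is processed on its own, running a token through this block alone gives the same result as running it alongside other tokens; only attention mixes information across positions.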

Additionally, why Positional Encoding is needed

  • During self-attention, the token-to-token rel…

  • The whole sentence sequence is processed at once (through self-attention), so order information has to be supplied explicitly.

  • Self-attention only processes relations between items in the sequence, so absolute and relative positions are lost; that is why PE is used.
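One way to see why order information is lost: without positional encoding, shuffling the input tokens just shuffles the outputs in the same way, so the layer cannot tell "which token came first". The sketch below uses Q = K = V = X (identity projections, a simplification) and toy vectors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Self-attention with Q = K = V = X (no learned projections, for brevity)."""
    d_k = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, X)) for j in range(d_k)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # three toy token vectors
perm = [2, 0, 1]                           # a reordering of the sentence
X_shuffled = [X[p] for p in perm]

Y = self_attention(X)
Y_shuffled = self_attention(X_shuffled)

# every token gets the same output vector no matter where it sits in the sequence
same = all(math.isclose(a, b, abs_tol=1e-9)
           for idx, p in enumerate(perm)
           for a, b in zip(Y_shuffled[idx], Y[p]))
print(same)  # True
```

Adding a position-dependent vector to each token before attention breaks this symmetry, which is the role PE plays.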
