📑 Attention Is All You Need
🔗 reference link

Summary!

💭 Self-Attention is a way of computing the similarity between the words of a sentence with respect to that same sentence,
and there are many kinds of attention functions.

The paper uses the scaled dot-product function,
and, using Q, K, V vectors reduced to N dimensions,
runs N attention operations in parallel and merges the results via concatenation:

this is the Multi-head Attention method.

 

1. Self Attention 👀

💡 Compute the similarity between the words within the input sentence, i.e., with respect to the sentence itself, to capture its context

The `Q, K, V` of Self Attention
 = all the word vectors of the input sentence
  • Compute the similarity of a Query against every Key!

 

1-1. Self Attention

Self Attention example

Basic Attention function

The concept of Attention

  • For a Query, compute its similarity against every Key
    similarity = weight, applied to the Value paired with each Key

    • Return the weighted sum of the Values (see the sketch below)
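
To make the idea concrete, here is a minimal NumPy sketch of this basic (unscaled) dot-product attention for a single query; the vectors and their values are made up purely for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy example: one Query, and one Key/Value pair per word of the sentence.
q = np.array([1.0, 0.0])                             # Query vector
K = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])   # Key vectors (one row per word)
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # Value vectors (one row per word)

scores = K @ q             # similarity of the Query to every Key
weights = softmax(scores)  # similarities become weights
context = weights @ V      # weighted sum of the Values
print(weights, context)
```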

 

1-2. The Q, K, V vectors in Self Attention

Self Attention example: word vectors, where W denotes the weight matrices

  • Self Attention is performed on the word vectors of the input sentence
    • Each $d_{model}$-dimensional word vector is converted into Q, K, V vectors
  • In the paper, $d_{model} = 512$-dimensional word vectors are converted into 64-dimensional Q, K, V vectors (a sketch follows below)
    • num_heads = 8
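
As a minimal sketch of this conversion, the snippet below projects one 512-dimensional word vector into 64-dimensional q, k, v vectors; the random matrices are only stand-ins for the learned weight matrices $W^Q, W^K, W^V$ of a single head.

```python
import numpy as np

d_model, num_heads = 512, 8
d_k = d_model // num_heads              # 64, as in the paper

rng = np.random.default_rng(0)
x = rng.normal(size=(d_model,))         # one word vector of the input sentence

# Random stand-ins for the learned projection matrices W^Q, W^K, W^V of one head.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

q, k, v = x @ W_q, x @ W_k, x @ W_v
print(q.shape, k.shape, v.shape)        # (64,) (64,) (64,)
```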

 


2. Scaled dot-product Attention ➗

💡 Attention mechanism

  1. Each Q vector computes an attention score, i.e. a similarity, against every K vector

  2. The attention scores are used to take a weighted sum over all the V vectors (formalized just below)

    Return a `Context vector`
  • There are many kinds of attention functions
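
Written out as a formula (a generic paraphrase of the two steps above, leaving the choice of score function open), the context vector for one query $q$ given keys $k_j$ and values $v_j$ is:

$$context(q) = \sum_j softmax_j\big(score(q, k_j)\big)\, v_j$$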

 

2-1. Scaled dot-product Attention

Scaled dot-product example (128 and 32 are arbitrary numbers)

  • How the computation works (a sketch follows after this list)

    Input
    [ I am a student ]

    1. Dot-product between vectors

      • Use the Q, K, V vectors converted from the word vectors of the input sentence
      • The Q vector of the word 'I' is multiplied with every K vector
    2. Scaling

      • Divide by $\sqrt{d_k} = 8$, where $d_k = d_{model} / num\_heads$
      • The result expresses how related the word 'I' is to each of 'I', 'am', 'a', 'student'
        • Return the Attention scores
    3. Attention value

      • Apply Softmax to the Attention scores
      • Take the weighted sum of the V vectors corresponding to each word
        • Return the Context vector
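
A minimal NumPy sketch of steps 1 to 3 for the query of the word 'I', assuming each of the four words already has 64-dimensional q, k, v vectors (random values here, purely for illustration).

```python
import numpy as np

d_k = 64                                  # d_model / num_heads = 512 / 8
words = ["I", "am", "a", "student"]
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, d_k))             # one q vector per word
K = rng.normal(size=(4, d_k))             # one k vector per word
V = rng.normal(size=(4, d_k))             # one v vector per word

q_I = Q[0]                                # Query of the word 'I'
scores = (K @ q_I) / np.sqrt(d_k)         # 1. dot products  2. scale by sqrt(d_k) = 8
weights = np.exp(scores) / np.exp(scores).sum()   # 3. softmax over the attention scores
context_I = weights @ V                   # weighted sum of the V vectors -> context vector
print(dict(zip(words, weights.round(3))), context_I.shape)
```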

 

2-2. Batch computation with matrices

๊ฐ ๋‹จ์–ด์— ๋Œ€ํ•ด ๋ฒกํ„ฐ ์—ฐ์‚ฐ์„ ํ•˜๋Š” ๋Œ€์‹ , ๋ฌธ์žฅ ๋‹จ์œ„์˜ ํ–‰๋ ฌ์„ ์ด์šฉํ•˜์—ฌ ์—ฐ์‚ฐ

  • Plug matrices into the computation above in place of the vectors
    • Return a Context matrix

$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Visualizing the formula above with matrices

Example of converting to the Q, K, V matrices

Self Attention applied to the sentence matrix instead of individual word vectors

  • Summary of the matrix sizes used in the formula (a shape-checking sketch follows below)

    $seq\_len$ : length of the input sentence
    Sentence matrix : $(seq\_len,\ d_{model})$
    $d_k$ : dimension of the Q and K vectors
    $d_v$ : dimension of the V vectors
    Q and K matrices : $(seq\_len,\ d_k)$
    V matrix : $(seq\_len,\ d_v)$

    From the assumptions above, the sizes of the weight matrices can be inferred:
    $W^Q,\ W^K$ : $(d_{model},\ d_k)$
    $W^V$ : $(d_{model},\ d_v)$

    Since the paper sets $d_k = d_v = d_{model}/num\_heads$,
    Attention Value Matrix : $(seq\_len,\ d_v)$
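
The same computation in matrix form, as a small sketch that also checks the sizes listed above; the function name and the random weights are only for illustration, with seq_len = 4, $d_{model} = 512$, $d_k = d_v = 64$ as example values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (seq_len, d_v)

seq_len, d_model, num_heads = 4, 512, 8
d_k = d_v = d_model // num_heads
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))                   # sentence matrix
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

context = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(context.shape)                                      # (4, 64) = (seq_len, d_v)
```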

 


3. Multi-head Attention 🤯

💡 Why Multi-head?

Multiple Attention heads = `multiple viewpoints`

In other words, the goal is to gather similarity information from a variety of viewpoints.

 

3-1. Multi-head

Multi-head Attention example; the number of Attention heads = num_heads

  • Run one attention computation in parallel for each Attention head

    • Each attention value matrix $a_n$ is an Attention head
      • Every Attention head has its own weight matrices $W^Q, W^K, W^V$, each with different values

 

  • Computation procedure (a sketch follows after this list)

    1. Perform the attention computations in parallel, each with its own weight matrices

      • Return the Attention heads $(a_0,\ \dots,\ a_{num\_heads})$
    2. Concatenate all the Attention heads

      • Return the concatenated matrix

        $(seq\_len,\ d_{model})$

    3. Multiply by the weight matrix $W^O$

      • Return the Multi-head Attention matrix $(seq\_len,\ d_{model})$

        • Keeps the same size as the input sentence matrix $(seq\_len,\ d_{model})$
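
A minimal NumPy sketch of the three steps above with num_heads = 8; every head gets its own random stand-in weight matrices, and the final shape check illustrates that the output size matches the input.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

seq_len, d_model, num_heads = 4, 512, 8
d_k = d_model // num_heads
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))                  # input sentence matrix

heads = []
for _ in range(num_heads):                               # 1. parallel attention per head
    W_q, W_k, W_v = rng.normal(size=(3, d_model, d_k))   # different weights for every head
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

concat = np.concatenate(heads, axis=-1)                  # 2. concatenation -> (seq_len, d_model)
W_o = rng.normal(size=(d_model, d_model))                # stand-in for W^O
output = concat @ W_o                                    # 3. multiply by W^O
print(output.shape == X.shape)                           # True: the matrix size is preserved
```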

💭 Why must the matrix size be preserved?

- Transformer: a structure that stacks 6 `encoder` layers of identical shape
    ⇒ the output must be able to feed into the next `encoder` again