๐Ÿ“ Week 2: Transformer... Attention!

oceann · Aug 16, 2024

💻 Naver Boostcamp AI Tech 7th Cohort, NLP


Am I the only one who listens to NewJeans' "Attention" while studying Attention~
I had NewJeans on repeat all day while I was glued to writing this up~🎵🔊 (tmi)

💡
I can't believe we're already learning the Transformer this week...
Back in my undergrad research-student days(?) I once attempted Attention Is All You Need, but the concepts of Q, K, and V never quite clicked, so I remember moving on with only a fuzzy understanding.
During the lectures, the spots where I went "then what about this?" or "ooh, what about that?👀" were explained as if the instructor had anticipated them, which was fascinating, and I was grateful. How much would I have to study before I could anticipate things like that...ㅋㅋ..ㅠ
Still, learning something new is always a joy, so it's fun!
Next week we're doing a paper review with the team, so I should study the concepts the lectures skimmed through the paper and learn them properly. This time I will absolutely nail down the Attention concept!!🔥


Assumptions of Linear Regression

Before starting!!! I added notes on linearity, independence, homoscedasticity, and normality to 🔗Week 1, so follow the link and check them out!
They aren't crucial for today's content; I just connected them to last week's post while re-organizing what we learned.
It's for my own records, so feel free to ignore!


Transformer Introduction

โญAttention Is All You Needโญ ๋…ผ๋ฌธ์—์„œ ๋“ฑ์žฅํ•œ Transformer ๋ชจ๋ธ์€ encoder์™€ decoder ๊ตฌ์กฐ๋กœ, ๋ฌธ์žฅ๊ณผ ๊ฐ™์€ ์ˆœ์ฐจ์  ๋ฐ์ดํ„ฐ์—์„œ ๊ฐ ๋‹จ์–ด์˜ ๊ด€๊ณ„๋ฅผ ์ถ”์ถœํ•˜์—ฌ ๋งฅ๋ฝ๊ณผ ์˜๋ฏธ๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๋ชจ๋ธ์ด๋‹ค. ๊ธฐ์กด์— ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋ชจ๋ธ์€ RNN ๊ธฐ๋ฐ˜(LSTM, GRU ๋“ฑ)์ด๊ฑฐ๋‚˜ Convolution์„ ์‚ฌ์šฉํ–ˆ์—ˆ๋Š”๋ฐ, Transformer ๋ชจ๋ธ์€ ์ด ๊ตฌ์กฐ์—์„œ ๋ฒ—์–ด๋‚˜ ์ˆœ์ „ํžˆ self-attention์— ๊ธฐ๋ฐ˜ํ•œ ๋ชจ๋ธ์ด๋‹ค.
Transformer ๋ชจ๋ธ์„ ์ •ํ™•ํžˆ ์•Œ๊ธฐ ์œ„ํ•ด ํŒ€์›๋“ค๊ณผ Attention Is All You Need ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ๋ฅผ ํ•˜๊ธฐ๋กœ ํ–ˆ๋Š”๋ฐ, ๊ทธ ์ „์— Transformer ๋ชจ๋ธ์ด ๋“ฑ์žฅํ•˜๊ฒŒ ๋œ ๋ฐฐ๊ฒฝ๋ถ€ํ„ฐ ์ฐจ๊ทผํžˆ ์ •๋ฆฌํ•ด๋ณด๊ณ ์ž ํ•œ๋‹ค.

RNN(Recurrent Neural Network)

This is the model you could call the progenitor of sequential data processing.

Source: https://www.analyticsvidhya.com/blog/2022/03/a-brief-overview-of-recurrent-neural-networks-rnn/

๊ทธ๋ฆผ์—์„œ ์ฒซ ๋ฒˆ์งธ ์…€์„ ๋ณด๋ฉด, input x๊ฐ€ hidden state์— ์ž…๋ ฅ๋œ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค์‹œ hidden state๋กœ ์ „๋‹ฌ๋œ๋‹ค. ์ด ๊ณผ์ •์ด ๋ฐ”๋กœ Recurrentํ•œ ๊ณผ์ •์ด๋‹ค.
์ด์ „ ๋‹จ๊ณ„ t-1์˜ hidden state๋ฅผ ๋‹ค์Œ ๋‹จ๊ณ„ t์˜ hidden state๋กœ ์ „๋‹ฌํ•˜๋ฉฐ, ์ด๋•Œ ๊ณผ๊ฑฐ์˜ ์ •๋ณด๊ฐ€ ์ดํ›„์˜ timestep๊นŒ์ง€ ์ „ํ•ด์ง€๊ธฐ ๋•Œ๋ฌธ์— squencetialํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด๋‹ค.
์ด๋ ‡๊ฒŒ input x๊ฐ€ hidden state์— ์ž…๋ ฅ๋  ๋•Œ, hidden state์—์„œ hidden state๋กœ ์ „๋‹ฌ๋  ๋•Œ, hidden state์—์„œ ์ถœ๋ ฅ์„ ๋‚ผ ๋•Œ ๊ฐ๊ฐ ์•„๋ž˜์™€ ๊ฐ™์€ ๊ฐ€์ค‘์น˜๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

$W_{xh}, W_{hh}, W_{hy}$

๊ฐ ๊ณ„์ธต(xx, hh, yy)๋ผ๋ฆฌ๋Š” ๊ฐ€์ค‘์น˜๋ฅผ ๊ณต์œ ํ•˜์ง€๋งŒ, ์„œ๋กœ ๋‹ค๋ฅธ ๊ณ„์ธต๋ผ๋ฆฌ๋Š” ๊ฐ€์ค‘์น˜๋ฅผ ๊ณต์œ ํ•˜์ง€ ์•Š๋Š”๋‹ค.
RNN์˜ ๋ฌธ์ œ๋Š” ์—ฌ๊ธฐ์„œ ๋ฐœ์ƒํ•œ๋‹ค. ์ž…๋ ฅ์˜ sequence ๊ธธ์ด๋Š” ๋ชจ๋ธ์˜ depth์ฒ˜๋Ÿผ ์ž‘์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— sequence๊ฐ€ ๋„˜์–ด๊ฐˆ ๋•Œ๋งˆ๋‹ค hidden state๋ฅผ ํ†ต๊ณผํ•  ๋•Œ WhhW_{hh}๊ฐ€ ๋ฐ˜๋ณต์ ์œผ๋กœ ๊ณฑํ•ด์ง„๋‹ค. ๋”ฐ๋ผ์„œ Back Propagation์„ ํ•  ๋•Œ ์ด WhhW_{hh}๊ฐ€ ๋ฐ˜๋ณต์ ์œผ๋กœ ๊ณฑํ•ด์ง€๊ธฐ ๋•Œ๋ฌธ์— ํ•ด๋‹น ๊ฐ’์˜ ํฌ๊ธฐ์— ๋”ฐ๋ผ Vanishing/Exploding Gradient Problem์ด ๋ฐœ์ƒํ•œ๋‹ค.
๋จผ์ €, Vanishing Gradient Problem์˜ ๊ฒฝ์šฐ, WhhW_{hh}๊ฐ€ 0์— ๊ฐ€๊นŒ์šด ๊ฐ’์„ ๊ฐ€์งˆ ๋•Œ ๋ฐœ์ƒํ•œ๋‹ค. ๋งˆ์ง€๋ง‰ timestep t์—์„œ ์ฒซ ๋ฒˆ์งธ timestep์œผ๋กœ ๊ฐ€๋Š” ๋™์•ˆ WhhW_{hh}๊ฐ€ ์ด t-1๋ฒˆ ๊ณฑํ•ด์ง€๋Š”๋ฐ, ์ด๋•Œ ์•ž ๋‹จ์˜ layer์˜ ๊ฐ’์ด ์ ์  ์ž‘์•„์ง€๊ธฐ ๋•Œ๋ฌธ์— ์ œ๋Œ€๋กœ ๋œ ์ •๋ณด ์ „๋‹ฌ์ด ์ด๋ฃจ์–ด์ง€์ง€ ์•Š๋Š”๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ ๊ณผ๊ฑฐ ์ •๋ณด์˜ ์†Œ์‹ค์ด ๋ฐœ์ƒํ•œ๋‹ค.
๋ฐ˜๋Œ€๋กœ, Exploding Gradient Problem์€ WhhW_{hh}์˜ ๊ฐ’์ด ๋งค์šฐ ํด ๋•Œ ๋ฐœ์ƒํ•œ๋‹ค. Back Propagation์„ ์ˆ˜ํ–‰ํ•  ๋•Œ ๋ฐ˜๋ณต์ ์œผ๋กœ ํฐ ๊ฐ’์ด ๊ณฑํ•ด์ง€๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€์ค‘์น˜๊ฐ€ ๋งค์šฐ ํฐ ๊ฐ’์œผ๋กœ ๊ฐฑ์‹ ๋˜์–ด ํ•™์Šต ๊ณผ์ •์ด ๋ถˆ์•ˆ์ •ํ•ด์ง„๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ ๋ชจ๋ธ์ด ๋ฐœ์‚ฐํ•œ๋‹ค.
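To make the recurrence concrete, here is a minimal sketch of a vanilla RNN step in PyTorch; the dimensions, initialization scale, and tanh activation are my own assumptions for illustration, not from the lecture:

```python
import torch

# Hypothetical sizes for illustration.
d_x, d_h, d_y = 8, 16, 4
W_xh = torch.randn(d_h, d_x) * 0.1  # input  -> hidden
W_hh = torch.randn(d_h, d_h) * 0.1  # hidden -> hidden (shared across timesteps)
W_hy = torch.randn(d_y, d_h) * 0.1  # hidden -> output

def rnn_step(x_t, h_prev):
    # The same three weight matrices are reused at every timestep.
    h_t = torch.tanh(W_xh @ x_t + W_hh @ h_prev)
    y_t = W_hy @ h_t
    return h_t, y_t

h = torch.zeros(d_h)
for x_t in torch.randn(20, d_x):  # a length-20 input sequence
    h, y = rnn_step(x_t, h)
# Backpropagating through these 20 steps multiplies by (the Jacobian of)
# W_hh again and again, which is exactly where gradients vanish or explode.
```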

LSTM

RNN์˜ Vanishing/Exploding Gradient Problem์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด LSTM์ด ๋“ฑ์žฅํ•œ๋‹ค. LSTM์€ ๊ธฐ๋ณธ์ ์œผ๋กœ RNN์˜ ๊ตฌ์กฐ์™€ ๊ฐ™์œผ๋‚˜, ์žฅ๊ธฐ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์œ„ํ•œ cell state์™€ ์„ธ ๊ฐ€์ง€์˜ gate๊ฐ€ ์ถ”๊ฐ€๋˜์—ˆ๋‹ค.

Source: https://docs.likejazz.com/lstm/

Cell State $c$
With the current timestep as t, this is the information carried from the past up to the present.
The three gates described below determine the cell state.

Forget Gate $f_t$
Decides how much of the past memory to forget.
That is, when passing $c_{t-1}$ on to $c_t$, it decides how much of the past memory to reflect.
The larger this value, the more important the past information is, and the more it should be carried into the future.

Input Gate $i_t$
Decides how much of the current information to reflect in the cell state.
The larger this value, the more the current information should be remembered long-term and carried forward.

Output Gate $o_t$
Decides the output at the current state.
For the output, $f_t$ decides how much past memory to reflect, and $o_t$ decides how much of the current input $x_t$ to reflect.

As a result, the LSTM can preserve memory over a longer term than the RNN. If $f_t = 1$ and $i_t = 0$, all past information is carried into the future and the current information is not reflected at all, so the cell state is preserved indefinitely. (Then if $i_t = 1$, does the LSTM behave like an RNN?)
However, because the LSTM carries the cell state and the three gates above, it needs more computation and is inefficient. The GRU is a model that improves on this!

GRU

Source: https://itrepo.tistory.com/40

๊ทธ๋ฆผ์˜ ์ƒ๋‹จ์— htโˆ’1h_{t-1}๋ถ€ํ„ฐ hth_t๊นŒ์ง€ ์ด์–ด์ง€๋Š” ๋ผ์ธ์ด LSTM์˜ cell state ์—ญํ• ์„ ํ•œ๋‹ค. ๋˜ํ•œ LSTM๊ณผ ๋‹ฌ๋ฆฌ GRU์—๋Š” ๋‘ ๊ฐ€์ง€ gate๊ฐ€ ์กด์žฌํ•œ๋‹ค.

Reset Gate $r_t$
Plays the same role as the LSTM's Forget Gate.

Update Gate $z_t$
Decides how much of the input $x_t$ at the current timestep $t$ to pass into long-term memory. This is the same process as the LSTM's Input Gate.

As the figure shows, $h_{t-1}$, i.e. the past information, splits into two paths. One passes through the Reset Gate and is reflected in the future information, and the other combines with the current timestep's input $x_t$ to determine the output.
As a result, in the GRU, $h_t$ plays the roles of both the LSTM's cell state and its hidden state at once, making it a lighter and more efficient model.

Seq2Seq

The RNN-family models we studied above, even when used in one-to-many, many-to-one, or many-to-many configurations, map each input to an output in order. Now suppose we use such a sequential model to perform translation.
The sentence "I love you" should be translated to "나는 너를 사랑해". Going by word order alone, we would get I = 나는, love = 너를, you = 사랑해, but Korean not only has particles such as 은(는) and 을(를), its word order, i.e. its sentence structure, also differs from English, so a one-to-one word-level translation is clearly not correct. Variables like this surely exist across many other languages besides Korean and English.
The Seq2Seq model can therefore be built by combining the RNN-family models above in a particular way.

Encoder & Decoder

Source: https://wikidocs.net/24996 (the figure uses LSTMs, but other kinds of RNNs can be used depending on the purpose)

Seq2Seq ๋ชจ๋ธ์€ ์ž…๋ ฅ ๋‹จ์˜ ์…€๊ณผ ์ถœ๋ ฅ์ด ๋Œ€์‘๋œ๋‹ค๋Š” ํ‹€์—์„œ ๋ฒ—์–ด๋‚œ๋‹ค. ๋”ฐ๋ผ์„œ ํ•œ ๋‹จ์–ด, ํ˜น์€ ํ˜„์žฌ๋ถ€ํ„ฐ ์ด์ „๊นŒ์ง€์˜ ๋‹จ์–ด๋“ค๋งŒ ๋ณด๊ณ  ๋ฒˆ์—ญ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ input sequence๋ฅผ ์ „๋ถ€ ํ™•์ธํ•œ ํ›„ ๋ฒˆ์—ญ์„ ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด๋‹ค. ์‚ฌ๋žŒ์ด ๋ฌธ์žฅ ์ „์ฒด๋ฅผ ๋ณด๊ณ  ๋ฒˆ์—ญ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™์€ ๋ฐฉ์‹์ด๋‹ค. ์ด๋•Œ ๋ฌธ์žฅ ์ „์ฒด๋ฅผ ์ฝ๊ณ  ์˜๋ฏธ๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๊ฒƒ์€ Encoder๊ฐ€ ์ˆ˜ํ–‰ํ•˜๊ณ , ํŒŒ์•…ํ•œ ์˜๋ฏธ๋ฅผ ๋ฒˆ์—ญ๋œ ์ƒํƒœ๋กœ ์ถœ๋ ฅํ•˜๋Š” ๊ฒƒ์€ Decoder๊ฐ€ ์ˆ˜ํ–‰ํ•œ๋‹ค.

Encoder
Looking at the Encoder on the left of the figure, each cell does not emit an output; instead the $h_t$ information keeps being passed to the next timestep. So if the first timestep is 1, $h_1$ holds the information encoding the word "I", $h_2$ holds the encoding of "I am", and finally $h_4$ holds the encoded information of the entire sentence. This encoded information is called the embedding vector of the input, and the final output vector becomes the embedding vector of the whole sentence.

Decoder
Decoder์—์„œ๋Š” Encoder์—์„œ ์ถœ๋ ฅํ•œ embedding vector๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅํ•œ๋‹ค. ์ด๋•Œ Auto-Regressive Generation์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. Auto-Regressive Generation์ด๋ž€ timestep=tt์— ์ถœ๋ ฅ yty_t๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ์ด์ „ ๋‹จ๊ณ„์˜ htโˆ’1h_{t-1}๊ณผ ํ•จ๊ป˜ ytโˆ’1y_{t-1}์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ด ๊ณผ์ •์„ ํ†ตํ•ด ๋ชจ๋ธ์ด ์ž์‹ ์ด ์ถœ๋ ฅํ–ˆ๋˜ ๋‹จ์–ด๋ฅผ ๊ธฐ์–ตํ•˜๋ฉฐ ๋‹ค์Œ์— ์ถœ๋ ฅํ•  ์ ์ ˆํ•œ ๋‹จ์–ด๋ฅผ ์„ ํƒํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ „์ฒด ๋ฌธ์žฅ์— ๋Œ€ํ•œ inference ์„ฑ๋Šฅ์ด ์ข‹์•„์ง„๋‹ค.

RNN์˜ recurrentํ•œ ๊ตฌ์กฐ
timestep=tt์— ์ด์ „ ๋‹จ๊ณ„์—์„œ ๊ณ„์‚ฐํ•œ htโˆ’1h_{t-1}๊ณผ ํ˜„์žฌ์˜ input์ธ xtx_t๋ฅผ ์‚ฌ์šฉ

Auto-Regressive Generation
htโˆ’1h_{t-1}๊ณผ ์ด์ „ ๋‹จ๊ณ„์˜ ์ถœ๋ ฅ์ธ ytโˆ’1y_{t-1}์„ ์‚ฌ์šฉ

์˜๋ฏธ์  ์ฐจ์ด
RNN์˜ recurrentํ•œ ๊ตฌ์กฐ๋Š” ์ „์ฒด sequence๋ฅผ ์‹œ๊ฐ„์— ๋”ฐ๋ผ ์—ฐ์†์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ตฌ์กฐ์ธ ๊ฒƒ์ด๊ณ , Auto-Regressive Generation์€ ๋‹ค์Œ ์ถœ๋ ฅ์„ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•œ ๋‹จ๊ณ„์ ์ธ ๊ณผ์ •์ธ ๊ฒƒ์ด๋‹ค.

For reference, during training the ground-truth token from the previous step is used instead of the previous output; this is called Teacher Forcing.
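A sketch of that difference in code, using a hypothetical GRU-based decoder; the embed/cell/out layers, sizes, and token ids below are all made up for illustration:

```python
import torch
import torch.nn as nn

V, d_h = 1000, 16                 # assumed vocabulary size and hidden size
embed = nn.Embedding(V, d_h)
cell = nn.GRUCell(d_h, d_h)
out = nn.Linear(d_h, V)

def decode(h, targets=None, bos_id=1, max_len=10):
    y_prev, outputs = torch.tensor(bos_id), []
    for t in range(max_len):
        h = cell(embed(y_prev).unsqueeze(0), h.unsqueeze(0)).squeeze(0)
        y_t = out(h).argmax()     # the model's own prediction
        outputs.append(y_t.item())
        if targets is not None:   # training: teacher forcing feeds the
            y_prev = targets[t]   # ground-truth token from the previous step
        else:                     # inference: auto-regressive generation
            y_prev = y_t          # feeds the model's own previous output
    return outputs

h0 = torch.zeros(d_h)             # would be the encoder's embedding vector
print(decode(h0))                                   # inference
print(decode(h0, targets=torch.randint(V, (10,))))  # training-style
```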




We're almost at the Transformer now!

Attention Mechanism

Earlier we saw that RNN, LSTM, and GRU each improved on the shortcomings of the model before it, and that the Seq2Seq model was developed to process sequential data effectively using them.
But Seq2Seq still suffers from
1. the Vanishing Gradient problem fundamental to RNNs, and
2. information loss on long sequences, because it tries to pack all the information into a single embedding vector,
so the ⭐concept of Attention⭐ was introduced to improve on these.

Attention์ด๋ž€?

Seq2Seq์˜ Decoder์—์„œ ์ถœ๋ ฅ์„ ๋‚ผ ๋•Œ input์˜ ๋ชจ๋“  ์ •๋ณด๋ฅผ ์••์ถ•ํ•˜์—ฌ ํฌํ•จํ•˜๋Š” embedding vector๋ฟ ์•„๋‹ˆ๋ผ Encoder์˜ ๋ชจ๋“  input์˜ ๊ฐ ๋‹จ๊ณ„์—์„œ ๊ณ„์‚ฐ๋œ hidden state๋ฅผ ์‚ฌ์šฉํ•˜๋„๋ก ํ•œ๋‹ค. ํ•˜์ง€๋งŒ ์ด๋•Œ ๋ชจ๋“  step์ด ๋ฌธ์žฅ ๋‚ด์—์„œ ๊ฐ™์€ ์ •๋„์˜ ์ค‘์š”๋„๋ฅผ ์ฐจ์ง€ํ•˜์ง€ ์•Š์„ ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— Attention score๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•œ mechanism์ด ํ•„์š”ํ•˜๋‹ค.

The Attention Function

Attention์€ Query, Key, Value๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค.
Decoder์˜ ํ˜„์žฌ ์ƒํƒœ์˜ hidden state๊ฐ€ ๋น„๊ต์ž๊ฐ€ ๋˜์–ด Encoder์˜ ๋ชจ๋“  hidden state๋“ค์„ ๋น„๊ต ๋Œ€์ƒ์œผ๋กœ ํ•˜์—ฌ ์–ผ๋งŒํผ ์œ ์‚ฌํ•œ์ง€๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค. ๋”ฐ๋ผ์„œ Query, Key, Value๋Š” ๊ฐ™์€ shape์„ ๊ฐ€์ ธ์•ผ ํ•œ๋‹ค.
Attention์„ ํ•จ์ˆ˜๋กœ ํ‘œํ˜„ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

$$\text{Attention}(Q, K, V) = \text{Attention Value}$$

Query: the hidden state of the Decoder cell at timestep $t$
Key: the hidden states of the Encoder cells at all timesteps
Value: the hidden states of the Encoder cells at all timesteps

Dot Product Attention

Seq2Seq ๋ชจ๋ธ์—์„œ ์‚ฌ์šฉ๋˜๋Š” Dot Product Attention์„ ๋จผ์ € ์‚ดํŽด๋ณด์ž.

Source: https://wikidocs.net/22893

๊ทธ๋ฆผ์—์„œ Decoder์˜ ์ƒํƒœ๋ฅผ ๋ณด๋ฉด ์ด๋ฏธ Attention Mechanism์„ ํ†ตํ•ด ์„ธ ๊ฐœ์˜ token๋“ค์„ ์ถ”์ถœํ•œ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. 'suis'๋ฅผ ์ถœ๋ ฅํ•œ ์„ธ ๋ฒˆ์งธ LSTM ์…€์€ ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์ถœ๋ ฅํ•˜๊ธฐ ์œ„ํ•ด Encoder์˜ ๋ชจ๋“  hidden state๋“ค์„ ์‚ดํŽด๋ณผ ๊ฒƒ์ด๋‹ค.

1. Computing the Attention Score
All of the Encoder's hidden states: $h_1, \dots, h_T$ (in the figure, $T = 4$)
The Decoder's hidden state: $s_t$ (in the figure, $t = 3$)
Here $h_t$ and $s_t$ have the same shape.

Attention score: $e_t = [s_t^\top h_1, \dots, s_t^\top h_T]$
Because we take the dot product of two vectors, this amounts to computing a similarity. In other words, it is fair to see this as finding the Encoder hidden state most similar to the hidden state of the Decoder's current LSTM cell.

2. Obtaining the Attention Distribution with Softmax
Softmax is an activation function frequently used for multi-class classification; it maps every value into the range 0 to 1 and makes them sum to 1. Applying it to $e_t$ therefore shows how the scores are distributed. The result is called the Attention Distribution (or Attention Coefficients), and each individual value is an Attention Weight.

$$\alpha_t = \text{softmax}(e_t)$$

3. Computing the Attention Value
Multiply each Encoder hidden state $h_i$ by its Attention Weight $[\alpha_t]_i$. Since the Attention Score computed above is the similarity to the Decoder's hidden state at the current step $t$, multiplying it into the Encoder's hidden states means giving the most similar token the largest influence. Now sum all of these values:

$$a_t = \sum_{i=1}^{T} [\alpha_t]_i h_i$$

This value is the Attention Value!!

4. Concatenate & $\tilde{s}_t$
The Attention Value computed this way is concatenated with the Decoder's hidden state $s_t$ to form $v_t$. Finally, to compute $\tilde{s}_t$, which becomes the input to the output layer, a linear transformation is applied and the result is passed through $\tanh$.

5. Prediction
$\tilde{s}_t$ is used as the input to the output layer to compute the final output.

Phew, that was complicated.. To sum up!
1. To produce the output at the Decoder's current step, look at all of the Encoder's hidden states.
2. To decide how much each Encoder hidden state should be reflected in the output, extract the Attention Value.
3. Use the Attention Value to compute the Decoder's output.
(A code sketch of the whole procedure follows below.)
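A minimal sketch of steps 1 through 5, assuming four encoder hidden states and one decoder state; the shapes and the $W_c$ name are my assumptions, following the figure's source:

```python
import torch
import torch.nn.functional as F

T, d = 4, 16
enc_h = torch.randn(T, d)            # h_1 ... h_T from the encoder
s_t = torch.randn(d)                 # decoder hidden state at step t

e_t = enc_h @ s_t                    # 1. scores: [s_t.h_1, ..., s_t.h_T]
alpha_t = F.softmax(e_t, dim=0)      # 2. attention distribution (sums to 1)
a_t = alpha_t @ enc_h                # 3. attention value: weighted sum of h_i
v_t = torch.cat([a_t, s_t])          # 4. concatenate with the decoder state
W_c = torch.randn(d, 2 * d) * 0.1    #    hypothetical projection weights
s_tilde = torch.tanh(W_c @ v_t)      # 5. the input fed to the output layer
```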

Attention์„ ๊ณ„์‚ฐํ•˜๋Š” ์ข…๋ฅ˜๋Š” ์ด๊ฒƒ ๋ง๊ณ ๋„ ๋‹ค์–‘ํ•˜๋‹ค๊ณ  ํ•˜์ง€๋งŒ, Transformer๋ฅผ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•œ ๊ธฐ๋ณธ์ ์ธ, Seq2Seq์—์„œ ๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” mechanism์„ ์‚ดํŽด๋ณธ ๊ฒƒ์ด๋‹ค.

Now we're fully ready to meet the Transformer... finally...
Let's review with keywords only!
RNN > LSTM, GRU > Seq2Seq > Dot-Product Attention
Remember them all?




Transformer

Seq2Seq์˜ ํ•œ๊ณ„

๊ธฐ์กด์˜ RNN ๊ธฐ๋ฐ˜์˜ ๋ชจ๋ธ๋“ค์€ Vanishing Gradient Problem์ด ์žˆ์—ˆ๋‹ค. ์ด๋ฅผ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด Attention mechanism์ด ์‚ฌ์šฉ๋˜์—ˆ๋Š”๋ฐ, RNN์„ ๋ณด์ •ํ•˜๋Š” ๊ธฐ๋Šฅ์ด ์•„๋‹Œ, Attention ์ž์ฒด๋งŒ์œผ๋กœ sequentialํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ชจ๋ธ์ด ๋ฐ”๋กœ Transformer์ด๋‹ค.
Transformer๋Š” RNN ๊ธฐ๋ฐ˜์˜ ๋ชจ๋ธ๋“ค์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์ง€๋งŒ, Seq2Seq์˜ Encoder Decoder ๊ตฌ์กฐ๋Š” ์œ ์ง€ํ•œ๋‹ค. Attention mechanism ๋˜ํ•œ Seq2Seq์™€ ์‚ด์ง ๋‹ค๋ฅธ๋ฐ, ์—ฌ๊ธฐ์„œ๋Š” Transformer๋Š” Self-Attention์ด๋ผ๋Š” mechanism์„ ์‚ฌ์šฉํ•œ๋‹ค.

Intro

Source: https://wikidocs.net/31379

๊ธฐ์กด์˜ Seq2Seq์—์„œ๋Š” Decoder์˜ hidden state์—์„œ Encoder์˜ ๋ชจ๋“  hidden state๋“ค๊ณผ ๊ฒฐํ•ฉ๋˜์—ˆ๋‹ค. ๋”ฐ๋ผ์„œ ํ˜„์žฌ ๋‹จ๊ณ„์—์„œ ์ถœ๋ ฅ์„ ๋‚ด๊ธฐ ์œ„ํ•ด ์ž…๋ ฅ sequence์˜ ๋ชจ๋“  ๊ด€๊ณ„๋ฅผ ์‚ดํŽด๋ณธ ๊ฒƒ์ด๋‹ค. ๊ตฌ์กฐ ๋˜ํ•œ ์ž…๋ ฅ sequence์— ๋งž์ถฐ token ๋‹น ํ•˜๋‚˜์˜ RNN ๊ธฐ๋ฐ˜์˜ cell์ด ์‚ฌ์šฉ๋˜์—ˆ๋Š”๋ฐ, Transformer์—์„œ๋Š” sequence data๊ฐ€ ํ•˜๋‚˜์˜ Encoder์— ํ†ต์งธ๋กœ ์ž…๋ ฅ๋œ๋‹ค.

Positional Encoding

Because the Transformer uses as its Attention a comparison of the current target token against every token of the input, including itself, the order information is ignored even though the input data is sequential. Positional Encoding is therefore applied to inject order information artificially. It is applied identically in the Encoder and the Decoder.

Sinusoidal Encoding
A Positional Encoding scheme using the $\sin$ and $\cos$ functions.

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

As the formula shows, when a position value is fed into sin or cos, the period is stretched so the position signal does not repeat; thus even for long sequences, tokens at different positions are not assigned the same value.
Also, $\sin$ or $\cos$ is applied depending on whether the dimension index is even ($2i$) or odd ($2i+1$); since the same exponent $2i/d_{model}$ is applied to both, we can tell this is meant to produce similar but distinct values.

Self Attention - Encoder

Transformer์˜ self-attention์€ ํ˜„์žฌ ๊ธฐ์ค€์ด ๋˜๋Š” ๋‹จ์–ด์ธ Query๋ฅผ ์ž๊ธฐ ์ž์‹ ์„ ํฌํ•จํ•œ context์ธ Key๋“ค๊ณผ ๋น„๊ตํ•œ๋‹ค. ์ด ๊ณผ์ •์„ ํ†ตํ•ด ์ž๊ธฐ ์ž์‹ ์ด context ๋‚ด์—์„œ ์–ด๋–ค ์˜๋ฏธ๋ฅผ ๊ฐ–๋Š”์ง€ ์•Œ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— self-attention์ด๋ผ๋Š” ์ด๋ฆ„์ด ๋ถ™์—ˆ๋‹ค. ์ด๋•Œ Value๋Š” context์˜ ๋‹จ์–ด๋“ค์ด๋‹ค.

Source: me!

1. Building Query, Key, Value
Each of the input tokens is multiplied with $W_Q$, $W_K$, $W_V$ to create its Query, Key, and Value; taking the product with the Query of $x_t$, the target at the current timestep $t$, yields a similarity just as in the earlier Dot-Product Attention.

Note!!!
Each weight matrix is shared within its own role (Q, K, V) but not across different roles. So Queries are learned only with other Queries, Keys only with Keys, and Values only with Values.

2. Scaled Dot-Product Attention
The similarity computed in the previous step is scaled by dividing by $\sqrt{d_k}$, where $d_k$ is the dimension of the Query and Key vectors. Because the Transformer uses multi-head attention, performing attention several times in parallel within one encoder, each head gets an equal share of the model dimension:

$$d_k = d_{model} / \text{num\_heads}$$

scaling์„ ํ•˜๋Š” ์ด์œ 
Query์™€ Key์˜ ๊ธธ์ด๊ฐ€ ๊ธธ์–ด์งˆ ์ˆ˜๋ก ํ•˜๋‚˜์˜ token์— ๋Œ€ํ•œ ๋‚ด์ ๊ฐ’์ด ์ปค์ง€๋ฉฐ, ๋ชจ๋“  token๋“ค์˜ ๊ฒฐ๊ณผ๊ฐ€ ๋„“์€ ๋ถ„ํฌ๋ฅผ ๊ฐ€์ง€๊ฒŒ ๋œ๋‹ค. ๋”ฐ๋ผ์„œ ๋ชจ๋‘ Softmax๋ฅผ ํ†ต๊ณผํ•˜๋ฉด ๊ฐ ๊ฐ’๋“ค์ด 0์— ๊ฐ€๊นŒ์šด ๊ฐ’์œผ๋กœ ๋ณ€ํ™˜๋  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ด๋ฅผ ๋ฐฉ์ง€ํ•˜๊ณ ์ž scaling์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

3. Computing the Attention Value
After scaling, once we have a value for each token, the scores are passed through $\text{Softmax}$ to arrange the whole distribution nicely between 0 and 1. The weights are then multiplied with the Values, and the results are concatenated across heads.
This output is the Attention Value.

4. FFNN (Feed-Forward Neural Network)
The Attention Value from step 3 concatenates the tokens from all heads, so it has shape ($\text{seq\_len}, d_{model}$). A further linear transformation is therefore applied to bring it back to the original input shape, which also has the effect of compressing the full information and extracting features once more.

5. Multi-Head Self-Attention
As briefly mentioned, the Transformer uses multiple heads. The tokens entering one Encoder are split into a number of heads fixed when the model is built, the Attention computation runs in each head in parallel, and the results are finally concatenated. That is, steps 2 and 3 are performed $\text{num\_heads}$ times in parallel inside the Encoder, followed by the FFNN of step 4. (A sketch of the whole flow follows below.)
Also, Attention Is All You Need stacks 6 Encoders and 6 Decoders each, so the process of steps 2 through 4 is performed six times before being passed to the Decoder!
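Putting steps 1 through 3 together, here is a sketch of multi-head scaled dot-product self-attention over one input sequence; all sizes and weight names are assumptions for illustration, not the paper's code:

```python
import torch
import torch.nn.functional as F

seq_len, d_model, num_heads = 10, 64, 8
d_k = d_model // num_heads                       # d_k = d_model / num_heads

x = torch.randn(seq_len, d_model)                # embeddings + positional encoding
W_Q, W_K, W_V, W_O = (torch.randn(d_model, d_model) * 0.1 for _ in range(4))

def heads(m):                                    # (seq_len, d_model) -> (heads, seq_len, d_k)
    return m.view(seq_len, num_heads, d_k).transpose(0, 1)

Q, K, V = heads(x @ W_Q), heads(x @ W_K), heads(x @ W_V)   # step 1
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5              # step 2: scaled similarity
weights = F.softmax(scores, dim=-1)                        # per-head distributions
ctx = weights @ V                                          # step 3: weighted Values
out = ctx.transpose(0, 1).reshape(seq_len, d_model) @ W_O  # concat heads + linear
print(out.shape)                                 # torch.Size([10, 64]): same as input
```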

Self Attention - Decoder

The Decoder consists of two attention sub-layers: the first performs Masked Multi-Head Self-Attention, and the second performs Multi-Head Self-Attention over the Encoder's outputs.

1. Masked Multi-Head Self-Attention
The Decoder's input is the positional-encoded sequence output by the Encoder or by the previous-stage Decoder, i.e. the output of the block immediately before. But at actual inference time, when the Decoder produces an output it cannot know anything past the current step, so among the inputs, the information after the current timestep $t$ is masked. The self-attention computation performed inside the Decoder is otherwise identical to the Encoder's. (A sketch of the mask follows below.)

2. Encoder-Decoder Multi-Head Self-Attention
The Decoder's first sub-layer takes the previous block's output as its input, so its Query, Key, and Value all come from the same source; from the second sub-layer on, however, only the Query uses the Decoder's output, while the Key and Value come from the final Encoder.

Encoder์˜ ๊ฐ’์„ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ 
Decoder์˜ ์ฒซ ๋ฒˆ์งธ Sub-layer์—์„œ ์ถœ๋ ฅํ•œ Query๋ฅผ Encoder์˜ Key, Value์™€ ๋น„๊ตํ•จ์œผ๋กœ์จ Decoder๊ฐ€ ํ˜„์žฌ ์ถœ๋ ฅํ•  token์ด ์ตœ์ดˆ ์ž…๋ ฅ๊ณผ ์–ด๋–ค ๊ด€๊ณ„๋ฅผ ๊ฐ–๋Š”์ง€๋ฅผ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋‹ค.
์ฆ‰, Encoder-Decoder ๊ฐ„์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ํ†ตํ•ด ์ž…๋ ฅ ๋ฌธ์žฅ๊ณผ ์ถœ๋ ฅ ๋ฌธ์žฅ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋ชจ๋ธ๋งํ•œ๋‹ค.
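Shape-wise, only the Query comes from the Decoder, while Key and Value come from the Encoder's final output; all tensors below are stand-ins:

```python
import torch
import torch.nn.functional as F

tgt_len, src_len, d_k = 3, 10, 8
Q = torch.randn(tgt_len, d_k)   # from the decoder's first (masked) sub-layer
K = torch.randn(src_len, d_k)   # from the last encoder layer
V = torch.randn(src_len, d_k)   # from the last encoder layer

weights = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # (tgt_len, src_len)
context = weights @ V           # each output token, related back to the input
print(context.shape)            # torch.Size([3, 8])
```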

Now I'll read Attention Is All You Need and fill in the concepts I missed or couldn't quite connect!
