[DL] Attention - Seq2Seq Models

cha-suyeon · March 11, 2022

Reference

📄 Attention - Seq2Seq Models

This post is for understanding seq2seq with Attention. I studied and organized the content based on the reference above.

์žฌ์ƒ์‚ฐ๋˜๋Š” ์ œ ๋‚ด์šฉ ์ค‘ ์˜ค๋ฅ˜๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿด ๊ฒฝ์šฐ ๋Œ“๊ธ€๋กœ ์•Œ๋ ค์ฃผ์‹œ๋ฉด ๊ฐ์‚ฌํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

Also, if you would like to first understand seq2seq with LSTMs, please see the Sequence to Sequence model post. Thank you.


Seq2Seq Model

Seq2Seq Model์ด ๋ฌด์—‡์ธ์ง€ ๋จผ์ € ์ดํ•ดํ•˜๊ณ  ๊ฐ€๋ฉด ์ข‹์„ ๋“ฏ ํ•ฉ๋‹ˆ๋‹ค.

It is a model that takes a sequence of items, such as words, characters, or time series, and outputs another sequence.

๋ชจ๋ธ์€ encoder์™€ decoder๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค.

encoder์€ hidden state vector ํ˜•ํƒœ๋กœ input sequence์˜ context๋ฅผ ํฌ์ฐฉํ•˜๊ณ , decoder๋กœ ํ˜๋ ค ๋ณด๋‚ด output sequence๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

This entire process operates on sequences.

๋”ฐ๋ผ์„œ RNN, LSTM, GRU ๋“ฑ์˜ ํ˜•์‹์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

The hidden state vector can be of any size, but in most cases it is chosen to be a power of 2 (256, 512, 1024, ...).


RNN

Looking at the diagram, an RNN takes 2 inputs.

It receives the current value and a representation of the previous inputs.

So the output at time step t depends on the current input and the input at t-1.

This sequential information is preserved in the network's hidden state and used at the next instance.
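The single-step recurrence described above can be sketched in numpy; the dimensions and random initialization below are purely illustrative, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 4-dim inputs, 8-dim hidden state.
input_size, hidden_size = 4, 8

# Parameters of a vanilla RNN cell (hypothetical initialization).
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One RNN step: the output depends on the current input x_t and on
    h_prev, the representation of everything seen up to time t-1."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

x_t = rng.normal(size=input_size)
h_prev = np.zeros(hidden_size)   # initial hidden state
h_t = rnn_step(x_t, h_prev)
print(h_t.shape)  # (8,)
```

The new hidden state `h_t` is exactly the "preserved sequential information" that gets fed into the next time step.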

An encoder built from RNNs takes a sequence as input and produces a final embedding at the end of the sequence.

๊ทธ๋Ÿฐ ๋‹ค์Œ decoder๋Š” ์ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ sequence๋ฅผ ์˜ˆ์ธกํ•˜๊ณ  ๋ชจ๋“  ์—ฐ์†์„ ์˜ˆ์ธกํ•œ ๋‹ค์Œ์— ์ด์ „ hidden state๋ฅผ ์‚ฌ์šฉํ•˜ sequence์˜ ๋‹ค์Œ instance๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.

Because the output sequence depends heavily on the context defined by the hidden state at the encoder's final output, the model struggles to handle long sentences.

This is because the longer the sequence gets, the more of the early context is lost.

To solve this problem, a technique called Attention was introduced: at every step of the output sequence, the model focuses on a different part of the input sequence, thereby preserving context.


Attention

"Now I'm getting your ATTENTION!", it says. Haha, please give me some attention too...

์•ž์˜ ๋‚ด์šฉ์„ ๊ฐ„๋‹จํžˆ ์ •๋ฆฌํ•˜์ž๋ฉด, ๊ฒฐ๋ก ์ ์œผ๋กœ encoder์˜ ๋์— ์žˆ๋Š” single hidden state vector๊ฐ€ ์ถฉ๋ถ„ํ•˜์ง€ ์•Š๋‹ค๋Š” ๊ฒƒ์ด ๋ฌธ์ œ์˜€์Šต๋‹ˆ๋‹ค.

So the new approach is to pass along as many hidden state vectors as there are input instances!

How does the decoder use these hidden state vectors?

So far, the only difference between the two models is that the decoding stage now receives the hidden states of all input instances.

Another piece added to build an Attention-based model is the context vector.

์ด๊ฒƒ์€ output sequence์˜ ๋ชจ๋“  time instance์— ์˜ํ•ด ์ƒ์„ฑ๋˜๋Š”๋ฐ์š”.

At every step, the context vector is a weighted sum of the encoder hidden states.
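That weighted sum can be sketched directly; the weight values below are made up for illustration (how they are actually computed is the second question that follows):

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_size, seq_len = 8, 5

# One hidden state per input instance (illustrative random values).
encoder_states = rng.normal(size=(seq_len, hidden_size))

# Attention weights for the current decoding step; they sum to 1.
weights = np.array([0.1, 0.5, 0.2, 0.1, 0.1])

# The context vector is the weighted sum of all encoder hidden states.
context = weights @ encoder_states   # shape: (hidden_size,)
print(context.shape)  # (8,)
```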

This raises two questions.

  1. How is the context vector used?
  2. How are weights 1, 2, and 3 determined?

The context vector is combined with the hidden state vector, and the new attention hidden vector is used to predict the output at that time step.
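One common way to do this combination, shown here as a sketch rather than the post's definitive method, is to concatenate the two vectors and project them back down with a learned matrix (the `W_c` below is a hypothetical parameter):

```python
import numpy as np

rng = np.random.default_rng(3)
hidden_size = 8

context = rng.normal(size=hidden_size)         # from the weighted sum
decoder_hidden = rng.normal(size=hidden_size)  # current decoder state

# Concatenate context and decoder hidden state, then project back to
# hidden_size with a (hypothetical) learned weight matrix W_c.
W_c = rng.normal(scale=0.1, size=(hidden_size, 2 * hidden_size))
attention_hidden = np.tanh(W_c @ np.concatenate([context, decoder_hidden]))

# This attention hidden vector is what predicts the output at this step.
print(attention_hidden.shape)  # (8,)
```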

Then how are the scores predicted?

์ด๊ฒƒ์€ ์ดˆ๊ธฐ์— seq2seq ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ํ›ˆ๋ จ๋œ ๋‹ค๋ฅธ neural network model์ธ alignment model์˜ ์ถœ๋ ฅ์ž…๋‹ˆ๋‹ค.

alignment model์€ input(represented by its hidden state)์ด ์ด์ „ output (represented by attention hidden state)๊ณผ ์–ผ๋งˆ๋‚˜ ์ž˜ ์ผ์น˜ํ•˜๋Š”์ง€ ์ ์ˆ˜๋ฅผ ๋งค๊ธฐ๊ณ , ๋ชจ๋“  input์— ๋Œ€ํ•ด ์ด์ „ output๊ณผ ์ผ์น˜์‹œํ‚ต๋‹ˆ๋‹ค.

๊ทธ๋Ÿฐ ๋‹ค์Œ softmax๊ฐ€ ๋ชจ๋“  ์ ์ˆ˜์— ์ ์šฉ๋˜๊ณ  ๊ฐ ์ž…๋ ฅ์— ๋Œ€ํ•œ attention score๊ฐ€ ๋‚˜์˜ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

๋”ฐ๋ผ์„œ ์ด์ œ output sequence์˜ ๊ฐ instance๋ฅผ ์˜ˆ์ธกํ•˜๋Š”๋ฐ input์˜ ์–ด๋Š ๋ถ€๋ถ„์ด ๊ฐ€์žฅ ์ค‘์š”ํ•œ์ง€ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

During training, the model learns to align the various instances of the output sequence with the input sequence.

The figure below shows an example from machine translation, displayed in matrix form.

๊ฐ ํ•ญ๋ชฉ์€ input ๋ฐ output sequence์˜ attention score์ž…๋‹ˆ๋‹ค.

With this, the complete model is finally in place.


Seq2Seq Attention Based Model


Summary

The new keywords that came up here were

context vector, alignment model, attention score

among others.

Working to understand these concepts should help you better understand the attention mechanism.
