[DL] Sequence to Sequence model

미남로그 · March 11, 2022

Reference

📄 A Simple Introduction to Sequence to Sequence Models

While studying seq2seq, I had trouble understanding the architecture, so I found the post above and studied and organized the material based on it.

If there are any errors in the content, I would appreciate it if you let me know in the comments.


Overview

sequence to sequence models์€ machine translation, video captioning, image captioning, question answering ๋“ฑ์˜ ๋ถ„์•ผ์—์„œ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

ํ•ด๋‹น ๊ฐœ๋…์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด neural network์— ๋Œ€ํ•œ ๊ฐœ๋…์„ ์ž˜ ์•Œ๊ณ  ์žˆ์–ด์•ผ ํ•˜๋ฉฐ, ํŠนํžˆ RNN(recurrent neural network), LSTM(long-short-term-memory), GRU model์— ๋Œ€ํ•ด ์•Œ๊ณ  ๊ณ„์‹œ๋Š”๊ฒŒ ์ข‹์Šต๋‹ˆ๋‹ค!

์ €๋Š” ์ด์ƒํ•˜๊ฒŒ RNN, LSTM, GRU ์ด์ƒ seq to seq, attention, transformer ... ๋“ฑ ๋’ท๋ถ€๋ถ„ ์ง„๋„๊ฐ€ ์ž˜ ์•ˆ ๋‚˜๊ฐ”๋Š”๋ฐ, ํฌ๊ธฐํ•˜์ง€ ์•Š๊ณ  ๊ณ„์† ๊ฐ์ž์˜ ๋ŒํŒŒ๊ตฌ๋ฅผ ์ฐพ์•˜์œผ๋ฉด ์ข‹๊ฒ ์Šต๋‹ˆ๋‹ค.


Use Cases

Sequence to sequence ๋ชจ๋ธ์€ ์•ž์—์„œ ์–ธ๊ธ‰ํ–ˆ๋“ฏ ์ผ์ƒ์ ์œผ๋กœ ์ ‘ํ•˜๋Š” ์ˆ˜๋งŽ์€ ์‹œ์Šคํ…œ์ž…๋‹ˆ๋‹ค.

For example, seq2seq models power applications such as Google Translate, voice-enabled devices, and online chatbots.

๋‹ค์Œ์€ ์ผ๋ถ€ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์ž…๋‹ˆ๋‹ค.

Machine translation

Speech recognition

๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

A seq2seq (sequence-to-sequence) model is a solution for sequence-based problems, and in particular it resolves the issue of the input and output differing in size or category.


encoder-decoder architecture

seq2seq์—๋Š” encoder์™€ decoder๋ผ๋Š” ๋‘ ๊ฐ€์ง€ ๊ตฌ์„ฑ ์š”์†Œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

encoder

  • Both the encoder and the decoder are LSTM models.
  • The encoder reads the input sequence and summarizes the information into internal state vectors, or a context vector (for an LSTM, these are the hidden state and cell state vectors).
  • The outputs of the encoder are discarded; only the internal states are kept.
  • The context vector aims to encapsulate the information of all the input elements so that the decoder can make accurate predictions.
  • Each hidden state $h_i$ is computed by the formula $h_i = f(W^{(hh)} h_{i-1} + W^{(hx)} x_i)$.

LSTM์€ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆœ์„œ๋Œ€๋กœ ์ฝ๋Š”๋ฐ์š”.

์ž…๋ ฅ์˜ ๊ธธ์ด 't'์˜ sequence์˜ ๊ฒฝ์šฐ LSTM์€ 't'์˜ time step์—์„œ ์ด๋ฅผ ์ฝ์Šต๋‹ˆ๋‹ค.

  1. $x_i$: the input sequence at time step i.
  2. $h_i$ and $c_i$: the LSTM maintains two states at each time step ('h' for the hidden state, 'c' for the cell state).
  3. $y_i$: the output sequence at time step i. $y_i$ is actually a probability distribution over the entire vocabulary, generated with a softmax activation, so each $y_i$ is a vector of size 'vocab_size' representing that distribution.
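To make the encoder concrete, here is a minimal Keras sketch (assuming TensorFlow/Keras with one-hot inputs; `num_encoder_tokens` and `latent_dim` are illustrative placeholders, not values from the post): the LSTM consumes the input sequence step by step, its per-step outputs are discarded, and only the final internal states are kept as the context vector.

```python
from tensorflow import keras
from tensorflow.keras import layers

num_encoder_tokens = 71   # assumed source vocabulary size
latent_dim = 256          # assumed size of the hidden/cell state vectors

# The encoder LSTM reads the input sequence one time step at a time.
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
encoder_lstm = layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)

# The per-step outputs are discarded; only the final internal states
# (the context vector) are kept and handed to the decoder.
encoder_states = [state_h, state_c]
```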

decoder

  • The decoder is an LSTM whose initial states are initialized with the final states of the encoder LSTM.
  • That is, the context vector from the encoder's final cell is fed into the first cell of the decoder network.
  • Using these initial states, the decoder starts generating the output sequence, and its outputs are also taken into account for future outputs.
  • It is a stack of LSTM units, each of which predicts an output $y_t$ at time step $t$.
  • Each recurrent unit accepts the hidden state of the previous unit and produces both its own hidden state and an output.

Every hidden state $h_i$ is computed using the formula $h_i = f(W^{(hh)} h_{i-1})$.

The output $y_t$ at time step $t$ is computed using the formula $y_t = \mathrm{softmax}(W^{(S)} h_t)$.

๊ฐ ๊ฐ€์ค‘์น˜ W(S)W(S)์™€ ํ•จ๊ป˜ ํ˜„์žฌ time step์—์„œ hidden state๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ output์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

Softmax is used to create a probability vector that helps determine the final output (e.g., a word in a question-answering problem).
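Continuing the encoder sketch above, a decoder LSTM can be wired to start from the encoder's final states, with a softmax Dense layer playing the role of $y_t = \mathrm{softmax}(W^{(S)} h_t)$ over an assumed target vocabulary:

```python
num_decoder_tokens = 93   # assumed target vocabulary size

decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)

# initial_state=encoder_states: the context vector becomes (h_0, c_0).
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

# y_t = softmax(W^(S) h_t): one probability vector per time step.
decoder_dense = layers.Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)
```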

์œ„์˜ ์ด๋ฏธ์ง€์™€ ๊ฐ™์ด output sequence์— ๋‘ ๊ฐœ์˜ token์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.


Example

์œ„์— reference๋กœ ๋‚จ๊ธด posting์—์„œ ์˜ˆ์‹œ ๋ฌธ์žฅ์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.

"START_ John is hard working _END"
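As a small illustration (the token lists are my own, and the split into input and target, commonly called teacher forcing, is an assumption about the training setup rather than something the post spells out), the sentence would be prepared for the decoder like this:

```python
# Hypothetical tokenization of the example sentence.
sentence = ["START_", "John", "is", "hard", "working", "_END"]

# At each step the decoder sees the previous ground-truth token and is
# trained to predict the next one.
decoder_input_tokens  = sentence[:-1]  # START_ John is hard working
decoder_target_tokens = sentence[1:]   # John is hard working _END
```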

The most important point is that the decoder's initial states $(h_0, c_0)$ are set to the final states of the encoder.

์ด ์‚ฌ์‹ค์€ decoder๊ฐ€ encoder์— ์˜ํ•ด encoding๋œ ์ •๋ณด์— ๋”ฐ๋ผ output sequence๋ฅผ ์‹œ์ž‘ํ•˜๋„๋ก ํ›ˆ๋ จ๋˜์—ˆ์Œ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

Finally, the loss is computed on the predicted outputs of each time step, and the errors are backpropagated through time to update the network's parameters.

Training the network on a sufficiently large amount of data for a longer period yields better predictions.
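A training sketch under the same assumptions as the snippets above (`encoder_input_data`, `decoder_input_data`, and `decoder_target_data` are assumed to be one-hot numpy arrays prepared as in the previous snippet):

```python
# Assumed: one-hot arrays of shape (num_samples, max_len, vocab_size).
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")

# The cross-entropy loss is computed on every predicted time step and the
# error is backpropagated through time to update all parameters.
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=64, epochs=50, validation_split=0.2)
```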

Overall Encoder-Decoder Architecture

  • During inference, one word is generated at a time.
  • The decoder's initial states are set to the final states of the encoder.
  • The initial input is always the START token.
  • At each time step, the decoder's states are preserved and set as the initial states of the next time step.
  • The predicted output of each time step is fed as input in the next time step.
  • The loop is broken when the decoder predicts the END token; see the sketch after this list.
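A minimal inference sketch reusing the trained layers from the snippets above (`start_idx` and `end_idx` are hypothetical indices of the START/END tokens in the target vocabulary): the decoder loop feeds each prediction and state back in until the END token appears.

```python
import numpy as np

# Encoder model: input sequence -> context vector [h, c].
encoder_model = keras.Model(encoder_inputs, encoder_states)

# Decoder model: (previous token, previous states) -> (probabilities, new states).
state_h_in = keras.Input(shape=(latent_dim,))
state_c_in = keras.Input(shape=(latent_dim,))
dec_outputs, dec_h, dec_c = decoder_lstm(decoder_inputs,
                                         initial_state=[state_h_in, state_c_in])
dec_outputs = decoder_dense(dec_outputs)
decoder_model = keras.Model([decoder_inputs, state_h_in, state_c_in],
                            [dec_outputs, dec_h, dec_c])

def decode_sequence(input_seq, start_idx, end_idx, max_len=50):
    """Greedy decoding: generate one token per step until END or max_len."""
    h, c = encoder_model.predict(input_seq)       # context vector
    target = np.zeros((1, 1, num_decoder_tokens))
    target[0, 0, start_idx] = 1.0                 # initial input: START token
    decoded = []
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([target, h, c])
        idx = int(np.argmax(probs[0, -1, :]))     # most probable next token
        if idx == end_idx:                        # END token breaks the loop
            break
        decoded.append(idx)
        target = np.zeros((1, 1, num_decoder_tokens))
        target[0, 0, idx] = 1.0                   # feed prediction back in
        # h, c carry the decoder state over to the next time step
    return decoded
```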

Encoder-Decoder Models์˜ ์žฅ๋‹จ์ 

  1. The memory is very limited.
    The final hidden state of the LSTM, here called S or W, is where the model tries to cram the entire sentence to be translated.
    S or W is usually only a few hundred units long, and the more we try to force into this fixed-dimensional vector, the lossier the neural network becomes.
    It is said to be quite useful to think of neural networks in terms of the lossy compression they have to perform.
  2. As a general rule, the deeper a neural network is, the harder it is to train. For recurrent neural networks, the longer the sequence, the deeper the network becomes along the time dimension.
    This causes the gradients of the objective the RNN learns from to vanish during the backward pass.
    Even with RNNs specially built to prevent vanishing gradients, such as the LSTM, this still remains a fundamental problem.

Also, for longer sentences and more power, there are models such as Attention models and the Transformer.

So far, we have looked at the encoder and decoder of seq2seq.
