[Paper Review] Neural Machine Translation by Jointly Learning to Align and Translate 🪟

Jhyunee · February 7, 2024

Summary ❕

💡 Locates the cause of the performance drop on long input sentences
in existing Encoder-Decoder based translation in the "fixed-length" vector,
and proposes a new model architecture.

⇒ Variable-length Encoding & Attention Decoding by using a
context vector

ย 

Review 🗒️

0. Abstract

Previous : Neural Machine Translation with an encoder-decoder
Flow : Source sentence → encoder → "fixed-length" vector → decoder → output
Here, using a "fixed-length" vector causes a bottleneck!

In this paper ; Automatic soft-search

  • Searches the parts of the input sentence that are relevant to the target word being predicted, w/o giving a hard segmentation explicitly.

1. Introduction

Traditional translation : Phrase-based translation system

  • Sub-components tuned separately

Previous neural translation : Train a single, large network, at the sentence level

  • Problem: all the information in the input sentence must be compressed into a single fixed-length vector
    • Causes performance degradation on long sentences

In this paper : uses a "context vector"

Proposes a solution to the information-compression problem that appears as the input sentence gets longer - an automatic soft-search at every generated word

  • Finds the positions where the most relevant information is concentrated ⇒ builds a context vector
    • Contains that position information (source position information)
    • Context vector + previously predicted words → target word prediction!

Mechanism : ✔️

In other words, the Encoder converts the input sentence into a sequence of vectors,
and the Decoder picks out the needed subset of the encoder's output vectors - via the context vector - and uses it.
This yields the following effects.

  1. Better translation performance
  2. Linguistically more appropriate (natural) translations

2. Background : Neural Machine Translation

Translation task == the task of searching for the sentence that maximizes a conditional probability.
: the conditional probability of $Y$ given a source sentence $X$, $p(Y \mid X)$
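
Equivalently, decoding returns the target sentence with the highest conditional probability:

$$\hat{Y} = \arg\max_{Y}\ p(Y \mid X)$$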

2.1 RNN Encoder-Decoder

Encoder :

Takes the input sentence and produces the context vector $c$ as follows.

  • Input sentence → Encoder → Sequence of vectors
    $X = (x_1, \dots, x_{T_x})$ (variable length) → Encoder → $c$ (context vector)

    RNN :

    $$(1)\quad h_t = f(x_t,\ h_{t-1}), \qquad c = q(\{h_1, ..., h_{T_x}\})$$

    where $h_t$ : hidden state at time $t$
    $c$ : context vector generated from the hidden states
    $f, q$ : some nonlinear functions
    e.g.) $f$ can be an LSTM
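
Below is a minimal numpy sketch of equation (1), assuming a plain tanh cell for $f$ and "take the last hidden state" for $q$ (the paper leaves both as generic nonlinear functions, e.g. an LSTM for $f$); all parameter names and sizes are illustrative.

```python
import numpy as np

def rnn_encode(X, W_xh, W_hh, b_h):
    """Eq. (1): h_t = f(x_t, h_{t-1}) with f chosen as a tanh RNN cell.
    X has shape (T_x, d_in); returns all hidden states, shape (T_x, d_h)."""
    h = np.zeros(W_hh.shape[0])
    hiddens = []
    for x_t in X:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # h_t = f(x_t, h_{t-1})
        hiddens.append(h)
    return np.stack(hiddens)

# toy usage with random parameters
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
X = rng.normal(size=(5, d_in))                      # a 5-word "sentence"
hs = rnn_encode(X, rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
c = hs[-1]                                          # one choice of q: c = h_{T_x}
```

In the basic encoder-decoder, this single vector $c$ is all the decoder ever sees of the source sentence, which is exactly the fixed-length bottleneck the paper targets.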

ย 

Decoder :

์ž…๋ ฅ์„ ๋ฐ›์•„ ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋„๋ก ํ•™์Šต๋œ๋‹ค.

  • Trained to predict the next word $y_t$
    (given $c$ & the previously predicted words $(y_1, \dots, y_{t-1})$)

    That is, the Decoder defines a probability over the translation $Y$; the conditional probability of predicting (translating into) each word is defined as below.

    $$(2)\quad p(Y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, ..., y_{t-1}\},\ c)$$

    where $Y = (y_1, \dots, y_{T_y})$. With an RNN :

    $$(3)\quad p(y_t \mid \{y_1, ..., y_{t-1}\},\ c) = g(y_{t-1},\ s_t,\ c)$$

    where $g$ : a nonlinear, potentially multi-layered function that computes the probability of $y_t$
    $s_t$ : hidden state of the RNN
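
A hedged sketch of equation (3): one decoder step scores the next word from $(y_{t-1}, s_t, c)$. A tanh state update and a softmax output layer stand in here for the "nonlinear, potentially multi-layered" $g$; the parameter names are illustrative. Multiplying these per-step probabilities over $t$ gives $p(Y)$ as in equation (2).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(y_prev, s_prev, c, W_ys, W_ss, W_cs, W_out, b_s, b_out):
    """Eq. (3): p(y_t | y_1..y_{t-1}, c) = g(y_{t-1}, s_t, c).
    y_prev: embedding of the previous target word, s_prev: previous RNN state,
    c: the (fixed) context vector from the encoder."""
    s_t = np.tanh(W_ys @ y_prev + W_ss @ s_prev + W_cs @ c + b_s)   # new hidden state
    logits = W_out @ np.concatenate([y_prev, s_t, c]) + b_out       # g(y_{t-1}, s_t, c)
    return softmax(logits), s_t                                     # distribution over the vocabulary
```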


3. Learning to Align and Translate

  • Bidirectional RNN Encoder & Searching Decoder

3.1 Decoder : General Description

Conditional probability : each conditional probability in $(2)$ is now defined as

$$(4)\quad p(y_i \mid \{y_1, ..., y_{i-1}\},\ X) = g(y_{i-1},\ s_i,\ c_i)$$

where $s_i$ : RNN hidden state at time $i$,

$$s_i = f(s_{i-1},\ y_{i-1},\ c_i)$$

์‹ (2)(2)์™€์˜ ์ฐจ์ด์  : cic_i
๊ฐ target word yiy_i๋งˆ๋‹ค ์กฐ๊ฑด๋ถ€ํ™•๋ฅ ์„ ์ •์˜ํ•˜๋Š” cc๊ฐ€, time ii๋งˆ๋‹ค ๊ฐœ๋ณ„์ ์œผ๋กœ ์ง€์ •๋˜์–ด ์žˆ๋‹ค.

Context vector : $c_i$

$c_i$ is determined by the encoder outputs (the mapped input sentence), $(h_1, \dots, h_{T_x})$, as follows.

Here, each annotation $h_i$ carries information about the whole input sequence
(containing a strong focus on the surroundings of the $i$-th word).

  • $c_i$ is computed as a weighted sum over the annotations $h_j$:

    $$c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j$$

    The weight $\alpha_{ij}$ is the probability that the target word $y_i$ is aligned to (translated from) the source word $x_j$.

    In other words, it expresses the 'importance' of $h_j$ when generating $y_i$, and that importance enters $c_i$ as a weight. ($\alpha_{ij}$, together with its energy $e_{ij}$, reflects the importance of the annotation $h_j$.)

    This works because the context vector $c_i$ carries information about how much relevant information the word $x_j$ at position $j$ holds for the $i$-th output word $y_i$!

    ⇒ In this way, as mentioned under Mechanism in Section 1,
    the Decoder appears to pick out and use the important (needed) subset of the encoder's output vectors.

    • The weight $\alpha_{ij}$ is computed as follows (see also the code sketch below).
    • Here, $e_{ij} = a(s_{i-1}, h_j)$, where $a$ is a feedforward neural network (the alignment model).

      $$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

    This is how the decoder's attention mechanism is implemented:
    the Decoder decides (or appears to decide) which parts of the input sentence to focus on.

โœ”๏ธ Encoder๊ฐ€ ์ž…๋ ฅ๋œ ์ „์ฒด ๋ฌธ์žฅ์„ fixed-length๋กœ ์••์ถ•ํ•˜๋Š” ๋ถ€๋‹ด์„ ๋œ์–ด์ค€๋‹ค.

ย 

3.2 Encoder : Bidirectional RNN for Annotating Sequences

BiRNN :

๊ฐ ๋‹จ์–ด๊ฐ€ ์ž๊ธฐ ์ž์‹ ์˜ ์ด์ „ ๋‹จ์–ด๋“ค์— ๋Œ€ํ•œ ์ •๋ณด๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์ดํ›„ ๋‹จ์–ด๋“ค์— ๋Œ€ํ•œ ์ •๋ณด๊นŒ์ง€ ์–ป์„ ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•จ์ด๋‹ค.
(For summarizing not only the preceding words, but also the following words.)

  • Forward RNN $\overrightarrow{f}$ : reads the input from $x_1$ to $x_{T_x}$
    • Calculates the forward hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x})$
  • Backward RNN $\overleftarrow{f}$ : reads the input from $x_{T_x}$ to $x_1$
    • Calculates the backward hidden states $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x})$
  • Concatenate : $h_j = \left[\,\overrightarrow{h}_j^{\top} ;\ \overleftarrow{h}_j^{\top}\right]^{\top}$

⇒ As a result, $h_j$ contains information about both the preceding and the following context.
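
A brief sketch of the annotation computation, reusing the same toy tanh RNN cell as the earlier encoder sketch (an assumption; the paper itself uses gated hidden units); helper names and sizes are illustrative.

```python
import numpy as np

def rnn_states(X, W_xh, W_hh, b_h):
    """All hidden states of a simple tanh RNN run over X, shape (T_x, d_in)."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x_t in X:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        out.append(h)
    return np.stack(out)

def birnn_annotations(X, fwd_params, bwd_params):
    """h_j = [forward h_j ; backward h_j]: each annotation summarizes both the
    words before and the words after position j."""
    h_fwd = rnn_states(X, *fwd_params)               # reads x_1 ... x_{T_x}
    h_bwd = rnn_states(X[::-1], *bwd_params)[::-1]   # reads x_{T_x} ... x_1, re-aligned to j
    return np.concatenate([h_fwd, h_bwd], axis=1)    # shape (T_x, 2 * d_h)

# toy usage
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
X = rng.normal(size=(5, d_in))
make_params = lambda: (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
H = birnn_annotations(X, make_params(), make_params())
print(H.shape)   # (5, 16): one annotation h_j per source word, fed to the attention sketch above
```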
