[NLP Paper Review] NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

fla1512 · October 26, 2022


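Reference for the paper introduction

The paper that first introduced the attention mechanism

The paper presents the idea as soft-alignment rather than attention

Terminology
1. Soft-alignment (attention): the model learns by itself how source words align to target words, without a person telling it explicitly
2. Hard-alignment: a person directly maps each word of the source sentence to a word of the target sentence (e.g. 'I am hungry', '나는 배고프다.' -> I => 나, am => 는)
3. Context vector c: a fixed-length vector that holds the meaning of the sentence
4. Annotation: a hidden state of the encoder, written h in the equations
5. Hidden state: a vector holding information about the t-th input (unless stated otherwise, it means the decoder's hidden state), written s in the equations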

Abstract

Neural machine translation
- A recently proposed approach to machine translation.
- Unlike traditional statistical machine translation,
  - it aims to build a single neural network
    (that can be jointly tuned to maximize translation performance)
- The models recently proposed in this area
  - belong to a family of encoder–decoders
  - encode a source sentence into a fixed-length vector, from which a decoder generates a translation
In this paper, the authors
- conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder–decoder architecture
- and therefore propose to extend it
  - How? By letting the model automatically (soft-)search for parts of a source sentence
    (that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.)
With the new approach,
- on the English-to-French translation task, translation performance comparable to the existing state-of-the-art phrase-based system is achieved
- qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

1. Introduction

Neural machine translation is a newly emerging approach to machine translation, proposed by Kalchbrenner and Blunsom (2013), Sutskever et al. (2014) and Cho et al. (2014b).
A traditional phrase-based translation system consists of many small sub-components that are tuned separately.
In contrast, neural machine translation attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation.
Most of the proposed neural machine translation models belong to a family of encoder-decoders (Sutskever et al., 2014; Cho et al., 2014a)
- they either have an encoder and a decoder for each language,
- or involve a language-specific encoder applied to each sentence
- the encoder neural network reads a source sentence and encodes it into a fixed-length vector
- the decoder then outputs a translation from the encoded vector
- the whole encoder-decoder system, which consists of the encoder and the decoder for a language pair, is jointly trained
  - -> to maximize the probability of a correct translation given a source sentence
A potential issue with the encoder-decoder approach is that
- the neural network needs to compress all the necessary information of a source sentence into a fixed-length vector
  -> this may make it difficult to cope with long sentences (especially those longer than the sentences in the training corpus)
- Cho et al. (2014b) showed that the performance of a basic encoder–decoder deteriorates rapidly as the length of an input sentence increases
To address this issue, the paper introduces an extension to the encoder-decoder model
- it learns to align and translate jointly.
- each time the model generates a word in a translation,
- it (soft-)searches for a set of positions in the source sentence where the most relevant information is concentrated
- the model then predicts a target word based on the context vectors associated with these source positions and all the previously generated target words
The most important distinguishing feature of this approach is that
- it does not attempt to encode the whole input sentence into a single fixed-length vector
- instead, it encodes the input sentence into a sequence of vectors and adaptively chooses a subset of these vectors while decoding the translation
- this frees a neural translation model
  - from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector
    -> the paper shows that this allows the model to cope better with long sentences
In this paper, the authors show that
- jointly learning to align and translate improves translation performance over the basic encoder-decoder approach
- the improvement is more apparent with longer sentences,
- but can be observed with sentences of any length
- on the English-to-French translation task
  - the proposed approach achieves, with a single model, translation performance comparable to the conventional phrase-based system
  - furthermore, qualitative analysis reveals that the proposed model finds a linguistically plausible (soft-)alignment between a source sentence and the corresponding target sentence

2 BACKGROUND: NEURAL MACHINE TRANSLATION

From a probabilistic perspective, translation is equivalent to finding a target sentence y
- target sentence y: the one that maximizes the conditional probability of y given a source sentence x, i.e., $\arg\max_y p(y \mid x)$
- in neural machine translation, a parameterized model is fit to maximize the conditional probability of sentence pairs using a parallel training corpus
- once the conditional distribution is learned by a translation model,
  - given a source sentence, a corresponding translation is generated by searching for the sentence that maximizes the conditional probability
Recently, a number of papers have proposed using neural networks to learn this conditional distribution (e.g., Kalchbrenner and Blunsom, 2013; Cho et al., 2014a; Sutskever et al., 2014; Cho et al., 2014b; Forcada and Neco, 1997)
- this neural machine translation approach typically consists of two components
  1. the first encodes a source sentence x
  2. the second decodes to a target sentence y
  - e.g. two recurrent neural networks (RNN) were used by Cho et al. (2014a) and Sutskever et al. (2014) to encode a variable-length source sentence into a fixed-length vector and to decode that vector into a variable-length target sentence
Neural machine translation has already shown promising results
- Sutskever et al. (2014): neural machine translation based on RNNs with LSTM units achieves close to state-of-the-art performance on an English-to-French translation task
- adding neural components to existing translation systems, for instance to score the phrase pairs in the phrase table (Cho et al., 2014a) or to re-rank candidate translations, has made it possible to surpass the previous state-of-the-art performance level

2.1 RNN ENCODER–DECODER

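A description of the RNN Encoder–Decoder framework
- it is the underlying framework upon which the paper builds a novel architecture that learns to align and translate simultaneously
- in the Encoder–Decoder framework, an encoder reads the input sentence
  (a sequence of vectors $x = (x_1, \dots, x_{T_x})$, into a vector $c$)
- the most common approach is to use an RNN such that
  - $h_t = f(x_t, h_{t-1})$
  - $c = q(\{h_1, \dots, h_{T_x}\})$
  - $h_t$ is the hidden state at time $t$, $c$ is a vector generated from the sequence of hidden states, and $f$ and $q$ are nonlinear functions
- for example, Sutskever et al. (2014) used an LSTM as $f$ and $q(\{h_1, \dots, h_T\}) = h_T$
- the decoder is trained to predict the next word $y_{t'}$
  - given the context vector $c$ and all the previously predicted words $\{y_1, \dots, y_{t'-1}\}$
  - the decoder defines a probability over the translation $y$
    (by decomposing the joint probability into the ordered conditionals: $p(y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c)$)
- with an RNN, each conditional probability is modeled as
  - $p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$
  - $g$ is a nonlinear, potentially multi-layered function
  - that outputs the probability of $y_t$, and $s_t$ is the hidden state of the RNN
  - it should be noted that other architectures (e.g. a hybrid of an RNN and a de-convolutional neural network) can be used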
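
To make the notation above concrete, here is a minimal pure-Python sketch of the basic encoder–decoder idea. It is only an illustration under assumptions: the RNN is a scalar vanilla tanh RNN with made-up weights (`w_x`, `w_h`), $q$ simply returns the last hidden state as in Sutskever et al. (2014), and the per-step probabilities passed to `sequence_log_prob` are invented numbers standing in for the outputs of $g$.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8):
    """One step of a scalar vanilla RNN: h_t = f(x_t, h_{t-1})."""
    return math.tanh(w_x * x_t + w_h * h_prev)

def encode(xs):
    """Run the RNN over the source and return (all hidden states, c).

    Here q({h_1, ..., h_Tx}) = h_Tx, i.e. c is the last hidden state.
    """
    h = 0.0
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h)
        states.append(h)
    return states, states[-1]

def sequence_log_prob(step_probs):
    """log p(y) = sum_t log p(y_t | y_<t, c), the ordered-conditional factorization."""
    return sum(math.log(p) for p in step_probs)

# Toy source sequence (assumed values) and toy per-step probabilities.
states, c = encode([1.0, -0.5, 2.0])
log_p = sequence_log_prob([0.5, 0.25, 0.8])
```

Whatever the source length, `c` stays a single number here, which is exactly the fixed-length bottleneck the paper sets out to remove.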

3 LEARNING TO ALIGN AND TRANSLATE

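This section proposes a novel architecture for neural machine translation
- the architecture uses a bidirectional RNN as the encoder (Sec. 3.2)
- and a decoder that emulates searching through a source sentence during decoding a translation (Sec. 3.1)

Proposes an attention mechanism and a bidirectional RNN on top of the Seq2Seq structure
1. Seq2Seq structure: consists of an encoder and a decoder. The encoder takes the source sentence as input and returns a vector of fixed size. Judging that the fixed-length vector is the problem when translating long sentences, the paper lets the decoder decide which parts of the source sentence to attend to. -> By making the decoder work with an attention mechanism, the encoder is relieved of the burden of encoding all the information in the source sentence into a fixed-length vector. The decoder can also focus only on the information relevant to generating the next target word
2. Bidirectional RNN: uses two RNNs. One RNN reads the input sequence in order and computes the sequence of forward hidden states. The backward RNN reads the sequence in reverse order and computes the backward hidden states. The output hidden states of the two are joined by concatenation. As a result, the concatenated hidden state can focus on the words around word x.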

3.2 ENCODER: BIDIRECTIONAL RNN FOR ANNOTATING SEQUENCES

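Limitations of the RNN and the motivation for the BiRNN

The usual RNN, as described in "$h_t = f(x_t, h_{t-1})$, $c = q(\{h_1, \dots, h_{T_x}\})$",
- reads an input sentence x in order, starting from the first symbol $x_1$ to the last one $x_{T_x}$
- however, in the proposed scheme, the annotation of each word should summarize not only the preceding words but also the following words
- hence the paper proposes to use a bidirectional RNN (BiRNN, Schuster and Paliwal, 1997)
  - which has been used successfully in speech recognition (see, e.g., Graves et al., 2013)

BiRNN

Consists of a forward RNN and a backward RNN
forward RNN: reads the input sequence as it is ordered ($x_1 \to x_{T_x}$) and calculates a sequence of forward hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x})$
backward RNN: reads the sequence in the reverse order ($x_{T_x} \to x_1$) and calculates a sequence of backward hidden states $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x})$
The annotation for each word $x_j$ is obtained by concatenating the forward hidden state and the backward one: $h_j = [\overrightarrow{h}_j^\top ; \overleftarrow{h}_j^\top]^\top$
- in this way, the annotation $h_j$ contains the summaries of both the preceding words and the following words
- due to the tendency of RNNs to better represent recent inputs, the annotation $h_j$ will be focused on the words around $x_j$
- this sequence of annotations is used later by the decoder and the alignment model to compute the context vector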
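
The construction of the annotations can be sketched in a few lines of pure Python. This is a toy under assumptions, not the paper's model: each direction is a scalar tanh RNN with made-up weights (`fwd`, `bwd`), so every annotation is just a pair (forward state, backward state) instead of a concatenation of two 1000-dimensional vectors.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h):
    """Scalar vanilla RNN step: tanh(w_x * x_t + w_h * h_prev)."""
    return math.tanh(w_x * x_t + w_h * h_prev)

def birnn_annotations(xs, fwd=(0.5, 0.8), bwd=(0.3, 0.6)):
    """Return one annotation h_j = (forward h_j, backward h_j) per source position."""
    # Forward RNN reads x_1 -> x_Tx.
    forward, h = [], 0.0
    for x_t in xs:
        h = rnn_step(x_t, h, *fwd)
        forward.append(h)
    # Backward RNN reads x_Tx -> x_1.
    backward, h = [], 0.0
    for x_t in reversed(xs):
        h = rnn_step(x_t, h, *bwd)
        backward.append(h)
    backward.reverse()  # re-align backward states with source positions
    # "Concatenate" the two directions position by position.
    return list(zip(forward, backward))

annotations = birnn_annotations([1.0, -0.5, 2.0])
```

The forward half of `annotations[0]` has seen only the first input, while its backward half has seen the whole sentence, so together they summarize both sides of position 1.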

model architecture

(Code) PyTorch implementation: diagram and code

Image source

Uses a single-layer GRU
Uses a bidirectional RNN
- forward RNN: going over the embedded sentence from left to right (light green)
- backward RNN: going over the embedded sentence from right to left (teal)
Outputs of the RNN
- hidden
  - which acts as our initial hidden state in the decoder
- outputs
  - the stacked forward and backward hidden states for every token in the source sentence.
- because the decoder is not bidirectional, a single context vector z is needed
  - concatenate the final forward hidden state and the final backward hidden state
  - then pass the result through a linear layer g and apply the tanh activation function
- Note
  - this is actually a deviation from the paper. Instead, they feed only the first backward RNN hidden state through a linear layer to get the context vector/decoder initial hidden state. This doesn't seem to make sense to me, so we have changed it.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        # set bidirectional=True to get a bidirectional RNN
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        
        # the concatenated final states of both directions are passed to this fc layer
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        # embed the input x
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        # outputs of the rnn
        outputs, hidden = self.rnn(embedded)
                
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        
        #outputs = [src len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        
        return outputs, hidden

(Code) PyTorch implementation: Attention

How attention is typically implemented

It takes the decoder's previous hidden state s_(t-1) and the encoder hidden states H as input, and outputs the attention vector a_t
a_t has the same length as the source sentence; each element is between 0 and 1, and the elements sum to 1
The energy is computed with a linear layer and a tanh activation function, then multiplied by the v tensor (to reduce it to one score per source position)
The output is the attention vector a_t, which
- has the length of the source sentence
- each element is between 0 and 1 and the entire vector sums to 1.
- represents which words in the source sentence we should pay the most attention to in order to correctly predict the next word to decode
Procedure
1. Compute the energy
  - between the previous decoder hidden state and the encoder hidden states
  - this can also be read as computing how well each encoder hidden state "matches" the previous decoder hidden state
2. Multiply by the v tensor
3. Pass through a softmax layer
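
Step 3 is what guarantees the two properties of a_t listed above. As a small illustration, here is a pure-Python softmax; the input scores are made-up numbers standing in for the output of step 2, and subtracting the maximum is a common numerical-stability trick that the post itself does not mention.

```python
import math

def softmax(scores):
    """Map arbitrary real scores to weights in (0, 1) that sum to 1."""
    m = max(scores)                      # subtract the max to avoid overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One score per source position (assumed values).
weights = softmax([2.0, 0.5, -1.0])
```

The largest score receives the largest weight, but every source position keeps a non-zero share, which is what makes the alignment "soft".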

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2)))  # 1. compute the energy: linear layer followed by tanh
        
        #energy = [batch size, src len, dec hid dim]

        attention = self.v(energy).squeeze(2) # 2. multiply by v to reduce each source position to a single score
        
        #attention= [batch size, src len]
        
        return F.softmax(attention, dim=1) # 3. softmax so the weights lie between 0 and 1 and sum to 1

3.1 DECODER: GENERAL DESCRIPTION

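Each conditional probability is defined as
- $p(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$
- $s_i$ is the RNN hidden state for time $i$
- $s_i = f(s_{i-1}, y_{i-1}, c_i)$
Unlike the existing encoder-decoder approach, here the probability is conditioned on a distinct context vector $c_i$ for each target word $y_i$
The context vector $c_i$ depends on a sequence of annotations $(h_1, \dots, h_{T_x})$
to which the encoder maps the input sentence
Each annotation $h_i$ contains information about the whole input sequence
(with a strong focus on the parts surrounding the i-th word of the input sequence)
The context vector $c_i$ is then computed as a weighted sum of these annotations: $c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$
The weight of each annotation is a softmax over the energies: $\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})$, where $e_{ij} = a(s_{i-1}, h_j)$
The alignment model scores how well the inputs around position j and the output at position i match
- the score is based on the RNN hidden state $s_{i-1}$ and the j-th annotation $h_j$ of the input sentence
The alignment model a is parametrized as a feedforward neural network
- which is jointly trained with all the other components of the proposed system
Unlike in traditional machine translation, the alignment is not considered to be a latent variable
-> instead, the alignment model directly computes a soft alignment
(which allows the gradient of the cost function to be backpropagated through; this gradient can be used to train the alignment model as well as the whole translation model jointly)
The approach of taking a weighted sum of all the annotations can be understood as computing an expected annotation (where the expectation is over possible alignments)
Let $\alpha_{ij}$ be the probability that the target word $y_i$ is aligned to, or translated from, a source word $x_j$
The probability $\alpha_{ij}$, or its associated energy $e_{ij}$, reflects the importance of the annotation $h_j$
(with respect to the previous hidden state $s_{i-1}$ in deciding the next state $s_i$ and generating $y_i$)
-> this implements a mechanism of attention in the decoder (the decoder decides which parts of the source sentence to pay attention to)
By letting the decoder have an attention mechanism, the encoder is relieved of the burden of having to encode all the information in the source sentence into a fixed-length vector
-> with this new approach, the information can be spread throughout the sequence of annotations (which the decoder can selectively retrieve)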
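
The whole attention step for one target position can be written out in pure Python. This is a sketch under assumptions: the alignment model is taken to be a one-hidden-layer feedforward network of the form $e_{ij} = v^\top \tanh(W s_{i-1} + U h_j)$, and all the numbers in `W`, `U`, `v`, `s_prev` and `annotations` are toy values chosen only for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def alignment_score(s_prev, h_j, W, U, v):
    """e_ij = v^T tanh(W s_{i-1} + U h_j): the (assumed) feedforward alignment model a."""
    pre = [a + b for a, b in zip(matvec(W, s_prev), matvec(U, h_j))]
    return dot(v, [math.tanh(p) for p in pre])

def attend(s_prev, annotations, W, U, v):
    """Return (alpha_i, c_i) for one decoder step."""
    energies = [alignment_score(s_prev, h, W, U, v) for h in annotations]
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]
    total = sum(exps)
    alphas = [e / total for e in exps]                     # softmax over source positions
    dim = len(annotations[0])
    context = [sum(a * h[k] for a, h in zip(alphas, annotations))
               for k in range(dim)]                        # c_i = sum_j alpha_ij h_j
    return alphas, context

# Toy values (assumptions for illustration only).
s_prev = [0.1, -0.2]
annotations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W = [[0.2, 0.1], [-0.3, 0.4]]
U = [[0.5, -0.5], [0.3, 0.8]]
v = [1.0, -1.0]
alphas, context = attend(s_prev, annotations, W, U, v)
```

Because the weights are a convex combination, each component of `context` always lies between the smallest and largest value of that component among the annotations. Every operation is differentiable, which is why the alignment model can be trained jointly with the rest of the network.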

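(Code) PyTorch implementation: Decoder

The decoder contains the attention layer
- the attention layer takes the previous hidden state s_{t-1} and all the encoder hidden states H as input and returns the attention vector a_t
- the attention vector is used to create the weighted source vector w_t (w_t is the weighted sum of the encoder hidden states H, with a_t as the weights)
- the weighted source vector w_t, the embedding d(y_t) of the previously predicted word, and the previous decoder hidden state s_{t-1} are passed to the GRU to compute the next hidden state s_t (the embedding d(y_t) and w_t are concatenated before being passed to the GRU)
- the prediction is computed by passing d(y_t), w_t and s_t through the fc layer

Light green, teal: the forward/backward encoder RNNs, which output H
Red: the context vector z
Blue: the decoder RNN, which outputs s_t
Purple: the linear layer f, which outputs ŷ_{t+1}
Orange: the computation of a_t and of w_t, the weighted sum over H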

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        self.output_dim = output_dim
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        # the embedding and the weighted vector are concatenated, then fed in together with the previous hidden state
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        
        # inputs: d(y_t), w_t, s_t
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)
                
        #a = [batch size, src len]
        
        a = a.unsqueeze(1)
        
        #a = [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)
        
        #weighted = [batch size, 1, enc hid dim * 2]
        
        weighted = weighted.permute(1, 0, 2)
        
        #weighted = [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden.squeeze(0)

4 EXPERIMENT SETTINGS

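The approach is evaluated on the task of English-to-French translation
The bilingual, parallel corpora provided by ACL WMT ’14 are used
For comparison, the performance of an RNN Encoder–Decoder is also reported
- both models use the same dataset and the same training procedure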

4.1 DATASET

WMT ’14: contains English-French parallel corpora
The size of the combined corpus is reduced to 348M words
No monolingual data is used
For training, a shortlist of the 30,000 most frequent words in each language is used
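
Building such a shortlist is straightforward. The sketch below uses a tiny made-up corpus and a shortlist of size 3 instead of 30,000; whitespace tokenization and the helper names `build_shortlist` and `map_unknown` are assumptions for illustration, and words outside the shortlist are replaced by a special unknown token, written `[UNK]` here.

```python
from collections import Counter

def build_shortlist(sentences, size):
    """Keep the `size` most frequent whitespace-separated tokens."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return {w for w, _ in counts.most_common(size)}

def map_unknown(sentence, shortlist, unk="[UNK]"):
    """Replace every token outside the shortlist with the unknown token."""
    return [tok if tok in shortlist else unk for tok in sentence.split()]

# Toy corpus (assumed data).
corpus = ["the cat sat", "the dog sat", "the cat ran"]
shortlist = build_shortlist(corpus, size=3)
tokens = map_unknown("the dog ran", shortlist)
```

Here `dog` and `ran` fall outside the shortlist, so the model would only ever see them as `[UNK]`. This is the rare-word limitation that the conclusion leaves for future work.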

4.2 MODELS

Two types of models
1. RNN Encoder–Decoder (RNNencdec, Cho et al., 2014a)
2. RNNsearch
Each model is trained twice
- first with the sentences of length up to 30 words (RNNencdec-30, RNNsearch-30)
- then with the sentences of length up to 50 words (RNNencdec-50, RNNsearch-50)
The encoder and decoder of RNNencdec have 1000 hidden units each
For RNNsearch,
- the encoder consists of forward and backward recurrent neural networks (RNN) -> each with 1000 hidden units
- the decoder has 1000 hidden units
- in both cases, a multilayer network with a single maxout (Goodfellow et al., 2013) hidden layer is used (to compute the conditional probability of each target word)
Each model is trained with a minibatch stochastic gradient descent (SGD) algorithm together with Adadelta
Once a model is trained, beam search is used
-> to find a translation that approximately maximizes the conditional probability
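
To show what beam search does at decoding time, here is a minimal pure-Python sketch. Everything in it is an assumption for illustration: `toy_model` is a hand-written table of next-token log-probabilities standing in for the trained decoder, `</s>` is an assumed end-of-sentence token, and the pruning rule is one simple variant among several.

```python
import math

EOS = "</s>"

def beam_search(next_log_probs, beam_size, max_len):
    """Keep the `beam_size` best partial translations at every step."""
    beams = [([], 0.0)]          # (tokens so far, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_log_probs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Hypotheses that emitted EOS are set aside; the rest are extended further.
            (finished if seq[-1] == EOS else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

def toy_model(prefix):
    """Hand-written next-token distribution (assumed values)."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ["a"]:
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if prefix == ["b"]:
        return {"z": math.log(0.9), "x": math.log(0.1)}
    return {EOS: math.log(1.0)}

greedy_seq, greedy_score = beam_search(toy_model, beam_size=1, max_len=3)
beam_seq, beam_score = beam_search(toy_model, beam_size=2, max_len=3)
```

With a beam of 1 (greedy decoding) the search commits to `a` and ends with probability 0.3, while a beam of 2 keeps `b` alive and finds `b z` with probability 0.36. This is why beam search approximates $\arg\max_y p(y \mid x)$ better than picking the best word at each step.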

5 RESULTS

5.1 QUANTITATIVE RESULTS

Translation performance is measured in BLEU score
In all the cases, the proposed RNNsearch outperforms the conventional RNNencdec
The performance of RNNsearch is as high as that of the conventional phrase-based translation system (Moses) => when only the sentences consisting of known words are considered
- Moses uses a separate monolingual corpus (418M words) in addition to the parallel corpora we used to train the RNNsearch and RNNencdec
One of the motivations behind the proposed approach was the use of a fixed-length context vector in the basic encoder–decoder approach
-> the authors conjectured that this limitation may make the basic encoder–decoder approach underperform with long sentences
- the performance of RNNencdec drops dramatically as the length of the sentences increases
- RNNsearch-30 and RNNsearch-50 are more robust to the length of the sentences
- in particular, RNNsearch-50 shows no performance deterioration even with sentences of length 50 or more
- the superiority of the proposed model over the basic encoder–decoder is further confirmed by the fact that RNNsearch-30 even outperforms RNNencdec-50 (Table 1)
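
For readers unfamiliar with the metric, the idea behind BLEU can be sketched as follows. This is a deliberately simplified sentence-level version with a single reference and n-grams up to 2, written only to illustrate clipped n-gram precision and the brevity penalty; standard BLEU is usually computed at corpus level with n-grams up to 4, so the numbers from this sketch are not comparable to the paper's Table 1.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as often as in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / total

def sentence_bleu(candidate, reference, max_n=2):
    """Geometric mean of the precisions times a brevity penalty (simplified)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity * math.exp(log_avg)

reference = "the cat is on the mat".split()
perfect = sentence_bleu(reference, reference)
repeated = modified_precision("the the the the the the".split(), reference, 1)
too_short = sentence_bleu("the cat".split(), reference)
```

Clipping stops a candidate from scoring well by repeating a frequent word, and the brevity penalty stops it from scoring well by being very short.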

5.2 QUALITATIVE ANALYSIS

5.2.1 ALIGNMENT

The proposed approach provides an intuitive way to inspect the (soft-)alignment between the words in a generated translation and those in a source sentence
- this is done by visualizing the annotation weights $\alpha_{ij}$ from Eq. (6) of the paper
- each row indicates the weights associated with the annotations
- from this we can see which positions in the source sentence were considered more important when generating each target word

The alignment of words between English and French is largely monotonic.
Strong weights lie along the diagonal
A number of non-trivial, non-monotonic alignments are also observed
Adjectives and nouns are typically ordered differently between French and English (Fig. 3 (a))
The model correctly translates such a phrase (in the paper's example, [European Economic Area])
RNNsearch was able to correctly align, jumping over the two words and then looked one word back at a time to complete the whole phrase
The strength of soft-alignment, as opposed to hard-alignment, is evident (Fig. 3 (d))
- e.g. the source phrase [the man] (translated into [l’ homme])
- hard alignment: will map [the] to [l’] and [man] to [homme].
  (not helpful for translation: as one must consider the word following [the] to determine whether it should be translated into [le], [la], [les] or [l’])
- soft-alignment: letting the model look at both [the] and [man] -> this solves the issue
- furthermore, another benefit is that it naturally deals with source and target phrases of different lengths, without requiring a counter-intuitive way of mapping some words to or from nowhere

Image source
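
Reading such an alignment matrix programmatically is simple. In the sketch below, the weights in `alpha` are invented numbers for the [the man] -> [l’ homme] example, not values from the paper; rows correspond to target words and columns to source words, and the helper name `strongest_alignments` is hypothetical.

```python
def strongest_alignments(alpha, source, target):
    """For every target word, report the source word with the largest weight."""
    pairs = []
    for i, row in enumerate(alpha):
        j = max(range(len(row)), key=row.__getitem__)   # most attended source position
        pairs.append((target[i], source[j], row[j]))
    return pairs

source = ["the", "man"]
target = ["l'", "homme"]
# Made-up soft-alignment weights; each row sums to 1.
alpha = [
    [0.6, 0.4],   # "l'" looks mostly at "the" but also at "man"
    [0.1, 0.9],   # "homme" looks mostly at "man"
]
pairs = strongest_alignments(alpha, source, target)
```

Taking the arg max per row recovers something like a hard alignment, but the 0.4 weight on [man] in the first row is precisely the extra information a hard alignment would throw away when choosing between [le], [la], [les] and [l’].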

5.2.2 LONG SENTENCES

The proposed model (RNNsearch) is much better than the conventional model (RNNencdec) at translating long sentences
-> because RNNsearch does not require encoding a long sentence into a fixed-length vector perfectly, but only accurately encoding the parts of the input sentence that surround a particular word
Example 1 from the test set)
- An admitting privilege is the right of a doctor to admit a patient to a hospital or a medical centre to carry out a diagnosis or a procedure, based on his status as a health care worker at a hospital.
- [RNNencdec-50] Un privilège d’admission est le droit d’un médecin de reconnaître un patient à l’hôpital ou un centre médical d’un diagnostic ou de prendre un diagnostic en fonction de son état de santé.
  -> translates the sentence correctly up to [a medical centre]
  -> from there on, it deviates from the original meaning of the source sentence
  - e.g. [based on his status as a health care worker at a hospital] -> [en fonction de son état de santé] (“based on his state of health”)
- [RNNsearch-50] Un privilège d’admission est le droit d’un médecin d’admettre un patient à un hôpital ou un centre médical pour effectuer un diagnostic ou une procédure, selon son statut de travailleur des soins de santé à l’hôpital.
  -> preserves the whole meaning of the input sentence
Example 2 from the test set)
- This kind of experience is part of Disney’s efforts to "extend the lifetime of its series and build new relationships with audiences via digital platforms that are becoming ever more important," he added.
- [RNNencdec-50] Ce type d’expérience fait partie des initiatives du Disney pour "prolonger la durée de vie de ses nouvelles et de développer des liens avec les lecteurs numériques qui deviennent plus complexes.
  -> after generating roughly 30 words, it deviates from the true meaning (the part underlined in the paper)
  -> from that point on, the quality of the translation deteriorates (basic mistakes, e.g. the lack of a closing quotation mark)
- [RNNsearch-50] Ce genre d’expérience fait partie des efforts de Disney pour "prolonger la durée de vie de ses séries et créer de nouvelles relations avec des publics via des plateformes numériques de plus en plus importantes", a-t-il ajouté.
  -> together with the quantitative results presented earlier, these qualitative observations confirm the hypothesis that 'the RNNsearch architecture enables far more reliable translation of long sentences than the RNNencdec model'
- Appendix C contains more sample translations of long sentences

6 RELATED WORK

6.1 LEARNING TO ALIGN

A similar approach of aligning an output symbol with an input symbol was proposed by Graves (2013) in the context of handwriting synthesis
- Handwriting synthesis
  - model is asked to generate handwriting of a given sequence of characters.
  - a mixture of Gaussian kernels is used to compute the weights of the annotations (where the location, width and mixture coefficient of each kernel was predicted from an alignment model.)
  - alignment was restricted to predict the location such that the location increases monotonically.
- The main difference from our approach is that, in Graves (2013),
  - the modes of the weights of the annotations only move in one direction.
  - in the context of machine translation, this is a severe limitation -> (long-distance) reordering is often needed to generate a grammatically correct translation (for instance, English-to-German)
- Our approach, on the other hand,
  - requires computing the annotation weight of every word in the source sentence for each word in the translation.
    -> this drawback is not severe for the translation task, in which most input and output sentences are only 15-40 words (however, it may limit the applicability of the scheme to other tasks)

6.2 NEURAL NETWORKS FOR MACHINE TRANSLATION

Bengio et al. (2003) introduced a neural probabilistic language model
- which uses a neural network to model the conditional probability of a word given a fixed number of the preceding words
- since then, neural networks have been widely used in machine translation
- however, the role of neural networks has been largely limited to providing a single feature to an existing statistical machine translation system, or to re-ranking a list of candidate translations provided by an existing system
For example, Schwenk (2012) proposed using a feedforward neural network
- to compute the score of a pair of source and target phrases -> and to use the score as an additional feature in the phrase-based statistical machine translation system
Kalchbrenner and Blunsom (2013) and Devlin et al. (2014) reported the successful use of neural networks as a sub-component of an existing translation system
Traditionally, a neural network trained as a target-side language model has been used to rescore or rerank a list of candidate translations (see, e.g., Schwenk et al., 2006).
Although the above approaches were shown to improve translation performance over the state-of-the-art machine translation systems, we are more interested in a "more ambitious objective of designing a completely new translation system based on neural networks."
The neural machine translation approach we consider in this paper is therefore a radical departure from these earlier works
Rather than using a neural network as a part of an existing system, our model works on its own and generates a translation directly from a source sentence

7 CONCLUSION

The conventional approach to neural machine translation is called the encoder-decoder approach
- it encodes a whole input sentence into a fixed-length vector (from which a translation is decoded)
- we conjecture that the use of a fixed-length context vector is problematic for translating long sentences -> based on recent empirical studies (Cho et al. (2014b), Pouget-Abadie et al. (2014))
In this paper, we propose a novel architecture that addresses this issue
We extend the basic encoder-decoder
- How? by letting a model (soft-)search for a set of input words, or their annotations computed by an encoder, when generating each target word
  -> this frees the model from having to encode a whole source sentence into a fixed-length vector,
  -> and lets the model focus only on the information relevant to the generation of the next target word
  => this has a major positive impact on the ability of the neural machine translation system to yield good results on longer sentences
  - unlike with traditional machine translation systems, all of the pieces of the translation system (including the alignment mechanism) are jointly trained towards a better log-probability of producing correct translations
We tested the proposed model, called RNNsearch, on the task of English-to-French translation
- the experiments show that RNNsearch outperforms the conventional encoder-decoder model (RNNencdec) regardless of the sentence length (and that it is much more robust to the length of the source sentence)
- through the qualitative analysis, we investigated the (soft-)alignment generated by RNNsearch -> the model can correctly align each target word with the relevant words, or their annotations, in the source sentence as it generates a correct translation
The proposed approach achieved a translation performance comparable to the existing phrase-based statistical machine translation
- this is a striking result, considering that the proposed architecture was only proposed as recently as this year
- we believe the architecture proposed here is a promising step toward better machine translation and a better understanding of natural languages in general
One of the challenges left for the future is
- to better handle unknown or rare words
- this will be required for the model to be more widely used and to match the performance of current state-of-the-art machine translation systems in all contexts
