Neural Machine Translation By Jointly Learning To Align And Translate

놔놔 · August 19, 2024

Background: Neural machine translation

RNN Encoder-Decoder

Encoder reads the input sentence, a sequence of vectors $\mathbf{x}=(x_1,...,x_{T_x})$, into a vector $c$. The most common approach is to use an RNN such that

$$h_t=f(x_t,h_{t-1}), \qquad c=q(\lbrace h_1,...,h_{T_x}\rbrace)$$

$h_t$ is a hidden state at time $t$, and $c$ is a vector generated from the sequence of the hidden states. $f$ and $q$ are some nonlinear functions.
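A minimal PyTorch sketch of this encoder; using a GRU cell as the nonlinear $f$ and taking the last hidden state as $q$ are just illustrative choices, not prescribed by the equations above:

```python
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    """Reads x = (x_1, ..., x_Tx) into hidden states h_t and a single context vector c."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)  # stands in for the nonlinear f
        self.hidden_size = hidden_size

    def forward(self, x):                      # x: (T_x, batch, input_size)
        h = x.new_zeros(x.size(1), self.hidden_size)
        states = []
        for x_t in x:                          # h_t = f(x_t, h_{t-1})
            h = self.cell(x_t, h)
            states.append(h)
        states = torch.stack(states)           # (T_x, batch, hidden_size)
        c = states[-1]                         # here q({h_1, ..., h_Tx}) = h_Tx
        return states, c
```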

Decoder is often trained to predict the next word $y_{t'}$ given the context vector $c$ and all the previously predicted words $y_1,...,y_{t'-1}$.
In other words, the decoder defines a probability over the translation $\mathbf{y}$ by decomposing the joint probability into the ordered conditionals:

$$p(\mathbf{y})=\prod_{t=1}^{T_y}p(y_t \mid \lbrace y_1,...,y_{t-1}\rbrace,c)$$

where $\mathbf{y}=(y_1,...,y_{T_y})$. With an RNN, each conditional probability is modeled as

$$p(y_t \mid \lbrace y_1,...,y_{t-1} \rbrace,c)=g(y_{t-1},s_t,c)$$

where $g$ is a nonlinear, potentially multi-layered, function that outputs the probability of $y_t$, and $s_t$ is the hidden state of the RNN.
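A rough sketch of $g$, assuming it is just an embedding plus one linear layer followed by a softmax over the target vocabulary (the paper allows a deeper output layer, so this is only illustrative):

```python
import torch
import torch.nn as nn

class DecoderOutput(nn.Module):
    """g(y_{t-1}, s_t, c): a distribution over the target vocabulary."""
    def __init__(self, vocab_size, embed_size, hidden_size, context_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.out = nn.Linear(embed_size + hidden_size + context_size, vocab_size)

    def forward(self, y_prev, s_t, c):         # y_prev: (batch,) token ids
        feats = torch.cat([self.embed(y_prev), s_t, c], dim=-1)
        return torch.softmax(self.out(feats), dim=-1)   # p(y_t | {y_1,...,y_{t-1}}, c)
```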

Learning to align and translate

The new architecture consists of a bidirectional RNN as the encoder and a decoder that emulates searching through a source sentence while decoding a translation.

Decoder: General Description

We define each conditional probability as

$$p(y_i \mid y_1,...,y_{i-1},\mathbf{x})=g(y_{i-1},s_i,c_i)$$

where $s_i$ is an RNN hidden state for time $i$, computed by

$$s_i=f(s_{i-1},y_{i-1},c_i).$$

The difference from the existing encoder-decoder approach is that the probability is conditioned on a distinct context vector $c_i$ for each target word $y_i$.
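A hypothetical single decoder step that wires $f$ and $g$ together; feeding the GRU cell the concatenation of the embedded $y_{i-1}$ and $c_i$ is an assumption made here for simplicity:

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """One step: s_i = f(s_{i-1}, y_{i-1}, c_i), then p(y_i | ...) = g(y_{i-1}, s_i, c_i)."""
    def __init__(self, vocab_size, embed_size, hidden_size, context_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.cell = nn.GRUCell(embed_size + context_size, hidden_size)              # f
        self.out = nn.Linear(embed_size + hidden_size + context_size, vocab_size)   # g

    def forward(self, y_prev, s_prev, c_i):    # y_prev: (batch,), s_prev: (batch, hidden)
        emb = self.embed(y_prev)
        s_i = self.cell(torch.cat([emb, c_i], dim=-1), s_prev)
        p_yi = torch.softmax(self.out(torch.cat([emb, s_i, c_i], dim=-1)), dim=-1)
        return p_yi, s_i
```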

The context vector $c_i$ depends on a sequence of annotations $(h_1,...,h_{T_x})$ to which an encoder maps the input sentence. Each annotation $h_i$ contains information about the whole input sequence with a strong focus on the parts surrounding the $i$-th word of the input sequence.

The context vector $c_i$ is computed as a weighted sum of these annotations $h_j$:

$$c_i=\sum_{j=1}^{T_x}\alpha_{ij}h_j$$

The weight $\alpha_{ij}$ of each annotation $h_j$ is computed by

$$\alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}$$

where $e_{ij}=a(s_{i-1},h_j)$; $a$ is an alignment model which scores how well the inputs around position $j$ and the output at position $i$ match. The score is based on the RNN hidden state $s_{i-1}$ and the $j$-th annotation $h_j$ of the input sentence.
The probability $\alpha_{ij}$, or its associated energy $e_{ij}$, reflects the importance of the annotation $h_j$ with respect to the previous hidden state $s_{i-1}$ in deciding the next state $s_i$ and generating $y_i$.
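A sketch of this attention computation; the paper parameterizes the alignment model $a$ as a small feed-forward network, $e_{ij}=v_a^{\top}\tanh(W_a s_{i-1}+U_a h_j)$, while the layer names and sizes below are placeholders:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """e_ij = a(s_{i-1}, h_j), alpha_ij = softmax_j(e_ij), c_i = sum_j alpha_ij * h_j."""
    def __init__(self, hidden_size, annot_size, attn_size):
        super().__init__()
        self.W = nn.Linear(hidden_size, attn_size, bias=False)
        self.U = nn.Linear(annot_size, attn_size, bias=False)
        self.v = nn.Linear(attn_size, 1, bias=False)

    def forward(self, s_prev, annotations):
        # s_prev: (batch, hidden_size), annotations: (batch, T_x, annot_size)
        e = self.v(torch.tanh(self.W(s_prev).unsqueeze(1) + self.U(annotations)))   # (batch, T_x, 1)
        alpha = torch.softmax(e.squeeze(-1), dim=-1)                 # weights over source positions j
        c_i = torch.bmm(alpha.unsqueeze(1), annotations).squeeze(1)  # weighted sum of the h_j
        return c_i, alpha
```

Because $a$ is differentiable, the alignment weights are learned jointly with the rest of the network by backpropagation.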

The decoder decides which parts of the source sentence to pay attention to.

With this attention mechanism, we relieve the encoder from the burden of having to encode all the information in the source sentence into a fixed-length vector; the information can instead be spread throughout the sequence of annotations.

Encoder: Bidirectional RNN for annotating sequences

A BiRNN consists of forward and backward RNNs. We obtain an annotation for each word $x_j$ by concatenating the forward hidden state $\overrightarrow{h}_j$ and the backward one $\overleftarrow{h}_j$, i.e., $h_j=\lbrack \overrightarrow{h}_j^{\top};\overleftarrow{h}_j^{\top}\rbrack^{\top}$.
In this way, the annotation $h_j$ contains the summaries of both the preceding words and the following words. Due to the tendency of RNNs to better represent recent inputs, the annotation $h_j$ will be focused on the words around $x_j$.
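A sketch of the annotation encoder using a bidirectional GRU, whose per-position output is already the concatenation $\lbrack\overrightarrow{h}_j;\overleftarrow{h}_j\rbrack$; the embedding layer and sizes are placeholders:

```python
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Produces annotations h_j = [forward h_j ; backward h_j] for every source word x_j."""
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.birnn = nn.GRU(embed_size, hidden_size,
                            batch_first=True, bidirectional=True)

    def forward(self, src_ids):                # src_ids: (batch, T_x) token ids
        embedded = self.embed(src_ids)         # (batch, T_x, embed_size)
        annotations, _ = self.birnn(embedded)  # (batch, T_x, 2 * hidden_size)
        # annotations[:, j] concatenates the forward and backward hidden states at position j
        return annotations
```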
