[Paper Review] Neural Machine Translation by Jointly Learning to Align and Translate 🪟

Jhyunee · February 7, 2024

Summary ❕

💡 Locates the cause of the performance drop on long input sentences
in existing Encoder-Decoder based translation in the "fixed-length" vector,
and proposes a new model architecture.

⇒ Variable-length Encoding & Attention Decoding by using a
context vector

ย 

Review 🗒️

0. Abstract

Previous : Neural Machine Translation with an encoder-decoder
Flow : Source sentence → encoder → "fixed-length" vector → decoder → output
Here, using a "fixed-length" vector causes a bottleneck!

In this paper ; Automatic soft-search

  • Searches the parts of the input sentence that are relevant to the target word being predicted, w/o giving a hard segmentation explicitly.

1. Introduction

Traditional translation : Phrase-based translation system

  • Sub-components tuned separately

Previous neural translation : Train a single, large network, at the sentence level

  • Problem: all the information in the input sentence must be compressed into a single fixed-length vector
    • Causes performance degradation on long sentences

In this paper : uses a "context vector"

Proposes a solution to the information-compression problem that appears as the input sentence gets longer - an automatic soft-search at every generated word

  • Finds the positions where the most relevant information is concentrated ⇒ builds a context vector
    • Contains that position information (source position information)
    • Context vector + previously predicted words → target word prediction!

Mechanism : ✔️

In other words, the Encoder converts the input sentence into a sequence of vectors,
and the Decoder picks out the needed subset of the encoder's output vectors - via the context vector - and uses it.
This yields the following effects.

  1. Better translation performance
  2. Linguistically more appropriate (natural) translations

2. Background : Neural Machine Translation

Translation task == the task of searching for the sentence that maximizes a conditional probability.
: the conditional probability of $Y$ given a source sentence $X$, $p(Y \mid X)$
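
Equivalently, decoding returns the target sentence with the highest conditional probability:

$$\hat{Y} = \arg\max_{Y}\ p(Y \mid X)$$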

2.1 RNN Encoder-Decoder

Encoder :

Takes the input sentence and produces the context vector $c$ as follows.

  • Input sentence → Encoder → Sequence of vectors
    $X = (x_1, \dots, x_{T_x})$ (variable length) → Encoder → $c$ (context vector)

    RNN :

    $$(1)\quad h_t = f(x_t,\ h_{t-1}), \qquad c = q(\{h_1, ..., h_{T_x}\})$$

    where $h_t$ : hidden state at time $t$
    $c$ : context vector generated from the hidden states
    $f, q$ : some nonlinear functions
    e.g.) $f$ can be an LSTM
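
Below is a minimal numpy sketch of equation (1), assuming a plain tanh cell for $f$ and "take the last hidden state" for $q$ (the paper leaves both as generic nonlinear functions, e.g. an LSTM for $f$); all parameter names and sizes are illustrative.

```python
import numpy as np

def rnn_encode(X, W_xh, W_hh, b_h):
    """Eq. (1): h_t = f(x_t, h_{t-1}) with f chosen as a tanh RNN cell.
    X has shape (T_x, d_in); returns all hidden states, shape (T_x, d_h)."""
    h = np.zeros(W_hh.shape[0])
    hiddens = []
    for x_t in X:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # h_t = f(x_t, h_{t-1})
        hiddens.append(h)
    return np.stack(hiddens)

# toy usage with random parameters
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
X = rng.normal(size=(5, d_in))                      # a 5-word "sentence"
hs = rnn_encode(X, rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
c = hs[-1]                                          # one choice of q: c = h_{T_x}
```

In the basic encoder-decoder, this single vector $c$ is all the decoder ever sees of the source sentence, which is exactly the fixed-length bottleneck the paper targets.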

ย 

Decoder :

์ž…๋ ฅ์„ ๋ฐ›์•„ ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋„๋ก ํ•™์Šต๋œ๋‹ค.

  • Trained to predict the next word $y_t$
    (given $c$ & the previously predicted words $(y_1, \dots, y_{t-1})$)

    That is, the Decoder defines a probability over the translation $Y$; the conditional probability of predicting (translating into) each word is defined as below.

    $$(2)\quad p(Y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, ..., y_{t-1}\},\ c)$$

    where $Y = (y_1, \dots, y_{T_y})$. With an RNN :

    $$(3)\quad p(y_t \mid \{y_1, ..., y_{t-1}\},\ c) = g(y_{t-1},\ s_t,\ c)$$

    where $g$ : a nonlinear, potentially multi-layered function that computes the probability of $y_t$
    $s_t$ : hidden state of the RNN
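
A hedged sketch of equation (3): one decoder step scores the next word from $(y_{t-1}, s_t, c)$. A tanh state update and a softmax output layer stand in here for the "nonlinear, potentially multi-layered" $g$; the parameter names are illustrative. Multiplying these per-step probabilities over $t$ gives $p(Y)$ as in equation (2).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(y_prev, s_prev, c, W_ys, W_ss, W_cs, W_out, b_s, b_out):
    """Eq. (3): p(y_t | y_1..y_{t-1}, c) = g(y_{t-1}, s_t, c).
    y_prev: embedding of the previous target word, s_prev: previous RNN state,
    c: the (fixed) context vector from the encoder."""
    s_t = np.tanh(W_ys @ y_prev + W_ss @ s_prev + W_cs @ c + b_s)   # new hidden state
    logits = W_out @ np.concatenate([y_prev, s_t, c]) + b_out       # g(y_{t-1}, s_t, c)
    return softmax(logits), s_t                                     # distribution over the vocabulary
```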


3. Learning to Align and Translate

  • Bidirectional RNN Encoder & Searching Decoder

3.1 Decoder : General Description

Conditional probability : each conditional probability in $(2)$ is now defined as

$$(4)\quad p(y_i \mid \{y_1, ..., y_{i-1}\},\ X) = g(y_{i-1},\ s_i,\ c_i)$$

where $s_i$ : RNN hidden state at time $i$,

$$s_i = f(s_{i-1},\ y_{i-1},\ c_i)$$

์‹ (2)(2)์™€์˜ ์ฐจ์ด์  : cic_i
๊ฐ target word yiy_i๋งˆ๋‹ค ์กฐ๊ฑด๋ถ€ํ™•๋ฅ ์„ ์ •์˜ํ•˜๋Š” cc๊ฐ€, time ii๋งˆ๋‹ค ๊ฐœ๋ณ„์ ์œผ๋กœ ์ง€์ •๋˜์–ด ์žˆ๋‹ค.

Context vector : $c_i$

$c_i$ is determined by the encoder outputs (the mapped input sentence), $(h_1, \dots, h_{T_x})$, as follows.

Here, each annotation $h_i$ carries information about the whole input sequence
(containing a strong focus on the surroundings of the $i$-th word).

  • $c_i$ is computed as a weighted sum over the annotations $h_j$:

    $$c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j$$

    The weight $\alpha_{ij}$ is the probability that the target word $y_i$ is aligned to (translated from) the source word $x_j$.

    In other words, it expresses the 'importance' of $h_j$ when generating $y_i$, and that importance enters $c_i$ as a weight. ($\alpha_{ij}$, together with its energy $e_{ij}$, reflects the importance of the annotation $h_j$.)

    This works because the context vector $c_i$ carries information about how much relevant information the word $x_j$ at position $j$ holds for the $i$-th output word $y_i$!

    ⇒ In this way, as mentioned under Mechanism in Section 1,
    the Decoder appears to pick out and use the important (needed) subset of the encoder's output vectors.

    • The weight $\alpha_{ij}$ is computed as follows (see also the code sketch below).
    • Here, $e_{ij} = a(s_{i-1}, h_j)$, where $a$ is a feedforward neural network (the alignment model).

      $$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

    This is how the decoder's attention mechanism is implemented:
    the Decoder decides (or appears to decide) which parts of the input sentence to focus on.

โœ”๏ธ Encoder๊ฐ€ ์ž…๋ ฅ๋œ ์ „์ฒด ๋ฌธ์žฅ์„ fixed-length๋กœ ์••์ถ•ํ•˜๋Š” ๋ถ€๋‹ด์„ ๋œ์–ด์ค€๋‹ค.

ย 

3.2 Encoder : Bidirectional RNN for Annotating Sequences

BiRNN :

๊ฐ ๋‹จ์–ด๊ฐ€ ์ž๊ธฐ ์ž์‹ ์˜ ์ด์ „ ๋‹จ์–ด๋“ค์— ๋Œ€ํ•œ ์ •๋ณด๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์ดํ›„ ๋‹จ์–ด๋“ค์— ๋Œ€ํ•œ ์ •๋ณด๊นŒ์ง€ ์–ป์„ ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•จ์ด๋‹ค.
(For summarizing not only the preceding words, but also the following words.)

  • Forward RNN $\overrightarrow{f}$ : reads the input from $x_1$ to $x_{T_x}$
    • Calculates the forward hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x})$
  • Backward RNN $\overleftarrow{f}$ : reads the input from $x_{T_x}$ to $x_1$
    • Calculates the backward hidden states $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x})$
  • Concatenate : $h_j = \left[\,\overrightarrow{h}_j^{\top} ;\ \overleftarrow{h}_j^{\top}\right]^{\top}$

⇒ As a result, $h_j$ contains information about both the preceding and the following context.
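
A brief sketch of the annotation computation, reusing the same toy tanh RNN cell as the earlier encoder sketch (an assumption; the paper itself uses gated hidden units); helper names and sizes are illustrative.

```python
import numpy as np

def rnn_states(X, W_xh, W_hh, b_h):
    """All hidden states of a simple tanh RNN run over X, shape (T_x, d_in)."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x_t in X:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        out.append(h)
    return np.stack(out)

def birnn_annotations(X, fwd_params, bwd_params):
    """h_j = [forward h_j ; backward h_j]: each annotation summarizes both the
    words before and the words after position j."""
    h_fwd = rnn_states(X, *fwd_params)               # reads x_1 ... x_{T_x}
    h_bwd = rnn_states(X[::-1], *bwd_params)[::-1]   # reads x_{T_x} ... x_1, re-aligned to j
    return np.concatenate([h_fwd, h_bwd], axis=1)    # shape (T_x, 2 * d_h)

# toy usage
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
X = rng.normal(size=(5, d_in))
make_params = lambda: (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
H = birnn_annotations(X, make_params(), make_params())
print(H.shape)   # (5, 16): one annotation h_j per source word, fed to the attention sketch above
```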
