Incorporating Global Visual Features into Attention-Based Neural Machine Translation

rhye · February 1, 2023


Calixto, I., Liu, Q., & Campbell, N. (2017). Incorporating global visual features into attention-based neural machine translation. arXiv preprint arXiv:1701.06521.

Abstract

The paper proposes attention-based multi-modal Neural Machine Translation (NMT) models.
- Three models are presented, each integrating visual features into a different part of the encoder-decoder. Global image features are extracted with a pre-trained CNN and are then
  1. incorporated as words in the source sentence,
  2. used to initialize the encoder hidden state, or
  3. used to initialize the decoder hidden state.
- The paper compares which of these approaches performs best.

+ It also examines how synthetic multi-modal, multilingual data (augmented data) affects the performance of the multi-modal models.

Introduction

The main goal of the paper is to build an end-to-end multi-modal machine translation (MMT) model that integrates visual features into an existing attention-based NMT model.

🤸🏻 Contributions

Builds attention-based NMT models in which visual features are integrated into different parts of the encoder-decoder.

Explores how synthetic multi-modal & multilingual data affects MMT models.

Shows that images provide useful information to NMT models.

Attention-based NMT

Text-only attention-based NMT

Problem Statement

Given a source sentence $X = (x_1, x_2, ..., x_N)$ and its translation $Y=(y_1,y_2,...,y_M)$ , an NMT model learns $P(Y|X)$ in order to translate X into Y.

Architecture

encoder ; bidirectional RNN with GRU
- forward RNN $\overrightarrow{\Phi}_{enc}$ : reads the source sequence in order and produces a forward annotation vector at each encoder time step ( $\overrightarrow{h}_1, \overrightarrow{h}_2, ..., \overrightarrow{h}_N$ )
- backward RNN $\overleftarrow{\Phi}_{enc}$ : reads the source sequence in reverse and produces a backward annotation vector at each encoder time step ( $\overleftarrow{h}_1, \overleftarrow{h}_2, ..., \overleftarrow{h}_N$ )
  $\overrightarrow{h}_i = \overrightarrow{\Phi}_{enc}(W_x[x_i], \overrightarrow{h}_{i-1}), \\ \overleftarrow{h}_i = \overleftarrow{\Phi}_{enc}(W_x[x_i], \overleftarrow{h}_{i+1})$
- final annotation vector : the concatenation of the forward and backward annotation vectors, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$

  → as a result, each source sentence is encoded as a sequence of annotation vectors $h=(h_1,h_2,...,h_N)$
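The bidirectional encoding above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the recurrent cell is a plain tanh RNN rather than the GRU used in the paper, the sizes are toy values, and names such as `rnn_step` and `encode` are made up for this sketch.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def rnn_step(W_in, W_rec, x, h_prev):
    # Simplified cell (assumption): h = tanh(W_in x + W_rec h_prev).
    # The paper uses a GRU in place of this.
    a = matvec(W_in, x)
    b = matvec(W_rec, h_prev)
    return [math.tanh(p + q) for p, q in zip(a, b)]

def encode(embeddings, fwd, bwd, hidden_size):
    n = len(embeddings)
    h_fwd, h = [], [0.0] * hidden_size
    for i in range(n):                    # left to right: h_i depends on h_{i-1}
        h = rnn_step(fwd[0], fwd[1], embeddings[i], h)
        h_fwd.append(h)
    h_bwd, h = [None] * n, [0.0] * hidden_size
    for i in reversed(range(n)):          # right to left: h_i depends on h_{i+1}
        h = rnn_step(bwd[0], bwd[1], embeddings[i], h)
        h_bwd[i] = h
    # Annotation vector h_i is the concatenation of both directions.
    return [f + b for f, b in zip(h_fwd, h_bwd)]

rng = random.Random(0)
emb_size, hidden_size, n_words = 4, 3, 5   # toy sizes, not the paper's
fwd = (rand_matrix(hidden_size, emb_size, rng), rand_matrix(hidden_size, hidden_size, rng))
bwd = (rand_matrix(hidden_size, emb_size, rng), rand_matrix(hidden_size, hidden_size, rng))
embeddings = [[rng.uniform(-1, 1) for _ in range(emb_size)] for _ in range(n_words)]
annotations = encode(embeddings, fwd, bwd, hidden_size)
```

Each annotation vector has twice the size of a single direction's hidden state, which is why the decoder-side matrices that consume $h_i$ operate on the concatenated dimensionality.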
decoder ; computed with an attention mechanism, conditioned on the previously emitted target words and the source sentence
- at each time step t, a time-dependent context vector $c_t$ is computed
  : based on the annotation vectors $h$ , the decoder's previous hidden state $s_{t-1}$ , and the target word $\tilde{y}_{t-1}$ emitted at the previous time step
- alignment model : a single-layer feed-forward network that scores how relevant the encoder information at time $i$ is to the decoder at time $t$
  1. compute the expected alignment $e_{t,i}$ from the source annotation vector $h_i$ at encoder time step $i$ and the decoder's previous hidden state $s_{t-1}$
    $e_{t,i} = v_a^T \tanh(U_a s_{t-1} + W_a h_i)$
  2. the alignment scores are normalized into probabilities by the softmax below
    $\alpha_{t,i}=\frac{\exp(e_{t,i})}{\sum_{j=1}^{N}\exp(e_{t,j})}$
    ( $\alpha_{t,i}$ = the model's attention weights )
  3. compute the time-dependent context vector $c_t$
    $c_t = \sum_{i=1}^N\alpha_{t,i}h_i$
  4. use $c_t$ to compute the decoder hidden state $s_t$
    $s_t = \Phi_{dec}(s_{t-1}, W_y[\tilde{y}_{t-1}], c_t)$
    ( $s_{t-1}$ = the decoder's previous hidden state, $W_y[\tilde{y}_{t-1}]$ = embedding of the word emitted at the previous time step, $c_t$ = updated time-dependent context vector )
  5. the initial decoder hidden state $s_0$ is computed by a single-layer feed-forward network fed with the concatenation of the final hidden states of the encoder's forward RNN ( $\overrightarrow{\Phi}_{enc}$ ) and backward RNN ( $\overleftarrow{\Phi}_{enc}$ )
    $s_0 = \tanh(W_{di}[\overleftarrow{h}_1;\overrightarrow{h}_N]+b_{di})$
    ( $W_{di}, b_{di}$ are both model parameters )
  - RNNs chronically suffer from the long-term dependency problem → when initializing the decoder hidden state, the idea is to use measures such as strongly emphasizing the representations of the first and last tokens
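Steps 1 to 3 of the alignment model can be sketched as follows. This is a minimal illustration in plain Python, assuming toy dimensions and hand-picked parameters; the function name `attention` and the sample values are not from the paper.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attention(annotations, s_prev, U_a, W_a, v_a):
    # Step 1: e_{t,i} = v_a^T tanh(U_a s_{t-1} + W_a h_i)
    Us = matvec(U_a, s_prev)
    scores = []
    for h_i in annotations:
        Wh = matvec(W_a, h_i)
        scores.append(sum(v * math.tanh(u + w) for v, u, w in zip(v_a, Us, Wh)))
    # Step 2: softmax over source positions (max subtracted for numerical stability)
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]
    # Step 3: c_t = sum_i alpha_{t,i} h_i
    dim = len(annotations[0])
    context = [sum(a * h[k] for a, h in zip(alphas, annotations)) for k in range(dim)]
    return alphas, context

# Toy inputs (assumptions for illustration only).
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # h_1..h_3
s_prev = [0.5, -0.5]                                  # s_{t-1}
identity = [[1.0, 0.0], [0.0, 1.0]]
alphas, context = attention(annotations, s_prev, identity, identity, [1.0, 1.0])
```

The attention weights form a probability distribution over the source positions, so the context vector is always a convex combination of the annotation vectors.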

The multi-modal models extend the attention-based NMT framework by adding a visual component that incorporates image features.

extracting image features

Images are fed into a pretrained VGG19 network to extract image features.

incorporating images into the attentive NMT framework ; 3 methods

images as source words : $IMG_W$
: using an image as words in the source sentence
- the image is treated as the first and/or last word of the sentence and fed to the model, so that the attention model learns when to attend to the image
- for a global image feature $q \in \mathbb{R}^{4096}$ ,
  $d = W_I^2 \cdot (W_I^1 \cdot q + b_I^1) + b_I^2$
  ( $W_I^1 \in \mathbb{R}^{4096 \times 4096}, W_I^2 \in \mathbb{R}^{4096 \times d_x}$ = image transformation matrices,
  $b_I^1 \in \mathbb{R}^{4096}, b_I^2 \in \mathbb{R}^{d_x}$ = bias vectors, $d_x$ = source words vector space dimensionality)
- the resulting d is used as a source word
  - trained with the image as the first word only : $IMG_{1W}$
  - trained with the image as both the first and the last word : $IMG_{2W}$
- intuition
  - image as the first word → the forward RNN fuses the image with the source sentence
  - image as the last word → the backward RNN fuses the image with the source sentence
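The two $IMG_W$ variants can be sketched as a projection followed by a simple list operation. This is a sketch under assumptions: the dimensions are shrunk (8 instead of 4096, and a toy $d_x$ of 4), the weights are random, and the helper names `image_to_word` and `add_image_words` are invented for illustration.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def affine(W, b, x):
    return [y + c for y, c in zip(matvec(W, x), b)]

def image_to_word(q, W1, b1, W2, b2):
    # d = W_I^2 (W_I^1 q + b_I^1) + b_I^2, with no non-linearity in between.
    return affine(W2, b2, affine(W1, b1, q))

def add_image_words(word_embeddings, d, variant):
    if variant == "IMG_1W":               # image as the first source word
        return [d] + word_embeddings
    if variant == "IMG_2W":               # image as the first and last source word
        return [d] + word_embeddings + [d]
    raise ValueError("unknown variant: " + variant)

rng = random.Random(0)
q_dim, d_x, n_words = 8, 4, 3             # toy sizes; the paper uses q in R^4096
rand = lambda r, c: [[rng.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]
# Matrices are stored as (output rows x input columns) so that matvec applies directly.
W1, b1 = rand(q_dim, q_dim), [0.0] * q_dim
W2, b2 = rand(d_x, q_dim), [0.0] * d_x
q = [rng.uniform(0, 1) for _ in range(q_dim)]
words = [[rng.uniform(-1, 1) for _ in range(d_x)] for _ in range(n_words)]

d = image_to_word(q, W1, b1, W2, b2)
seq_1w = add_image_words(words, d, "IMG_1W")
seq_2w = add_image_words(words, d, "IMG_2W")
```

Because d lives in the same space as the word embeddings, the extended sequence can be passed to the bidirectional encoder without any other change to the model.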
images for encoder initialization : $IMG_E$
: using an image to initialize the source language encoder
- in the original NMT model, the encoder hidden states are initialized with zero vectors
  → instead, two new single-layer feed-forward neural networks compute the initial hidden states of the forward RNN and the backward RNN
- for a global image feature $q \in \mathbb{R}^{4096}$ ,
  $d = W_I^2 \cdot (W_I^1 \cdot q + b_I^1) + b_I^2$
  ( $W_I^1 \in \mathbb{R}^{4096 \times 4096}, W_I^2$ = image transformation matrices,
  $b_I^1 \in \mathbb{R}^{4096}, b_I^2$ = bias vectors )

  → here, however, $W_I^2, b_I^2$ project d to the dimensionality of the encoder hidden states
- based on this d, the two new single-layer feed-forward neural networks compute the initial hidden states of the forward RNN and the backward RNN
  $\overrightarrow{h}_{init} = \tanh(W_f d+b_f), \\ \overleftarrow{h}_{init} = \tanh(W_b d+b_b)$
  ( $W_f, W_b$ = multimodal projection matrices : they project the image feature d into the dimensionality of the encoder's forward and backward hidden states, respectively; $b_f, b_b$ = bias vectors )
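A minimal sketch of the $IMG_E$ initialization, assuming d has already been computed from the image feature. The sizes and random weights are toy values, and `init_encoder_states` is a name chosen for this sketch, not something from the paper.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def affine(W, b, x):
    return [y + c for y, c in zip(matvec(W, x), b)]

def init_encoder_states(d, W_f, b_f, W_b, b_b):
    # Forward initial state uses (W_f, b_f); backward initial state uses (W_b, b_b).
    h_fwd_init = [math.tanh(v) for v in affine(W_f, b_f, d)]
    h_bwd_init = [math.tanh(v) for v in affine(W_b, b_b, d)]
    return h_fwd_init, h_bwd_init

rng = random.Random(0)
d_size, hidden_size = 6, 3                # toy sizes (assumption)
rand = lambda r, c: [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
d = [rng.uniform(-1, 1) for _ in range(d_size)]
W_f, b_f = rand(hidden_size, d_size), [0.0] * hidden_size
W_b, b_b = rand(hidden_size, d_size), [0.0] * hidden_size

h_fwd_init, h_bwd_init = init_encoder_states(d, W_f, b_f, W_b, b_b)
```

These two vectors take the place of the zero vectors that would otherwise start the forward and backward passes, so every annotation vector is conditioned on the image.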
images for decoder initialization : $IMG_D$
: using an image to initialize the target language decoder
- decoder hidden state initialization (originally)
  $s_0 = \tanh(W_{di}[\overleftarrow{h}_1;\overrightarrow{h}_N]+b_{di})$
  : the final hidden states of the encoder's forward RNN ( $\overrightarrow{\Phi}_{enc}$ ) and backward RNN ( $\overleftarrow{\Phi}_{enc}$ ), i.e. $\overrightarrow{h}_N$ and $\overleftarrow{h}_1$ , are concatenated and passed through a feed-forward layer
- decoder hidden state initialization with the image feature added
  $s_0 = \tanh(W_{di}[\overleftarrow{h}_1;\overrightarrow{h}_N]+W_m d+b_{di})$
  ( $W_m$ = multimodal projection matrix : it projects the image feature d into the dimensionality of the decoder hidden states )
- for a global image feature $q \in \mathbb{R}^{4096}$ ,
  $d = W_I^2 \cdot (W_I^1 \cdot q + b_I^1) + b_I^2$
  ( $W_I^1 \in \mathbb{R}^{4096 \times 4096}, W_I^2$ = image transformation matrices,
  $b_I^1 \in \mathbb{R}^{4096}, b_I^2$ = bias vectors )

  → here too, $W_I^2, b_I^2$ project d to the dimensionality of the decoder hidden states
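The difference between the text-only and the $IMG_D$ initialization of $s_0$ is a single additive term, as the sketch below illustrates. The sizes and weights are toy assumptions and `init_decoder_state` is a hypothetical helper name.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def affine(W, b, x):
    return [y + c for y, c in zip(matvec(W, x), b)]

def init_decoder_state(h_bwd_first, h_fwd_last, W_di, b_di, d=None, W_m=None):
    # Text-only: s_0 = tanh(W_di [h_bwd_1; h_fwd_N] + b_di)
    pre = affine(W_di, b_di, h_bwd_first + h_fwd_last)
    if d is not None:
        # IMG_D: add the projected image feature W_m d before the tanh.
        pre = [p + m for p, m in zip(pre, matvec(W_m, d))]
    return [math.tanh(p) for p in pre]

rng = random.Random(0)
enc_size, dec_size, d_size = 3, 4, 5      # toy sizes (assumption)
rand = lambda r, c: [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
h_bwd_first = [rng.uniform(-1, 1) for _ in range(enc_size)]
h_fwd_last = [rng.uniform(-1, 1) for _ in range(enc_size)]
d = [rng.uniform(-1, 1) for _ in range(d_size)]
W_di, b_di = rand(dec_size, 2 * enc_size), [0.0] * dec_size
W_m = rand(dec_size, d_size)

s0_text = init_decoder_state(h_bwd_first, h_fwd_last, W_di, b_di)
s0_img = init_decoder_state(h_bwd_first, h_fwd_last, W_di, b_di, d, W_m)
```

If $W_m$ were all zeros, the two initial states would coincide, which makes clear that the image only shifts the decoder's starting point and does not change the rest of the architecture.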

Dataset

Flickr30K dataset → consists of 30K images, each with 5 English descriptions
- image split
  ( train : val : test = 29K : 1014 : 1K )
- datasets with translated English descriptions : $M30K_T, M30K_C$
  - $M30K_T$
    : one English description per image is translated into German by a professional translator. One English-German pair per image
  - $M30K_C$
    : German descriptions are collected independently of the English ones. Five English-German pairs per image
The full $M30K_T$ training set is used for training.
+ To study how extra data affects the models, a text-only NMT baseline is trained on the $M30K_T$ text data only → this translator is used for back-translation → the $M30K_C$ text data is back-translated to augment the training data ( English → German → English )
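The back-translation step can be illustrated with a toy example for one direction (English → German training data built from German-only text). Everything here is a stand-in: `toy_translate` is a dictionary lookup playing the role of the trained baseline translator, and the sentences are invented, not taken from Multi30k.

```python
def back_translate(target_sentences, translate_to_source):
    """Pair each real target sentence with a machine-generated source sentence."""
    return [(translate_to_source(t), t) for t in target_sentences]

# Hypothetical stand-in for a German-to-English NMT baseline trained on M30K_T.
TOY_DE_EN = {
    "ein hund rennt": "a dog runs",
    "zwei kinder spielen": "two children play",
}

def toy_translate(sentence):
    return TOY_DE_EN.get(sentence, "<unk>")

human_pairs = [("a man sleeps", "ein mann schläft")]      # stand-in for M30K_T pairs
german_only = ["ein hund rennt", "zwei kinder spielen"]   # stand-in for M30K_C German text

# Synthetic (English, German) pairs: the source side is machine-generated,
# the target side is the original human-written description.
synthetic_pairs = back_translate(german_only, toy_translate)
training_pairs = human_pairs + synthetic_pairs
```

The point of the design is that the target side of every synthetic pair stays human-written, so the model still learns to produce fluent output even though some source sentences are noisy.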

Experimental Setup

Setup

model architecture
- encoder
  - bidirectional RNN with GRU
    ( one 1024D single-layer forward RNN + one 1024D single-layer backward RNN )
- decoder
  - RNN with GRU
    ( + attention mechanism )
word embedding
- 620D for both source words and target words
- dropout = 0.2
image features
- pretrained VGG19 + penultimate fully-connected layer FC7
- dropout = 0.5
hyperparameters
- optimizer = SGD with Adadelta
- batch size = 40

Result

multi30k

All models except $IMG_{2W+D}$ outperform the existing MMT models.
- Incorporating image features into both the encoder and the decoder performed worse than expected. Incorporating them into only one of the two works better.

additional back-translated data

Effect of the data augmented through back-translation
- German → English : only the $IMG_{E}$ model improves
- English → German : the $IMG_{E}$ and $IMG_{D}$ models improve

Conclusions

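🤸🏻 Image information was integrated into an NMT model that was SOTA at the time the paper was written.

The resulting models outperform the existing text-only machine translation model.

Integrating the image at the encoder or decoder ( $IMG_{E}, IMG_{D}$ ) works better than treating the image as words ( $IMG_{1W}, IMG_{2W}$ ).

Even so, incorporating image features into both the encoder and the decoder ( $IMG_{2W+D}$ ) performs poorly.

Using back-translated data can further improve the MMT models.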
