[논문 리뷰] Multimodal Learning with Transformers: A Survey

Hayun Lee·2023년 9월 11일

multi-modal transformer

논문 리뷰

목록 보기

8/11

link: https://arxiv.org/abs/2206.06488
IEEE TPAMI 2023

Abstract

멀티모달 데이터를 지향하는 transformer 기술에 대한 포괄적인 조사를 제시
멀티모달 학습 배경, transformer 생태계 및 멀티모달 빅데이터 시대, 기하학적 위상학적 관점에서 vanilla/vision/multi-modal transformer에 대한 체계적 검토, 멀티모달 pre-training 및 특정 멀티모달 작업을 위한 멀티모달 transformer application에 대한 검토, 멀티모달 transformer model 및 application이 공유하는 공통 과제 및 설계에 대한 요약, 커뮤니티의 개방형 문제 및 잠재적 연구 방향에 대한 논의

1. INTRODUCTION

transformer를 사용한 멀티모달 학습에 초점 맞춤
transformer에 대한 입력은 하나 이상의 토큰 시퀀스와 각 시퀀스의 속성을 포함할 수 있으므로 아키텍처 수정 없이 자연스럽게 Multi Modal Learning(MML)을 허용할 수 있음
self-attention의 입력 패턴을 제어함으로써 inter-modal correlation과 learning per-modal specificity를 학습하는 것이 간단하게 실현될 수 있음
이 논문은 transformer 기반 multimodal machine learning의 상태에 대한 포괄적 검토

2. BACKGROUND

2.1 Multimodal Learning (MML)

초기 multimodal application은 인간 사회의 핵심
최근 대규모 언어 모델과 multimodal 파생 모델의 성공은 multimodal foundation model에서 transformer의 잠재력을 더욱 보여줌

2.2 Transformers: a Brief History and Milestones

vanilla transformer
- self-attention mechanism
- 원래 제안된 시퀀스별 representation learning을 위한 획기적인 모델로 다양한 NLP 작업에서 최첨단을 달성
- 이후 BERT, BART, GPT, Longformer, transformer-XL, XLNet 등 파생 모델 제안됨
visual domain
- 일반적인 파이프라인: CNN features + standard transformer encoder
- 낮은 해상도로 크기 조정하고 1D sequence로 재구성해 원시 이미지를 사전처리하여 BERT 스타일로 pretraining 달성
ViT(Vision Transformer)
- transformer의 encoder를 이미지에 적용하여 end-to-end 솔루션에 기여
- low-level tasks, recognition, detection 등 다양한 CV task에 적용됨
VideoBERT
- transformer를 multimodal에 도입한 최초 사례
- 이후 많은 transformer based multimodal pretraining model이 연구됨
CLIP
- pretraining된 모델이 zero-shot recognition을 해결할 수 있도록 하는 retrieval task로 classification을 변환하기 위해 multimodal pretrained 활용

2.3 Multimodal Big Data

Conceptual Captions, COCO, VQA, Visual Genome 등 많은 대규모 멀티모달 데이터셋 제안됨
최근 출시되는 멀티모달 데이터셋 중 새로운 트렌드
- 데이터 규모 더 큼
- 더 많은 양식으로 시각, 텍스트, 오디오 외에도 Pano-AVQA 등 더 다양한 양식의 데이터 등장
- 더 많은 응용에서의 시나리오 연구됨 - 자율주행, 금융 데이터 등
- 더 어려운 작업 - 밈의 혐오 발언과 같은 추상적인 작업 제안됨
- instructional cideo가 인기 얻고 있음

3. TRANSFORMERS

vanilla self-attention(transformer)는 위상 기하학적 기하학 공간에서 fully-connected graph로 모델링할 수 있음
다른 deep network와 비교하여 transformer는 본질적으로 보다 일반적이고 유연한 모델링 공간을 가지고 있음

3.1 Vanilla Transformer

encoder-decoder 구조를 가지며 transformer 기반 연구의 원조
토큰화된 입력을 받으며 encoder/decoder는 transformer layer/block에 의해 stack됨
각 블록에는 두 개의 sub layer, Multi Head Self Attention(MHSA) layer와 position별 Fully connected Feed Forward Network(FFN)이 있음
back propagation을 위해 gradient, MHSA, FFN 및 residual connection, normalization layer를 사용
post-normalization과 pre-normalization이라는 중요한 미해결 문제가 있음
- 원래 vanilla transformer는 각 MHSA와 FFN sub-layer에 대해 post-normalization을 사용하는데, 수학적 관점에서 고려할 때 pre-normalization이 더 합리적임
- 이는 projection 전에 normalization이 수행되어야 한다는 행렬 이론의 기본 원리와 유사함 → 이론적 연구와 실험적 검증을 통해 추가로 연구되어야 함

3.1.1 Input Tokenization

tokenization vanilla transformer는 원래 기계 번역을 위해 seauence-to-sequence model로 제안되었으므로 vocabulary sequence를 입력으로 사용하는 것이 간단함
원래 self-attention은 양식에 관계 없이 임의의 입력을 fully-connected graph로 모델링 가능
vanilla transformer와 variant transformer는 모두 토큰화된 시퀀스를 사용하며 각 토큰은 그래프의 노드로 간주
position embedding은 position 정보를 유지하기 위해 token embedding에 추가됨
vanilla transformer에서는 sin, cos func를 사용해 position embedding을 생성
input tokenization 장점
- 보다 일반적인 접근 방법
- concatenation/stack, weighted summation 등으로 input을 구성하는 보다 유연한 접근 방식
- task-specific customized tokens(masked language modelling의 경우 [MASK], classification의 경우 [class])와 호환됨
position embedding to transformers
- 시간적/공간적 정보를 제공하기 위한 일종의 spatial information을 제공하기 위한 일종의 feature space의 implicit coordinate basis of feature space로 이해될 수 있음
- position embedding은 optional
- self-attention의 수학적 관점(scaled dot-product attention)에서 고려하면 position embedding 정보가 누락된 경우 words(in text), node(in graphs)의 위치에 attention이 변하지 않음 ⇒ 대부분의 경우 transformer에는 position embedding이 필요

3.1.2 Self-Attention and Multi-Head Self-Attention

vanilla transformer의 핵심 구성 요소는 scaled dot product attention이라고도 하는 self attention(SA) 작업
- input sequence가 주어지면 self-attention을 통해 각 요소가 다른 모든 요소에 attention할 수 있으므로 self-attention은 input을 fully-connected graph로 encoding ⇒ vanilla transformer의 encoder는 fully-connected GNN encoder로 간주할 수 있으며 transformer family는NonLocal Network와 유사하게 non-local ability of global perception을 가지고 있다
Masked Self-Attention(MSA)
- transformer의 디코더가 상황별 종속성을 학습하고 위치가 subsequent position에 attending하는 것을 방지하려면 self-attention 수정 필요
- 기본적으로 MSA는 transformer model에 additional knowledge를 주입하는 데 사용됨
Multi-Head Self-Attention (MHSA)
- 여러 개의 self-attention sub-layer를 병렬로 쌓을 수 있으며 연결된 출력은 projection matrix(W)에 의해 concatenated 되어 Multi-Head Self-Attention이라는 구조를 형성
- MHSA는 모델이 multiple representation sub-spaces의 정보에 공동으로 주의를 기울이는 데 도움됨

3.1.3 Feed-Forward Network

출력은 non-linear activation이 있는 연속적인 linear layer로 구성된 position-wise Feed-Forward Network(FFN)을 통과
일부 transformer 문헌에서는 FFN을 MLP(Multi-Layer Perceptron)라고도 함

3.2 Vision Transformer

ViT에는 input image를 고정된 크기의 patches로 분할해야하는 이미지별 입력 파이프라인이 있음
linearly embedded layer를 통과하고 position embedding을 추가한 후 모든 패치별 시퀀스는 standard transformer encoder로 encoding됨

3.3 Multimodal Transformer

3.3.1 Multimodal Input

transformer family는 일반적인 graph neural network의 한 유형으로 공식화할 수 있는 일반적인 아키텍처
특히 self-attention은 global(non-local) 패턴에 주의를 기울여 각 입력을 fully-connected graph로 처리 가능
tokenization과 embedding processing이 주어지면 사용자는 데이터를 변환기에 입력하기 전에 1. 입력 토큰화 2. 토큰을 나타내는 임베딩 공간 선택
가장 일반적인 fusion은 여러 embedding을 token 방식으로 합산하는 것

3.3.2 Self-Attention Variants in Multimodal Context

multimodal transformer에서 cross-modal interactions(fusion, alignmnet)는 기본적으로 self-attention 및 그 변형에 의해 처리됨
self-attention 설계 관점에서 early summation(token-wise, weighted), early concatenation, hierarchical attention(multi-stream to one-stream), hierarchical attention(one-stream to multi-stream), cross-attention, cross-attention to concatenation을 살펴본다
early summation
- 간단하고 효과적인 multimodal interaction
- 여러 modalities의 token embeddings를 각 token position에서 가중치 합산 후 transformer layer에서 처리
  - 계산이 복잡하지 않으나 수동으로 가중치 설정해야 함
early concatenation(all-attention/CoTransformer)
- 여러 modalities의 token embedding sequences가 트렌스포머 레이어에 연결되고 입력됨
- 모든 multimodal token positions는 전체 시퀀스가 되며, 각 모달의 위치는 다른 모달의 context를 조정하여 인코딩
- VideoBERT
  - 초기에 연결하여 융합된 모달 사용 가능
  - 연결 후 시퀀스가 길어질수록 계산량 증가
Hierarchical Attention(multi-stream→one-stream)
- Transformer layers는 cross-modal interactions에 attend를 기울이기 위해 계층적으로 결합 가능
- multimodal inputs이 독립적인 Transformer streams으로 인코딩되고, 해당 outputs이 다른 Transformer에 의해 연결되고 융합됨
Hierarchical Attention(one-stream→multi-stream)
- concatenated multimodal inputs이 두 개의 개별 트랜스포머 streams이 뒤따르는 공유 단일 streams 트랜스포머에 의해 인코딩되는 hierarchical attention
- cross-modal interactions과 동시에 uni-modal representation의 독립성 유지
  - InterBERT
Cross-Attention(CoAttention)
- 2-stream 트랜스포머의 경우 Q(Query) 임베딩이 cross-tream 방식으로 교환(exchanged)/swapped되면 cross-modal interactions도 감지할 수 있음)
- VilBERT에서 처음 제안되었음
- cross-attention 각 모달을 서로 attention 하고, 계산 복잡도가 비교적 적음
- cross-modal attention을 수행하지 못하므로, 전체 context를 잃음: two-stream cross-attention은 cross-modal interaction을 학습할 수 있지만, 각 모달 내부의 self-attention은 없음
Cross-Attention to Concatenation
- cross-attention의 two streams은 global context를 모델링하기 위해 또 다른 트랜스포머에 의해 더 연결되고 처리될 수 있음
- 이러한 종류의 계층적 cross-modal interaction은 cross-attention의 단점을 보완할 수 있음

3.3.3 Network Architectures

기본적으로 다양한 멀티모달 transformer는 앞서 언급한 self-attention 변형인 internal multimodal attention으로 인해 작동
Fig. 2.에서 볼 수 있듯 attentions는 embedded된 multimodal transformer의 외부 네트워크 구조를 결정
single-stream: early summation/early concatenation
multi-streams: cross-attention
hybrid-streams: hierarchical attention/cross-attention to concatenation

4. APPLICATION SCENARIOS

4.1 Transformers for Multimodal Pretraining

4.1.1 Task-Agnostic Multimodal Pretraining

Vision Language Pretraining(VLP)
- 이미지+언어 / 비디오+언어
Speech can be used as text
- ASR 기술에서 multi-modal context는 음성 인식 도구를 사용해 음성을 텍스트로 변환 가능(videoBERT)
well-aligned multimodal data에 지나치게 의존
- 대부분의 transformer 기반 multimodal pretraining은 self-supervised 방식으로 작동하지만 잘 정렬된 multimodal sample pairs/tuple에 지나치게 의존
- 일반적으로 시각적 단서/콘텐츠 및 음성 단어의 확률이 더 높음
- cross-modal alignment를 cross-modal supervision으로 사용하는 것은 대규모 응용 프로그램의 경우 비용이 많이 듦 → pretraining corpora로 짝이 없거나 정렬되지 않은 multimodal data를 사용하는 방법은 여전히 잘 연구되지 않음
- 최근 일부 논문은 cross-modal cross-modal supervision을 사용하여 transformer가 cross-modal interactions를 학습하도록 함
Most of the existing pretext tasks transfer well across modalities.
- text영역의 Masked Language Modeling(MLM)은 오디오 및 이미지에 적용되어 있었음
- 텍스트 도메인과 비디오 도메인의 Frame Ordering Modeling(FOM)은 동일한 아이디어를 공유
모델 구조로서는 세 가지 범주로 나뉨
- self-attention의 변형을 기반으로
- single-stream, multi0stream, hybird-stream
cross-modal interactions는 pretraining pipelines의 다양한 구성 요소/레벨 내에서 수행될 수 있음
- transformer 기반 multimodal pretraining의 핵심은 transformer(encoder 포함, decoder 미포함)를 구동하여 cross-modal interaction을 학습하는 것
- 기존 사전학습 방식에서는 modal간 상호작용이 유연해 사전학습 파이프라인의 다양한 구성요소/레벨 내에서 수행할 수 있었음
- transformer 기반 multimodal pretraining은 tokenization, transformer representation, objective supervision 요소로 구성됨
- self-attention은 임의의 모달리티로부터 임의의 토큰을 그래프의 노트로 임베딩하는 것을 모델링하므로, 기존의 사전훈련 파이프라인은 모달리티 고유의 목적을 고려하지 않는 한 일반적으로 모달리티에 걸쳐 독립적으로 전송될 수 있음

4.1.2 Task-Specific Multimodal Pretraining

down-stream task에 구애받지 않는 사전학습은 선택사항
down-stream task specific pretraining도 널리 연구됨
왜냐면
- 기존 기술의 제한으로 인해 모든 다양한 down-stream 응용에서 작동하는 매우 보편적인 네트워크 architecture, pretext task 및 corpora(말뭉치) 셋을 설계하는 것은 극히 어려움
- 다양한 down-stream 응용 간에는 격차가 있어 pretraining에서 down-stream 응용으로 전환하기 어렵

4.2 Transformers for Specific Multimodal Tasks

최근 연구에 의하면 transformer는 기존 및 새로운 discriminative 응용에서 모두 다양한 multimodal inputs를 encoding할 수 있음을 보여줌
- RGB & optical flow, RGB & depth, RGB & point cloud, RGB & LiDAR 등
transformer는 single-modality to single-modality를 포함한 다양한 multimodal generative task에도 기여
- RGB to sene graph, graph to graph, knowledge graph to text 등

5. CHALLENGES AND DESIGNS

5.1 Fusion

일반적으로 MML transformer는 주로 입력(early fusion), intermediate representation(middle fusion), prediction(late fusion)의 세가지 수준에서 여러 양식에 걸쳐 정보를 융합
일반적인 초기 fusion 기반 MML transformer모델은 one-stream architecture라고도 알려져 있으며, 최소한의 architecture 수정으로 BERT의 장점 채택 가능
one-stream model의 주요 차이점은 variant masking techiques과 함께 problem-specific modalities를 사용한다는 것

5.2 Alignment

cross-modal alignment는 다양한 현실 세계 멀티모달 응용의 핵심
- transformer based cross-modal alignment: 다중 화자 비디오의 화자 위치 추정, speech translation, text-to-speech alignment 등
대표적 방법은 쌍을 이루는 샘플에 대한 contrastive learning을 통해 두 가지 modalities를 common representation space에 매핑하는 것
- 그러나 이러한 모델은 크기가 크고 훈련 데이터 최적화에 비용이 많이 듦
- 결과적으로 후속 작업에서는 다양한 downstream task를 처리하기 위해 사전훈련된 모델을 주로 활용
- 이러한 alignment model은 특히 prompt engineering을 통한 이미지 분류를 위한 zero-shot transfer particularly를 갖추고 있음
- 까다롭고 세분화된 작업(object detection, VQA, instance retrieval 등)에서, region level alignment를 적용
  - 세분화된 alignment는 cost 더 커짐
  - 최근에는 random sampling, learning concept dictionary, uniform masking 등

5.3 Transferability

다양한 데이터셋 및 응용에서 모델은 transfer하는 방법에 대한 것과 관련된 transformer based multimodal learning의 주요 과제
data augmentation, adversarial perturbation 전략은 multimodal transformer가 일반화 능력을 갖추는 데 도움 됨(ex. VILLA, CLIP 등)
transformer는 overfitting 가능성이 있어 최근 일부 사례에서는 noise가 없는 데이터셋에서 훈련된 oracle 모델을 실제 데이터셋으로 transfer하는 방법을 활용
cross-task gap은 transfer의 또 다른 어려운 점
- multimodal dataset을 사용하여language pretrained model을 finetune하는 것은 어려움
- 실제 응용에서는 multi modal pretrained transformer가 modalities를 누락하는 문제로 인해 inference 단계에서 uni-modal 데이터를 처리해야 하는 경우가 있음
  - knowledge distillation을 사용하여 해결
    - transformer에서 multimodal → uni-modal attention
    - transformer에서 여러 uni-modal → transformer encoder로 share
- transformer 기반 multimodal learning의 transferability를 위해 cross-lingual gap도 고려해야 함(English → non-English)

5.4 Efficiency

multimodal transformer가 겪는 효율성 문제
- 모델 매개변수 용량이 커 데이터가 부족하여 대규모 training dataset에 의존
- self-attention으로 인해 입력 시퀀스 길이에 따라 2차적으로 증가하는 시간 및 메모리 복잡성
최근에 사용되는 다양한 아이디어
- knowledge distillation
  - 훈련된 larger transformer를 smaller transformers로 정제
  - early concatenation based transformer( $\mathcal{O}((N_{(A)}+N_{(B)})^2)$ ) → faster one(independently dual branch transformer)( $\mathcal{O}(N_{(A)}^2)$ )
- 모델 단순화 및 압축
  - pipeline 단순화 위해 구성 요소 제거
    - VLP transformer
      - 2-stage pipeline은 object detector가 필요해서 비용 많이 듦 → VLP, ViLT와 같이 convolution 없는 방식으로 시각적 입력 처리
    - DropToken
      - 훈련 중 무작위로 토큰의 일부 삭제
    - weight 공유
- Asymmetrical network structures
  - 다양한 양식에 대해 다양한 모델 용량과 계산 크기를 적절하게 할당하여 매개변수 저장
- training sample의 utilization 향상
  - 다양한 granularity(세분성)에서 더 적은 수의 샘플을 최대한 활용하여 단순화된 LXMERT 교육
    - CLIP을 훈련하는 데 각각의 modality에서 self-supervision/multi-view supervision across modalities/다른 유사한 쌍에서 nearest-neighbour supervision
- compressing and pruning model
  - multimodal transformer에서 optimal sub-structures/sub-networks를 탐색
- self-attention의 복잡성 최적화
  - $\mathcal{O}(N^2)$ 복잡성을 최적화
    - 2차 복잡도를 $\mathcal{O}(N\sqrt N)$ 으로 줄이기 위해 attention matrix의 sparse factorization(희소 인수분해) 제시
    - TransformerLS
- self-attention 기반 multimodal interaction/fusion의 복잡성 최적화
  - early concatenation 기반의 multimodal interaction을 개선하기 위해 FSN(Fusion Bottleneck)을 통한 Fusion vias Attention Bottleneckㅇㄹ 제안
- 기타 전략
  - greedy strategy: 시퀀스 길이의 오름차순으로 두 개의 인접한 뷰의 모든 쌍 사이의 정보를 순차적으로 융합

5.5 Robustness

large-scale corpora에 pretrained된 multimodal transformers는 SOTA를 달성하지만 robustness는 여전히 불명확
robustness를 이론적으로 분석하는 방법, robustness를 개선하는 방법 연구해야 함
최근 평가
- transformer components/sublayer가 robustness에 어떻게 기여하는지 연구 및 평가
- but 아직 transformer family를 분석하기 위한 이론적 도구 부족함
- 최근 robustness 분석하는 일반적인 관행은 주로 데이터 셋 간 평가, perturbation 기반 평가 등
- 최근의 시도
  - augmentation과 adversarial traning 기반 전략
  - fine-grained loss func

5.6 Universalness

최근 많은 연구가 다양한 양식과 멀티모달 작업을 처리하기 위해 가능한한 통합된 파이프라인을 사용하는 방법을 연구
최근 시도
- uni-modal과 multimodal input/task를 위한 파이프라인 통합
- multimodal 이해와 generation을 위한 파이프라인 통합
  - 일반적으로 multimodal transformer pipeline의 경우 understanding과 discrimivative task는 encoder만 필요
  - 반면 generation/generative task는 encoder, decoder 모두 필요
  - 기존 시도는 multitasking learning을 사용하여 understanding과 generation workflow 결합하여 multitask loss func에 의해 공동으로 학습되도록 함
    - E2E-VLP: encoder+decoder
    - UniVL, CBT: separate encoders + cross encoder + decoder
    - VLP: single unified/combined encoder-decoder
    - two-stream decoupled design
- CLIP
  - 작업 자체를 통합하고 변환하면 zero-shot recognition을 retrieval로 변환
위 시도들의 과제 및 bottlenecks
- universal model은 보편성과 비용 간의 균형 고려
- multi-task loss func는 학습의 복잡성 증가시킴

5.7 Interpretability

transformer가 multimodal learning에서 왜 잘 수행되는지와 그 방법 조사

6. DISCUSSION AND OUTLOOK

MML 모델을 다양한 모달리티에서 탁월한 성과를 내도록 디자인하는 것은 어려운 도전임
2 stream architecture는 cross-modal retrieval 작업에 효율적이지만, 작업-중립 MML 아키텍처를 설계하는 것은 여전히 어려움
최첨단 기술과의 간극이 존재하며, 모든 작업에 적용 가능한 모델 디자인을 탐구하기 위한 연구가 진행 중
모델 디자인과 계산 비용 등 여러 요인을 고려해야 하며, 산업 연구팀들이 이 분야에서 활발한 연구를 이끌 것으로 예상
transformer는 implicit knowledge를 인코딩할 수 있음
multi-head는 모델의 representation 능력을 더욱 향상시킬 수 있는 여러 모델링 sub-space를 제공
Transformer는 본질적으로 Non-Local 패턴을 인식하는 Global Aggregation의 성격을 가지고 있음
큰 모델 용량 덕분에 Transformer 모델은 대규모 말뭉치에 대한 효과적인 사전 학습을 통해 까다로운 도메인 gap 및 shift를 더 잘 처리
transformer는 입력을 테이블 및 SQL과 같은 더 많은 양식과 본질적으로 호환되는 그래프로 나타낼 수 있음
시리즈 및 시퀀스 패턴(시계열) 모델링의 경우 Transformer는 훈련 및/또는 추론의 병렬 계산 덕분에 RNN 기반 모델에 대해 더 나은 훈련 및 추론 효율성 가짐
토큰화를 사용하면 Transformer가 multimodal input을 유연하게 구성할 수 있음

7. CONCLUSION

이 survey는 transformer를 사용한 multimodal machine learning에 중점을 둠

Hayun Lee

이전 포스트

[논문 리뷰] Adashare: Learning What To Share For Efficient Deep Multi-Task Learning

다음 포스트

[논문 리뷰] Multimodal Learning with Transformers: A Survey

논문 리뷰

Abstract

1. INTRODUCTION

2. BACKGROUND

2.1 Multimodal Learning (MML)

2.2 Transformers: a Brief History and Milestones

2.3 Multimodal Big Data

3. TRANSFORMERS

3.1 Vanilla Transformer

3.1.1 Input Tokenization

3.1.2 Self-Attention and Multi-Head Self-Attention

3.1.3 Feed-Forward Network

3.2 Vision Transformer

3.3 Multimodal Transformer

3.3.1 Multimodal Input

3.3.2 Self-Attention Variants in Multimodal Context

3.3.3 Network Architectures

4. APPLICATION SCENARIOS

4.1 Transformers for Multimodal Pretraining

4.1.1 Task-Agnostic Multimodal Pretraining

4.1.2 Task-Specific Multimodal Pretraining

4.2 Transformers for Specific Multimodal Tasks

5. CHALLENGES AND DESIGNS

5.1 Fusion

5.2 Alignment

5.3 Transferability

5.4 Efficiency

5.5 Robustness

5.6 Universalness

5.7 Interpretability

6. DISCUSSION AND OUTLOOK

7. CONCLUSION

[논문 리뷰] Adashare: Learning What To Share For Efficient Deep Multi-Task Learning

[논문 리뷰] A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models

0개의 댓글