[Multimodal_03] Multimodal Abstractive Summarization for How2 Videos (ACL, 2019)

fla1512 · March 9, 2023

As of 2023/03/10, this paper had been cited 59 times

Abstract

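Research on abstractive summarization of open-domain videos

Types of summarization (for reference):
extractive summary: builds the summary by pulling out pieces of the passage (usually sentences)
abstractive summary: builds the summary by generating novel words (words not in the source text); this is the kind of summary people usually produce when asked to summarize
Unlike the text news summarization studied so far,
- the goal is to turn a video and its audio transcript (text) into a fluent textual summary
What this work demonstrates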
- how a multi-source sequence-to-sequence model with hierarchical attention can integrate information from different modalities into a coherent output, compare various models trained with different modalities and present pilot experiments on the How2 corpus of instructional videos.
- propose a new evaluation metric (Content F1) for abstractive summarization task that measures semantic adequacy rather than fluency of the summaries, which is covered by metrics like ROUGE and BLEU.

1 Introduction

Background

As video-sharing platforms have surged in popularity, user-generated instructional videos shared online have become more popular as well
With so many videos available online -> demand has grown for efficient ways to search for and retrieve relevant videos

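Limitations of prior work

Many cross-modal search applications rely on text associated with the video
- e.g., videos that have a description or title used to find relevant content
However, videos often lack such text metadata, and there is no guarantee that it gives accurate information about the video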

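The paper's approach to overcoming this limitation

To address this, the paper aims to generate a short text summary of each video
- the summary describes the most salient content of the video
With this, 1) users get better contextual information and a better user experience, and 2) user engagement can also increase (since retrieving relevant videos well draws attention)
Summarization produces a shorter version of a document's content,
- summarization that preserves information has been studied for both images and text
  - text: automatic text summarization
    - a widely studied area of NLP whose goal is to produce a text summary given a text document
    - so that users can make sense of long documents
    - most text summarization so far has been single-document summarization in domains such as news, with occasional multi-document work
  - images: visual documents (images, videos) => video summarization
    - a mix of text + visual modalities
    - summarizes the content of a video -> to obtain a text document
    - multimodal summarization is more challenging because there is no benchmarking dataset yet
    - Li et al. (2017) collected a multimodal corpus
      - 500 English news videos and articles, manually paired with annotated summaries
      - the dataset is small-scale, and
      - it has news articles with audio, video, and text summaries, but no human-annotated audio transcripts
Tasks related to this work
- image/video captioning, description generation, video story generation, etc.
- of these, the most closely related task is video title generation
  - the task is to capture the most important event of the video in a title that attracts users' attention (-> the goal)
  - Zhou et al. (2018) presented the YouCook2 dataset
    - it contains instructional videos
      - specifically cooking recipes (with temporally localized annotations for the procedure)
      - because of this, although it is really a temporal alignment between video segments and procedure steps, it can also be viewed as summarization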
Method

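This work studies multimodal summarization
- with a variety of methods,
- in order to summarize the intent of open-domain instructional videos
- uses the How2 dataset
  - a dataset containing human-annotated video summaries (covering a wide variety of topics)
- method: generate a description from the transcriptions and visual features extracted from the video
- introduces a new evaluation metric (Content F1)
  - suited to this task, and better at surfacing the detailed results needed to understand it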

2 Multimodal Abstractive Summarization

How2 dataset
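- 2000 hours of short instructional videos (across domains: cooking, sports, music, etc.)
- each video comes with a human-generated transcript and a 2-3 sentence summary
Example
- the summary captures the video well overall
  - it notes that the peppers are being “cut”
  - phrases such as “Cuban breakfast recipe” => do not appear in the transcript
- found that the text and vision modalities complement each other well and, when fused, allow better summaries to be generated
- (going further) the speech modality is used as well
  - by feeding a speech recognizer's output to the summarization model as input (in place of the human-annotated transcript used originally)
- How2 corpus
  - 73,993 videos (train), 2,965 (validation), 2,156 (test)
  - average transcript length: 291 words
  - average summary length: 33 words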

Video-based Summarization.

How sequential features are obtained from the video

To represent a video as features, the features are extracted from a pre-trained action recognition model (ResNeXt-101 3D Convolutional Neural Network)
- feature: 2048 dimensions, extracted for every 16 non-overlapping frames in the video
The result is a sequence of feature vectors per video
- these sequential features are used by the models described in Section 3
  - ResNeXt-101 3D Convolutional Neural Network: trained to recognize 400 different human actions in the Kinetics dataset
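
To make the "one feature vector per 16 non-overlapping frames" idea concrete, here is a minimal sketch. It is not the paper's extraction code: `toy_clip_feature` is a hypothetical stand-in for the ResNeXt-101 3D CNN, the 4-dimensional output replaces the real 2048 dimensions, and dropping a trailing partial clip is an assumption.

```python
from typing import List, Sequence

CLIP_LEN = 16  # frames per clip, as stated above


def split_into_clips(frames: Sequence, clip_len: int = CLIP_LEN) -> List[Sequence]:
    """Split frames into consecutive non-overlapping clips.

    Dropping a trailing partial clip is an assumption of this sketch.
    """
    n_clips = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


def toy_clip_feature(clip: Sequence[float], dim: int = 4) -> List[float]:
    """Hypothetical stand-in for the 3D CNN (the real features are 2048-dimensional)."""
    mean = sum(clip) / len(clip)
    return [mean] * dim


frames = [float(i) for i in range(40)]           # 40 toy "frames"
clips = split_into_clips(frames)                  # 2 full clips, 8 frames dropped
features = [toy_clip_feature(c) for c in clips]   # one vector per clip
print(len(features), len(features[0]))
```

The output of this step is a list of vectors whose length grows with the video, which is why the models later need either pooling or a recurrent layer to consume it.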

Speech-based Summarization.

How is it used? The speech modality is leveraged through the output of a pre-trained speech recognizer
- the pre-trained speech recognizer is trained on other data, and its output is fed to a text summarization model as input
A SOTA model for distant-microphone conversational speech recognition (ASpIRE, EESEN) is used
- with this model, the word error rate on the How2 test data is 35.4%
  - the high error is attributed to normalization issues in the data
  - for example, recognizing and labeling '20' as 'twenty'
  - handling this properly would reduce the error, but the paper simply accepts the situation as it is
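
As a rough illustration of how a normalization mismatch like '20' vs. 'twenty' inflates the word error rate, here is a minimal sketch of WER as word-level edit distance divided by reference length. The example sentence is made up, and the function assumes a non-empty reference.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


ref = "bake for 20 minutes".split()
hyp = "bake for twenty minutes".split()
print(word_error_rate(ref, hyp))  # one substitution out of four reference words
```

The recognizer output here is arguably correct, yet it still counts as one error in four words, which is the kind of effect the normalization issue refers to.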

Transfer Learning.

Sanabria et al. (2019), work done in parallel with this paper, demonstrates the use of these summarization models in a transfer-learning-based summarization task
- the Charades dataset consists of audio, video, and text (summary, caption, question-answer pair) modalities
  - this data is similar to the How2 dataset
Sanabria et al. (2019) show that, for a task on the Charades dataset, pre-training on the How2 dataset and then applying transfer learning brings significant improvements on both unimodal and multimodal adaptation tasks

CMU Sinbad’s Submission for the DSTC7 AVSD Challenge (2019, AAAI)
Audio-Visual Scene-Aware Dialog (AVSD)

we first train models on the How2 data and then finetune (FT) them on the Charades dataset.

3 Summarization Models

Several summarization models are considered

1. Recurrent Neural Network (RNN) Sequence-to-Sequence (S2S) model
- encoder RNN: encodes the input (text or video features) with an attention mechanism
- decoder RNN: generates the summaries
2. Pointer-Generator (PG) model
- has shown good performance in abstractive summarization

Pointer Networks (2015; paper, paper summary by 민한)

Introduces the Pointer Net (Ptr-Net)

pointer-generator network: combines pointing (Vinyals et al., 2015), for model accuracy and OOV handling, with a generator for producing new words

pointing copies words from the source text, while generating produces words from within the vocab
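
The post does not spell out how pointing and generating are combined, so the sketch below follows the commonly used pointer-generator formulation (See et al., 2017), which is an assumption here: the final word probability is `p_gen * P_vocab(w)` plus `(1 - p_gen)` times the attention mass on source positions holding `w`. All numbers and tokens are toy values.

```python
from typing import Dict, List


def pointer_generator_distribution(
    p_gen: float,
    vocab_probs: Dict[str, float],
    source_tokens: List[str],
    attention: List[float],
) -> Dict[str, float]:
    """Mix generation and copy distributions: p_gen * P_vocab + (1 - p_gen) * attention."""
    final = {word: p_gen * p for word, p in vocab_probs.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy probability accumulates over every source position holding this token,
        # which is how out-of-vocabulary source words get non-zero probability.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


vocab_probs = {"cut": 0.5, "the": 0.5}       # toy generator output
source_tokens = ["cut", "the", "habanero"]   # "habanero" is out of vocabulary
attention = [0.2, 0.2, 0.6]                  # toy attention over source positions
dist = pointer_generator_distribution(0.5, vocab_probs, source_tokens, attention)
print(dist)
```

Here "habanero" ends up with probability about 0.3 even though the generator cannot produce it, which is the OOV benefit mentioned above.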

3. Uses the hierarchical attention approach of Libovický and Helcl (2017)
- originally proposed for multimodal machine translation (combining textual and visual modalities to generate text)
- how it works
  - the model first computes a context vector for each input modality (text/video) independently
  - the context vectors are then treated as the states of another encoder, and a new context vector is computed over them

Attention Strategies for Multi-Source Sequence-to-Sequence Learning

Sequence-to-sequence (S2S) learning with attention mechanism

3.2 Hierarchical Attention Combination

As in concatenation, each context vector is computed independently

Instead of concatenation, a second attention mechanism is constructed over the context vectors

The computation of the attention distribution is split into two steps

1 compute the context vector for each encoder independently using Equation 3

2 project the context vectors (and optionally the sentinel) into a common space (Equation 8), compute another distribution over the projected context vectors (Equation 9), and take their corresponding weighted average (Equation 10)
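
The two steps can be sketched as follows. This is a simplified illustration, not the formulation from Libovický and Helcl (2017): dot-product scoring is swapped in for the learned scoring function, the projection into a common space and the sentinel are omitted, and all encoders are assumed to share one dimensionality already.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, states: List[Vector]) -> Vector:
    """Attention-weighted average of states (dot-product scoring is an assumption)."""
    weights = softmax([dot(query, s) for s in states])
    dim = len(states[0])
    return [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]


def hierarchical_attention(query: Vector, encoders: List[List[Vector]]) -> Vector:
    # Step 1: one context vector per encoder (modality), computed independently.
    contexts = [attend(query, states) for states in encoders]
    # Step 2: a second attention treats those context vectors as encoder states.
    return attend(query, contexts)


text_states = [[1.0, 0.0], [3.0, 0.0]]   # toy text encoder states
video_states = [[0.0, 2.0]]              # toy video encoder states
print(hierarchical_attention([0.0, 0.0], [text_states, video_states]))  # [1.0, 1.0]
```

With a zero query every score is equal, so both attention levels reduce to plain averages; a non-zero decoder state would shift weight toward whichever modality matches it better.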

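For video, when a sequence of action features is used instead of a single averaged vector, an RNN layer helps capture context
Fig2: building blocks of the models
- the building blocks of the sequence-to-sequence models
- the gray numbers in parentheses indicate which configuration is used in which experiment
  
  Meaning of the numbers: the model numbers in Tab1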

4 Evaluation

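1) ROUGE-L (the standard metric) is used for abstractive summarization
- it measures the longest common subsequence (LCS) between the reference and the generated summary

ROUGE-L, for reference

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

A performance metric for text summarization models

A metric for evaluating natural language generation models such as automatic text summarization and machine translation

The score is computed by comparing the model-generated summary or translation against a reference prepared in advance by a human

ROUGE-L: uses the LCS (longest common subsequence) to measure the longest in-order match between the two texts; the matched words need not be contiguous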
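
A minimal sketch of an LCS-based score in the spirit of ROUGE-L. The balanced F1 used here is an assumption made for simplicity (ROUGE-L implementations typically expose a weighted F-measure), and the sentences are made-up examples.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: List[str], hypothesis: List[str]) -> float:
    """LCS-based precision and recall combined into a balanced F1."""
    lcs = lcs_length(reference, hypothesis)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hypothesis)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


reference = "learn how to cut peppers".split()
hypothesis = "learn to cut green peppers".split()
print(lcs_length(reference, hypothesis))   # 4: learn, to, cut, peppers
print(rouge_l_f1(reference, hypothesis))   # about 0.8
```

Note that "green" sitting between "cut" and "peppers" does not break the match, since a subsequence only has to preserve order.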

2) Introduces the Content F1 metric
- that fits the template-like structure of the summaries
- analyzes the most frequently occurring words in the transcriptions and summaries
  - words in the transcripts reflect conversational, spontaneous speech, whereas words in the summaries are descriptive in nature
- example (Tab A1)

Content F1.

The F1 score of the content words in the summaries, based on a monolingual alignment
- similar to the metrics used to evaluate the quality of monolingual alignment
Procedure
- 1 use the METEOR toolkit -> obtain the alignment
- 2 remove 1) function words and 2) task-specific stop words that appear in most of the summaries
  - stop words are easy to predict -> they inflate the ROUGE score
  - the content words remaining in the reference and the hypothesis are treated as two bags of words, and the F1 score is computed over the alignment
  - note that the score ignores the fluency of the output
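
The procedure above can be approximated as follows. This is only a sketch under stated assumptions: exact word matches stand in for the METEOR alignment, and `STOP_WORDS` is a hypothetical list, since the paper's actual function-word and task-specific stop-word lists are not given in this post.

```python
from collections import Counter
from typing import Iterable, List

# Hypothetical stop-word list (function words plus a few task-specific words).
STOP_WORDS = {"a", "an", "the", "to", "in", "this", "how", "and", "of", "on",
              "from", "learn", "video", "free"}


def content_words(tokens: Iterable[str]) -> List[str]:
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]


def content_f1(reference: List[str], hypothesis: List[str]) -> float:
    """F1 over bags of content words; word order (fluency) is ignored."""
    ref_bag = Counter(content_words(reference))
    hyp_bag = Counter(content_words(hypothesis))
    # Exact matches stand in for METEOR alignments in this sketch.
    matched = sum((ref_bag & hyp_bag).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp_bag.values())
    recall = matched / sum(ref_bag.values())
    return 2 * precision * recall / (precision + recall)


reference = "learn how to cut peppers in this free cooking video".split()
hypothesis = "learn how to slice peppers in this free video".split()
print(content_f1(reference, hypothesis))  # about 0.4
```

The two summaries share eight of their surface words, yet only "peppers" survives as a matched content word, which shows how template words can prop up an n-gram metric without carrying content.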

Human Evaluation
Carried out in addition to the automatic evaluation, to better understand the task

Follows the abstractive summarization human annotation work of Grusky et al. (2018)
Outputs are rated 1-5 (based on four metrics: informativeness, relevance, coherence, fluency)
- this is done on 500 videos randomly sampled from the test set
Three models are evaluated
- two unimodal (text-only (5a), video-only (7)) and one multimodal (text-and-video (8)).
Three annotators on Amazon Mechanical Turk annotate each video
Details: A.5
- annotators compare the three models' outputs against the ground-truth summary and assign a score from 1 (lowest) to 5 (highest)

Ground truth: the term for the original or actual values of the data to be learned

the reality you want to model with your supervised machine learning algorithm.

also known as the target for training or validating the model with a labeled dataset.

5 Experiments and Results

Experiments
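There are three main kinds of baseline

1 RNN language model
- trained on all the summaries, with tokens then randomly sampled from it
- the resulting output is fluent English and gets a high ROUGE score
- but the content is unrelated, so the Content F1 score is low (Model No.1 in Tab1)
2 the target summary is replaced with a rule-based summary extracted from the transcript (Model No.2a)
- the sentence containing the words “how to” together with learn, tell, show, discuss, or explain is used; this is usually the second sentence of the transcript.
3 a model trained on the summary of each video's nearest neighbor in Latent Dirichlet Allocation (LDA; Blei et al., 2003) topic space (Model No.2b)
- its Content F1 score (17.9) is similar to that of baseline 2 (18.8)
  - this demonstrates the similarity of the content + the usefulness of the Content F1 score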
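
The rule behind baseline 2a can be sketched like this. The exact matching rules of the paper are not given in this post, so the function below is an assumption: it returns the first sentence that contains "how to" plus one of the listed verbs in its base form, and the transcript is a made-up example.

```python
import re
from typing import List, Optional

CUE_VERBS = ("learn", "tell", "show", "discuss", "explain")


def extract_rule_based_summary(sentences: List[str]) -> Optional[str]:
    """Return the first sentence containing 'how to' and one of the cue verbs."""
    for sentence in sentences:
        lowered = sentence.lower()
        words = set(re.findall(r"[a-z]+", lowered))
        if "how to" in lowered and any(verb in words for verb in CUE_VERBS):
            return sentence
    return None  # no sentence matches the rule


transcript = [
    "Hi, I'm Maria.",
    "Today I'm going to show you how to make a Cuban breakfast.",
    "First, grab some peppers.",
]
print(extract_rule_based_summary(transcript))
```

In this toy transcript the rule picks the second sentence, in line with the observation above that the matching sentence usually comes second.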

Results
Using the transcript (ground-truth transcript and speech recognition output) and video action features -> several models are trained with different combinations of modalities

1 the text-only model performs best when the complete transcript is used as input (650 tokens)
- this contrasts with prior work on news-domain summarization
2 on this data, the PG network (5b) is found not to perform as well as the S2S model (5a)
- reasons
  - 1) abstractive nature of our summaries
  - 2) the lack of common n-gram overlap between input and output (an important factor for PG networks)
3 automatic transcriptions obtained from a pretrained automatic speech recognizer are used as input to the summarization model (5c)
- this model gives results similar to the video-only models (6-7)
- significantly lower than the summarization model using ground-truth transcriptions
- this is expected -> because of the large margin of ASR errors (in distant-microphone open-domain speech recognition)
Two video-only models (6-7) are trained
- 1. uses a single mean-pooled feature vector representation for the entire video
- 2. applies a single-layer RNN over the vectors in time
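
A small sketch of the difference between the two video-only inputs, using toy 2-dimensional vectors in place of the real features. Mean pooling collapses the sequence into a single vector and therefore cannot tell two orderings apart, whereas a recurrent layer reads the vectors in order.

```python
from typing import List

Vector = List[float]


def mean_pool(features: List[Vector]) -> Vector:
    """Collapse a sequence of clip features into one vector by averaging over time."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


clip_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy per-clip features
pooled = mean_pool(clip_features)
print(pooled)                                            # [3.0, 4.0]
print(pooled == mean_pool(clip_features[::-1]))          # True: order is lost
```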
Using only the action features as input (7) gives ROUGE and Content F1 scores close to those of the text-only model -> this demonstrates the importance of both modalities for this task
Finally, the hierarchical attention model (8), which combines the two modalities, achieves the best score

Interpreting Tab2

Human evaluation results for the best-performing text-only, video-only, and multimodal models
On three evaluation measures, the multimodal models with hierarchical attention get the best scores
The appendix has further details on model hyperparameter settings and so on

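Interpreting Fig3

How should this figure be read?

The average summary length is 33 words, and the density curve peaks at around 33, slightly below 0.06. The area under a curve should come to 1, but a rough estimate of the area under one curve gives about 0.15, and 0.15 × 7 = 1.05 => so it seems the densities of the 7 models add up to 1

Comparison of word distributions against the human summaries for the different unimodal and multimodal models
Density curve: shows the length distributions of the human-annotated and system-produced summaries
The word distributions of the summaries generated by the different systems are analyzed against the human-annotated reference
- the density curves show that most model outputs are shorter than the human annotations, with the action-only model (6) being the shortest
- the uni-modal and multimodal systems with ground-truth text and ASR are similar in length
  - this shows that the gains in Rouge-L and Content F1 score come from content rather than length
Tab A.2 shows how the outputs differ
- Tab A.2: various outputs from the text-only and text-and-video models
text-only: produces fluent output similar to the reference
RNN model with action features
- generates text that was not in the input
- an in-domain (“fly tying” and “fishing”) abstractive summary
  - includes more details such as equipment (missing from the text-based models but relevant)
RNN model without action features
- falls within the relevant domain but includes fewer details
nearest neighbor model
- related to "knot tying" but not to "fishing"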
6 Conclusions

Presents several baseline models that generate abstractive text summaries on the How2 data
The presented models include a video-only summarization model
- its results are comparable to the text-only model (competitive)
Future work: generating multi-document (multi-video) summaries + building end-to-end models directly from the audio in the video
Proposes a new metric, Content F1
- as an evaluation for video summarization

Follow-up work:

MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention (EMNLP, 2020)

Code review

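7 Thoughts on the Paper

What I liked
- the paper is cleanly organized and easy to read
What could have been better
- it would have been nice if the paper had explained more clearly what sets it apart from Sanabria et al. (2019), which it mentions as similar work => for example, with a summary of where its contributions lie
  - what came to mind after reading: proposing the Content F1 metric, conducting human annotation
- the Pointer-Generator (PG) model is not explained; perhaps because this is a short paper, explanations of the other models used are sparse overall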
