Progressive Video Summarization via Multimodal Self-supervised Learning

헤고 · January 9, 2023

paper

Key Contribution

  • Multi-modal self-supervised learning
    - 1) Cross-modality Semantic Correspondence: fine-grained modeling (local) + coarse-grained modeling (global)
    - 2) Temporal Dependencies in Videos: recovering masked frames
  • Progressive Video Summarization: pinpoints important content over multiple stages, iteratively refining the video sequence by emphasizing that content

pretraining

  • text encoder: [CLS], category, search query, title, description
  • video encoder: frame features (from the GoogLeNet pool5 layer); $f_{cls}$ is a learnable feature -> $g_0$ is the representation of the whole video

objectives

Cross-modality Semantic Correspondence

  • coarse-grained modeling: predict whether the video and the text are corresponding
    - input: $g_0, z_0$
    - BCE loss
  • fine-grained modeling: measure the distance between frame sets and text sets
    - contrastive loss
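
The coarse-grained objective can be read as binary classification over a (video, text) pair. Below is a minimal sketch under assumptions, not the paper's implementation: the matching head (a single linear layer on the concatenation of $g_0$ and $z_0$), the weights `w` and `b`, and the function names are all hypothetical and only illustrate how a BCE loss applies here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def match_probability(g0, z0, w, b):
    """Hypothetical matching head: linear layer on concat(g0, z0), then sigmoid."""
    pair = g0 + z0  # list concatenation of the video and text representations
    logit = sum(wi * xi for wi, xi in zip(w, pair)) + b
    return sigmoid(logit)

def bce_loss(p, label, eps=1e-12):
    """Binary cross-entropy for one pair; label is 1 if video and text correspond, else 0."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# Toy values, chosen only for illustration.
g0 = [0.5, -0.2]
z0 = [0.1, 0.4]
w = [1.0, 1.0, 1.0, 1.0]
p = match_probability(g0, z0, w, 0.0)
loss_pos = bce_loss(p, 1)  # loss if the pair is a true match
loss_neg = bce_loss(p, 0)  # loss if the pair is a mismatch
```

Because the toy logit is positive, the predicted match probability is above 0.5, so the loss is smaller when the pair is labeled as matching.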

Temporal Dependencies in Videos

  • replace a randomly selected frame with a learnable feature $m$
  • rather than predicting the masked frame directly from its encoded feature, recover the frame by considering the temporal dependencies between the masked frame and the whole video
    - $MLP_s$ predicts whether the transition is smooth: $p_s = MLP_s(g_t)$, where $g_t$ is the encoded feature of $m$
    - 1) if smooth transition: recover it using only its neighbors (local info) with a one-layer Transformer and a linear projection
    - 2) if abrupt transition: use $g_t$ to recover it, as $g_t$ contains global info
  • the masked frame is recovered by combining 1) & 2)
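
One way to read "combining 1) & 2)" is as a soft blend weighted by $p_s$. The sketch below assumes exactly that weighting, which these notes do not state, and uses stand-in vectors for the local and global recoveries; `recover_masked_frame` is a hypothetical name.

```python
def recover_masked_frame(p_s, local_rec, global_rec):
    """Blend the two recoveries; p_s is the predicted probability of a smooth transition.

    Assumption (not confirmed by the notes): recovered = p_s * local + (1 - p_s) * global.
    """
    if not 0.0 <= p_s <= 1.0:
        raise ValueError("p_s must be a probability in [0, 1]")
    if len(local_rec) != len(global_rec):
        raise ValueError("recoveries must have the same dimension")
    return [p_s * l + (1.0 - p_s) * g for l, g in zip(local_rec, global_rec)]

local_rec = [1.0, 2.0]   # stand-in for the neighbor-based (local) recovery
global_rec = [3.0, 6.0]  # stand-in for the g_t-based (global) recovery

smooth = recover_masked_frame(1.0, local_rec, global_rec)  # fully local
abrupt = recover_masked_frame(0.0, local_rec, global_rec)  # fully global
mixed = recover_masked_frame(0.5, local_rec, global_rec)   # even blend
```

Under this reading, a confidently smooth transition relies on neighboring frames, while a confidently abrupt one falls back to the global feature.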

Progressive Summarizer

summarizer

  • the frame scores from each stage are multiplied together
    - $s^* = s^1 \odot s^2 \odot \dots \odot s^N$
  • with Text Info
    - $z_0$, the output of the pretrained text encoder, is combined with the visual modality and fed into the scoring function
    - $s^1 = \sigma((G^1 + F^1 + MLP_\tau(z_0))W^1 + b^1)$
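
The stage-wise product $s^* = s^1 \odot s^2 \odot \dots \odot s^N$ can be sketched in a few lines. This is only an illustration: the per-stage scores are toy values rather than outputs of the scoring function, and `combine_stage_scores` is a hypothetical helper name.

```python
import math

def combine_stage_scores(stage_scores):
    """Element-wise product of per-stage frame scores (one list of scores per stage)."""
    if not stage_scores:
        raise ValueError("need at least one stage")
    n_frames = len(stage_scores[0])
    if any(len(s) != n_frames for s in stage_scores):
        raise ValueError("every stage must score the same number of frames")
    # zip(*stage_scores) groups the scores of each frame across all stages.
    return [math.prod(per_frame) for per_frame in zip(*stage_scores)]

s1 = [0.9, 0.5, 0.2]  # toy stage-1 scores for three frames
s2 = [0.8, 0.5, 0.1]  # toy stage-2 scores for the same frames
final = combine_stage_scores([s1, s2])  # approximately [0.72, 0.25, 0.02]
```

Since every score lies in (0, 1) after the sigmoid, a frame keeps a high final score only if it is rated highly at every stage, which is what makes the refinement progressive.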

Experiment

  • SumMe: the video name is used as the search query; the other three text types are left empty
  • TVSum: the video name is used as the search query and category data is available; the remaining types are left empty