Key Contribution
- Multi-modal self-supervised learning
- 1) Cross-modality Semantic Correspondence: Fine-grained Modeling (local) + Coarse-grained Modeling (global)
- 2) Temporal Dependencies in Videos: recovering masked frames
- Progressive Video Summarization: uses multiple stages to pinpoint important content, iteratively refining the video sequence by emphasizing the important content

- text encoder: input is the concatenation of [CLS], category, search query, title, and description
- video encoder: frame features (from the GoogLeNet pool5 layer) with a learnable feature f_cls prepended -> g0 is the representation of the whole video (see the sketch below)
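
A minimal sketch of the two encoders, assuming standard PyTorch Transformer encoders; the dimensions, layer counts, vocabulary size, and class names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Transformer over frame features with a learnable f_cls token prepended."""
    def __init__(self, frame_dim=1024, dim=512, layers=2, heads=8):
        super().__init__()
        self.proj = nn.Linear(frame_dim, dim)              # GoogLeNet pool5 -> model dim
        self.f_cls = nn.Parameter(torch.randn(1, 1, dim))  # learnable f_cls
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, frames):                             # frames: (B, T, frame_dim)
        x = self.proj(frames)
        cls = self.f_cls.expand(x.size(0), -1, -1)
        g = self.encoder(torch.cat([cls, x], dim=1))       # (B, T+1, dim)
        return g[:, 0], g[:, 1:]                           # g0 (whole video), per-frame features

class TextEncoder(nn.Module):
    """Transformer over the concatenated [CLS] / category / query / title / description tokens."""
    def __init__(self, vocab_size=30522, dim=512, layers=2, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, token_ids):                          # token_ids: (B, L), [CLS] at position 0
        z = self.encoder(self.embed(token_ids))
        return z[:, 0], z[:, 1:]                           # z0 (text representation), per-token features
```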
Objectives
Cross-modality Semantic Correspondence
- coarse-grained modeling: predict whether the video and the text are corresponding
- input: g0, z0
- fine-grained modeling: measure the distance between the frame feature set and the text feature set
- optimized with a contrastive loss (see the sketch after this list)
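
A hedged sketch of the two correspondence objectives; the coarse-grained head follows the notes (binary prediction of correspondence from g0 and z0), while the fine-grained set distance is approximated here with mean pooling plus an InfoNCE-style loss, which is an assumption rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseGrainedHead(nn.Module):
    """Predict whether the (video, text) pair corresponds, from g0 and z0."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, g0, z0, label):                      # label: 1 = matched video/text pair
        logit = self.mlp(torch.cat([g0, z0], dim=-1)).squeeze(-1)
        return F.binary_cross_entropy_with_logits(logit, label.float())

def fine_grained_contrastive(frame_feats, text_feats, tau=0.07):
    """Contrastive loss between frame and text sets (mean pooling is an assumption here)."""
    v = F.normalize(frame_feats.mean(dim=1), dim=-1)       # (B, D) pooled frame set
    t = F.normalize(text_feats.mean(dim=1), dim=-1)        # (B, D) pooled text set
    logits = v @ t.t() / tau                                # in-batch negatives
    target = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target)) / 2
```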
Temporal Dependencies in Videos
- replace a randomly selected frame with a learnable feature m
- rather than predicting the masked frame directly from its encoded masked feature, recover the frame by considering the temporal dependencies between the masked frame and the whole video
- MLP_s predicts whether the masked position is a smooth transition: p_s = MLP_s(g_t), where g_t is the encoded feature of m
- 1) if smooth transition: recover it using only its neighbors (local info) with a one-layer Transformer and a linear projection
- 2) if abrupt transition: use g_t to recover it, since g_t contains global info
- the masked frame is recovered by combining 1) and 2) (see the sketch below)
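
A sketch of the masked-frame recovery, assuming the local and global reconstructions 1) and 2) are combined as a p_s-weighted sum; the neighbor window size and the gating form are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MaskedFrameRecovery(nn.Module):
    def __init__(self, dim=512, heads=8, window=2):
        super().__init__()
        self.window = window                                # neighbors taken on each side
        self.mlp_s = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.local_tf = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.local_proj = nn.Linear(dim, dim)
        self.global_proj = nn.Linear(dim, dim)

    def forward(self, enc_frames, mask_idx):
        # enc_frames: (B, T, D) encoded sequence; position mask_idx held the learnable token m
        g_t = enc_frames[:, mask_idx]                       # encoded feature of m
        p_s = torch.sigmoid(self.mlp_s(g_t))                # probability of a smooth transition
        lo, hi = max(0, mask_idx - self.window), mask_idx + self.window + 1
        neighbors = torch.cat(                              # local context only, excluding m itself
            [enc_frames[:, lo:mask_idx], enc_frames[:, mask_idx + 1:hi]], dim=1)
        local_rec = self.local_proj(self.local_tf(neighbors).mean(dim=1))   # 1) local recovery
        global_rec = self.global_proj(g_t)                  # 2) global recovery from g_t
        return p_s * local_rec + (1 - p_s) * global_rec     # combine 1) and 2)
```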
Progressive Summarizer

- frame scores are multiplied element-wise across stages
- s* = s1 ⊙ s2 ⊙ ... ⊙ sN
- with Text Info
- z0 from the pretrained text encoder is fused with the visual modality and fed into the scoring function
- s1 = σ((G1 + F1 + MLP_τ(z0)) W1 + b1) (see the sketch below)
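A sketch of the progressive summarizer following the notes' formulas, where G_i is taken as the stage-i encoded features and F_i as the stage input; how the sequence is re-emphasized between stages (here, scaling the encoded features by s_i) and the per-stage encoder design are assumptions.

```python
import torch
import torch.nn as nn

class ProgressiveSummarizer(nn.Module):
    def __init__(self, dim=512, n_stages=3, heads=8):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(n_stages)])
        self.score_heads = nn.ModuleList(
            [nn.Linear(dim, 1) for _ in range(n_stages)])   # W_i, b_i per stage
        self.mlp_tau = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, frames, z0):
        # frames: (B, T, D) frame features F, z0: (B, D) text representation
        s_star = torch.ones(frames.size(0), frames.size(1), device=frames.device)
        x = frames
        text = self.mlp_tau(z0).unsqueeze(1)                # MLP_τ(z0), broadcast over time
        for stage, head in zip(self.stages, self.score_heads):
            G = stage(x)                                     # encoded features at this stage
            s_i = torch.sigmoid(head(G + x + text)).squeeze(-1)   # s_i = σ((G_i + F_i + MLP_τ(z0))W_i + b_i)
            s_star = s_star * s_i                            # s* = s1 ⊙ s2 ⊙ ... ⊙ sN
            x = G * s_i.unsqueeze(-1)                        # re-emphasize important content (assumption)
        return s_star
```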
Experiment
- SumMe: the video name is used as the search query; the other three text types are left empty
- TVSum: the video name is used as the search query; category data is available; the remaining text types are left empty (see the sketch below)
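
A small sketch of how the text inputs could be assembled per dataset, following the notes; the function and field names are hypothetical.

```python
def build_text_fields(dataset: str, video_name: str, category: str = "") -> dict:
    """Assemble the four text types; unused fields stay empty as in the notes."""
    fields = {"category": "", "search_query": "", "title": "", "description": ""}
    if dataset == "SumMe":
        fields["search_query"] = video_name       # other three types left empty
    elif dataset == "TVSum":
        fields["search_query"] = video_name       # TVSum also provides category labels
        fields["category"] = category
    return fields
```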