Text part
Image part

VideoBERT
Training

Downstream Tasks

Visual and text towers are first trained separately as unimodal BERTs, then a cross-modal Transformer learns the multimodal correspondence.
The NCE loss requires no labels.
The entire model is trained end-to-end with a weighted sum of the three losses (BERT, CBT, and cross-modal).
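As a rough illustration of why the NCE loss needs no labels, here is a minimal InfoNCE-style sketch in PyTorch: the batch pairing itself supplies the positives and negatives. The function name and the loss weights are hypothetical, not values from the paper.

```python
import torch
import torch.nn.functional as F

def nce_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss: each matched (video, text) pair in the batch is
    a positive and every other pairing is a negative, so no human labels
    are required."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Hypothetical weighted sum of the three training losses (illustrative weights):
# total_loss = 1.0 * bert_loss + 1.0 * cbt_loss + 1.0 * cross_modal_loss
```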

Task: Moment Localization in Video Corpus (MLVC)
Two-stage approach: first retrieve candidate videos from the corpus, then localize the target moment within them.
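A minimal sketch of one plausible two-stage pipeline, assuming stage 1 retrieves candidate videos from the corpus and stage 2 localizes the moment inside each candidate; all function names and the clip-level granularity are hypothetical.

```python
import torch

def retrieve_videos(query_emb, video_embs, top_k=10):
    # Stage 1 (hypothetical): rank corpus videos by similarity to the
    # query embedding and keep the top-k candidates.
    scores = video_embs @ query_emb  # (num_videos,)
    return scores.topk(min(top_k, scores.numel())).indices

def localize_moment(query_emb, clip_embs):
    # Stage 2 (hypothetical): score each clip of a candidate video and
    # return the best-matching clip index as the localized moment.
    scores = clip_embs @ query_emb  # (num_clips,)
    return scores.argmax().item()
```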

Learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech (ASR), in an entirely label-free, self-supervised manner.
Learning objectives
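This description matches MERLOT, whose objectives combine contrastive frame-transcript matching, masked language modeling, and temporal reordering of shuffled frames. Below is a minimal sketch of a temporal-reordering head only; the class name, shapes, and wiring are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReorderHead(nn.Module):
    """Hypothetical head for a temporal-reordering objective: predict the
    original position of each (possibly shuffled) frame from its joint
    representation."""
    def __init__(self, hidden_dim, num_positions):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_positions)

    def forward(self, frame_states, true_positions):
        # frame_states: (B, T, hidden_dim); true_positions: (B, T), long dtype
        logits = self.classifier(frame_states)
        return F.cross_entropy(logits.flatten(0, 1), true_positions.flatten())
```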

Visual + Audio + Text
(Multimodal) Contrastive learning setting
Multimodal Projection Head
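A minimal sketch of what per-modality projection heads could look like, assuming each tower's features are projected into a common space where pairwise contrastive losses are applied; the dimensions and names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Hypothetical projection head: maps one modality's features into a
    shared embedding space used for contrastive learning."""
    def __init__(self, in_dim, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

# One head per modality; (video, audio, text) features from the same
# clip act as positives for the pairwise contrastive objectives.
video_head = ProjectionHead(in_dim=1024)
audio_head = ProjectionHead(in_dim=512)
text_head = ProjectionHead(in_dim=768)
```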

Multimodal metric learning using a large-scale paired dataset.
The text and image encoders are useful by themselves!
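This describes CLIP. A minimal sketch of its symmetric contrastive objective over a batch of paired image and text embeddings; the temperature value here is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: each image's own caption is its
    positive, every other caption in the batch is a negative, and the
    same holds in the text-to-image direction."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2
```

Because the objective only couples the two towers through a similarity matrix, the trained image and text encoders can each be reused on their own for downstream tasks, which is what makes them useful by themselves.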

📙 Lecture