Text part
Image part
VideoBERT
Training
Downstream Tasks
Visual and text towers are trained separately with unimodal BERTs, followed by a cross-modal Transformer that learns multimodal correspondence.
The NCE loss requires no labels; positives and negatives are constructed from the data itself.
The entire model is trained end-to-end on a weighted sum of all three losses (BERT, CBT, cross-modal).
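A minimal PyTorch sketch of this setup, assuming batch-aligned feature pairs; the function name `info_nce` and the weights `w_bert`, `w_cbt`, `w_cross` are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """NCE-style contrastive loss: (z_a[i], z_b[i]) is the positive pair,
    all other pairings in the batch act as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                     # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)   # pair indices, no human labels
    return F.cross_entropy(logits, targets)

# End-to-end objective: weighted sum of the three losses
# (w_bert, w_cbt, w_cross are assumed hyperparameters):
# loss = w_bert * loss_bert + w_cbt * loss_cbt + w_cross * loss_cross
```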
Task: Moment Localization in Video Corpus (MLVC)
Two-Stage approach
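One way the two stages could fit together, as a hedged Python sketch; every callable here (`encode_query`, `encode_video`, `score_moment`) and the `proposals` attribute are hypothetical placeholders, not the actual method:

```python
def localize_moment(query, videos, encode_query, encode_video,
                    score_moment, top_k=10):
    """Two-stage MLVC sketch (all callables are hypothetical).
    Stage 1: retrieve the top-k videos by global query-video similarity.
    Stage 2: score candidate temporal spans inside each retrieved video."""
    q = encode_query(query)
    ranked = sorted(videos, key=lambda v: float(q @ encode_video(v)),
                    reverse=True)
    best = None
    for video in ranked[:top_k]:
        for start, end in video.proposals:   # hypothetical candidate spans
            score = float(score_moment(q, video, start, end))
            if best is None or score > best[0]:
                best = (score, video, start, end)
    return best                              # (score, video, start, end)
```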
Learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech (ASR), in an entirely label-free, self-supervised manner.
Learning objectives
Visual + Audio + Text
(Multimodal) Contrastive learning setting
Multimodal Projection Head
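A minimal sketch of such a projection head in PyTorch; the class name and the default `embed_dim=512` are illustrative assumptions, not values from a specific paper:

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Projects modality-specific features into a shared embedding space."""
    def __init__(self, in_dim, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        # L2-normalize so dot products become cosine similarities
        return F.normalize(self.proj(x), dim=-1)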
Multimodal metric learning using a large-scale paired dataset
The text and image encoders are useful on their own!
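Because the encoders are useful on their own, a typical downstream use is zero-shot classification: embed a prompt for each class with the text encoder and pick the class nearest to the image embedding. A hedged sketch, where `image_encoder` and `text_encoder` stand in for the pretrained towers and are assumed to return L2-normalized embeddings:

```python
import torch

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder):
    """Zero-shot classification with only the pretrained encoders.
    image_encoder / text_encoder are hypothetical stand-ins assumed to
    return L2-normalized embeddings of shape (N, D)."""
    img_emb = image_encoder(image)                                      # (1, D)
    txt_emb = text_encoder([f"a photo of a {c}" for c in class_names])  # (C, D)
    sims = img_emb @ txt_emb.t()                                        # (1, C)
    return class_names[sims.argmax().item()]
```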
📙 Lecture