Existing research lacks a clear definition of active speakers
Existing models fail to model audio-visual synchronization and often classify unsynchronized videos as active speaking
Proposal
New definition that requires synchronization between audio and visual speaking activities
Cross-modal contrastive learning strategy
Positional encoding in attention modules for supervised ASD models to leverage the synchronization cue
Experimental results
Model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models
1. Introduction
Importance of active speaker detection (ASD)
ASD is crucial for various downstream tasks such as speaker recognition, speaker diarization, speech separation, and human-computer interaction
Problem
No consistent definition of active speakers in the literature
with some studies requiring synchronized speaking signals from the same person in both audio and visual modalities
while others allow non-synchronized signals from different but related persons
(e.g., Dubbed movies)
Audio-visual synchronization
Aligns with the definition in the Active Speakers in the Wild (ASW) dataset [ref]
Requiring audio-visual synchronization is more practical than not requiring it
Dubbed movies (translated movies or documentaries with a narrator)
Challenging to determine who should be considered to be speaking in the video
- The person shown in the visual scene, the unseen narrator, or both?
- What degree of relevance should be used in the definition of active speakers?
- ⇒ In contrast, it is always clear whether the audio and visual speaking signals are synchronized or not.
If the signals are not synchronized, there are generally two cases
(1) Mismatch
The audio and the face track do not correspond to the same speaking content
(2) Misalignment
The content and identity are the same for audio and face tracks, but one modality is delayed
Perceptual studies suggest that delays become detectable by ordinary viewers when they exceed 125 ms for an audio delay or 45 ms for a visual delay.
Proposal
Use cross-modal contrastive learning
Apply positional encoding in an attention module when fusing audio and visual embeddings
can temporally align
Perform better on synthesized unsynchronized videos along with natural videos.
2. Related Work
Active Speaker Detection
Utilizing the audio-visual correlation in videos
“Look who’s talking: Speaker detection using video and audio correlation,” in Proc. ICME, 2000
“Audio-visual speaker localization via weighted clustering,” in Proc. MLSP, 2014
Improving the modality encoding method
Naver at ActivityNet Challenge 2019–task B active speaker detection (AVA)
Focus on the fusion method or leverage context
“UniCon: Unified context network for robust active speaker detection,” in Proc. ACM Multimedia, 2021
“How to design a three-stage architecture for audio-visual active speaker detection in the wild,” in Proc. ICCV, 2021
⇒ none of the existing methods explicitly model audio-visual synchronization
(1) Can current models correctly label unsynchronized videos as "Not Speaking"?
(2) What do current models really learn?
Problem of existing models
Current models tend to make false-positive predictions on unsynchronized videos; they fail to detect unsynchronized videos as "Not Speaking."
3.1. Unsynchronization test by augmentation
Create unsynchronized video segments (mismatched and misaligned) from original test videos
both the AVA validation set and the ASW test set
Augmented test sets contain different proportions of unsynchronized videos but the same total number of videos
1) Mismatched video segments
How to make?
Randomly swapping the audio of speaking segments of the original videos
For each face track, replace the audio of each speaking segment with another random speaking segment from different videos
Show lip movements in the video and speaking voices in the audio, but these activities do not match
ASD labels are set as negatives
2) Misaligned video segments
How to make?
Shifting the original audio of speaking segments in time
Randomly shift the speaking segment's original audio to the left or right by a time shift greater than 125 ms, the human-detectable threshold for a delay (both augmentations are sketched below)
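A minimal Python sketch of the two augmentations, assuming each speaking segment is available as a dict holding its source video id and a raw 16 kHz waveform; the function names, sample rate, and dict layout are illustrative assumptions, not the paper's code:

```python
import random
import numpy as np

SAMPLE_RATE = 16000      # assumed audio sample rate
MIN_SHIFT_SEC = 0.125    # 125 ms, the human-detectable delay threshold

def make_mismatched(segments):
    """Replace each speaking segment's audio with audio from a different video,
    so lip movements and voice no longer match.
    Each element of `segments` is assumed to be a dict with keys
    'video_id', 'audio' (1-D np.ndarray), and 'label', drawn from multiple videos."""
    out = []
    for seg in segments:
        donor = random.choice([s for s in segments if s["video_id"] != seg["video_id"]])
        out.append(dict(seg, audio=donor["audio"].copy(), label=0))  # relabel as "Not Speaking"
    return out

def make_misaligned(seg, max_shift_sec=0.5):
    """Shift the segment's own audio left or right by more than 125 ms."""
    shift_sec = random.uniform(MIN_SHIFT_SEC, max_shift_sec)
    shift = int(shift_sec * SAMPLE_RATE) * random.choice([-1, 1])
    shifted = np.roll(seg["audio"], shift)  # simple circular shift as a stand-in
    return dict(seg, audio=shifted, label=0)
```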
Performance of existing models
Performance with five different proportions of unsynchronized videos
Both models do not properly model audio-visual synchronization
Hypothesis
Rely on individual modality features and basic audio-visual correlations to classify videos
Ignore the synchronization cue
3.2. Understanding what existing ASD models learn
What do existing ASD models learn?
Remove key information from audio and visual tracks
Silencing the audio tracks
Masking the bottom 30% of visual frames of each face track with zeros to cover the lips (both ablations are sketched below)
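A small sketch of the two ablations, assuming face tracks are numpy arrays of shape (T, H, W) or (T, H, W, C) and audio is a raw waveform array; both helpers are illustrative, not the authors' implementation:

```python
import numpy as np

def silence_audio(waveform):
    """Replace the audio track with silence (all zeros)."""
    return np.zeros_like(waveform)

def mask_lips(face_frames, ratio=0.3):
    """Zero out the bottom `ratio` of every face crop to hide the lips.
    `face_frames` is assumed to have shape (T, H, W) or (T, H, W, C)."""
    masked = face_frames.copy()
    h = face_frames.shape[1]
    masked[:, int(h * (1 - ratio)):] = 0
    return masked
```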
Performance of existing models
Both RothNet and TalkNet deteriorate dramatically in both cases
Showing that both models use voice activity and lip movement information for ASD
Combined model
Then, the authors train a voice activity detection (VAD) model and a lip movement detection model modified from the audio and visual frontends of TalkNet
The probability of speaking is calculated as the product of the probabilities predicted by the two models
The combined model's mAP in the AVA val set is 90.72%, close to that of TalkNet, indicating that using only a VAD and a lip movement detection model is able to perform comparably with the SoTA ASD models
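A sketch of how such a product combination could look, assuming the two single-modality models each return per-frame speaking probabilities; the interfaces are hypothetical:

```python
import torch

@torch.no_grad()
def combined_speaking_prob(vad_model, lip_model, audio_feats, face_frames):
    """Frame-level speaking probability as the product of a voice-activity
    probability and a lip-movement probability. Both models are assumed to
    return per-frame probabilities of shape (T,)."""
    p_voice = vad_model(audio_feats)   # P(voice active | audio)
    p_lips = lip_model(face_frames)    # P(lips moving  | video)
    return p_voice * p_lips            # the two cues are treated as independent
```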
4. Method
4.1. Cross-modal contrastive learning
Objective
To address the lack of unsynchronized data in the training dataset
Method
Augment the features in the embedding space to enforce contrastive learning
Positive samples (the set $\Gamma$)
Face tracks with at least one $y_i^t = 1$
Negative samples
Randomly exchanging the audio embeddings $A_\gamma$ of a face track with the audio embeddings $A_{\phi(\gamma)}$ of another positive sample
$\gamma$ and $\phi(\gamma)$ are indices of two randomly selected face tracks from $\Gamma$, and $\phi(\gamma)$ is different from $\gamma$
Mathematically, the additional contrastive samples are $(V_\gamma, A_{\phi(\gamma)}, y_\gamma)$, where $\gamma, \phi(\gamma) \in \Gamma$ and $\phi(\gamma) \neq \gamma$.
We only use the face tracks which contain positive frames for contrastive learning
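A minimal sketch of constructing these contrastive pairs in the embedding space, assuming per-track embeddings of shape (N, T, D) and an index tensor for the tracks in $\Gamma$; how the pairs enter the training loss is not shown, and the interface is an assumption:

```python
import torch

def make_contrastive_pairs(visual_emb, audio_emb, labels, pos_idx):
    """Construct the additional contrastive samples (V_gamma, A_phi(gamma), y_gamma):
    each positive face track keeps its visual embeddings and labels but receives
    the audio embeddings of a different, randomly chosen positive track.
    visual_emb, audio_emb: (N, T, D); labels: (N, T);
    pos_idx: 1-D LongTensor of indices of the tracks in Gamma (at least 2 tracks)."""
    perm = pos_idx[torch.randperm(len(pos_idx))]
    while (perm == pos_idx).any():     # resample until phi(gamma) != gamma for every track
        perm = pos_idx[torch.randperm(len(pos_idx))]
    return visual_emb[pos_idx], audio_emb[perm], labels[pos_idx]
```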
The cross-modal features are computed with a cross-attention module, and $F_i^{a \to v}$ and $F_i^{v \to a}$ are concatenated and passed through the self-attention layer.
Positional Encoding (PE)
The proposed method adds positional encoding to the attention modules to leverage synchronization cues.
Without positional encoding, the cross-attention layer is permutation-invariant to its inputs, which makes it difficult for the model to learn the synchronization between the visual and audio signals
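A sketch of a cross-attention fusion block with sinusoidal positional encoding, in the spirit of the description above; the module layout, dimensions, and exact attention directions are assumptions rather than the paper's architecture:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(seq_len, dim, device=None):
    """Standard sinusoidal positional encoding of shape (seq_len, dim); dim must be even."""
    pos = torch.arange(seq_len, device=device, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, device=device, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim, device=device)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class SyncFusion(nn.Module):
    """Cross-attention in both directions, then self-attention over the
    concatenated cross-modal features; positional encoding makes the
    attention layers sensitive to frame order."""
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.cross_av = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_va = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(2 * dim, heads, batch_first=True)

    def forward(self, audio, visual):              # both (B, T, dim), frame-aligned
        pe = sinusoidal_pe(audio.size(1), audio.size(2), audio.device)
        a, v = audio + pe, visual + pe             # inject timeline information
        f_av, _ = self.cross_av(query=v, key=a, value=a)   # one cross-modal direction
        f_va, _ = self.cross_va(query=a, key=v, value=v)   # the other direction
        fused = torch.cat([f_av, f_va], dim=-1)            # (B, T, 2*dim)
        fused = fused + sinusoidal_pe(fused.size(1), fused.size(2), fused.device)
        out, _ = self.self_attn(fused, fused, fused)
        return out
```

Adding the positional encoding before both the cross-attention and the self-attention is what breaks permutation invariance, which is why shifting the audio relative to the video changes the fused features.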
5. Experiment
5.1 Performance of the proposed method
Unsynchronization test
The results show that Sync-TalkNet and Sync-RothNet outperform the two baseline models, RothNet and TalkNet, in the augmented test sets.
Sync-TalkNet achieves better results than TalkNet on the ASW test set and slightly lower results on the AVA val set.
The proposed method leverages both the advantages of supervised and self-supervised ASD models, achieving excellent performance on both original and unsynchronized augmented datasets.
Narrated video detection
The model trained on the ASW dataset is used to detect unsynchronized dubbed movies in the AVA validation set
True positive rate (TPR)
the ratio of positively predicted frames to positively labeled frames
a lower TPR indicates that the video is more likely from a dubbed movie
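As a formula, with counts taken over the frames of a face track:

```latex
\mathrm{TPR} = \frac{\#\{\text{frames predicted as speaking}\}}{\#\{\text{frames labeled as speaking}\}}
```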
Manual checking confirmed that the three videos with the lowest TPR are dubbed movies, while the three with the highest TPR are not dubbed movies.
Suggests that Sync-TalkNet may be useful for detecting unsynchronized dubbed videos
5.2 Ablation study
The effects of positional encoding and cross-modal contrastive learning
Cross-modal contrastive learning is crucial for learning synchronization
Removing positional encoding causes a performance drop, but not a catastrophic one
As the model can still learn weak timeline information with the guidance of contrastive learning
Impact of applying positional encoding on both cross-attention and self-attention modules, with results indicating that doing so helps Sync-TalkNet better perceive timeline information