MSA - predict sentiment intensity or polarity (sentiments persist over longer periods)
ERC - predict predefined emotion categories (emotions are short-lived)
=> sentiments and emotions are closely related, and could be projected into a unified embedding space.
UniMSE (Unified MSA and ERC)
MSA & ERC labels => Universal Labels (UL)
Pre-trained modality fusion layer (PMF)
Embed PMF into T5, fusing acoustic and visual information with multi-level textual features
Inter-modal contrastive learning (CL) => minimize intra-class variance and maximize inter-class variance across modalities.
Contribution (summarized)
UniMSE that unifies MSA and ERC tasks
Fuse multimodal representation from multi-level textual information
(Injecting A & V signals into the T5 model.)
SOTA on public benchmark datasets (MOSI, MOSEI, MELD, IEMOCAP)
First to solve MSA and ERC generatively + first to use unified A & V features across MSA and ERC
2. Related Works
Multimodal Sentiment Analysis (MSA)
Emotion Recognition in Conversations (ERC)
Unified Framework
3. Method
3.1 Overall Architecture
Task formalization
Process MSA & ERC labels into universal label (UL) format (offline)
Pre-trained Modality Fusion
Unified feature extractors among datasets => Audio and Video features
Two individual LSTMs capture long-term contextual information from audio and video (see the sketch after this list)
T5 as encoder for the textual modality
Embed multimodal fusion layers into T5
The PMF unit follows the FFN in some of T5's Transformer layers
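A minimal sketch of the A-LSTM / V-LSTM context encoders described above; the class name and feature dimensions are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ModalityLSTM(nn.Module):
    """Encodes a frame-level acoustic or visual sequence; the hidden state
    of the last time step is later fused with T5's textual states."""
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, x):                 # x: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(x)        # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)             # last-time-step hidden state

# assumed feature sizes, for illustration only
a_lstm = ModalityLSTM(feat_dim=64, hidden_dim=32)    # acoustic stream
v_lstm = ModalityLSTM(feat_dim=1280, hidden_dim=32)  # visual stream
```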
Inter-modal contrastive learning
Differentiate the multimodal fusion representations among samples
Narrow the gap between modalities of the same sample and push the modality representations of different samples further apart.
3.2 Task Formalization
Multimodal signal $I_i = \{I_i^t, I_i^a, I_i^v\}$ (video fragment $i$; $\{t, a, v\}$ denote text, acoustic, visual)
MSA -> predict sentiment strength $y_i^r \in \mathbb{R}$
ERC -> predict emotion category of each utterance
Formalize with input formalization and label formalization step
3.2.1 Input formalization
Concatenate the current utterance $u_i$ with its 2 former and 2 latter ones
$I_i^t = [u_{i-2}, u_{i-1}, u_i, u_{i+1}, u_{i+2}]$
$S_i^t = [\underbrace{0,\cdots,0}_{u_{i-2},\,u_{i-1}},\ \underbrace{1,\cdots,1}_{u_i},\ \underbrace{0,\cdots,0}_{u_{i+1},\,u_{i+2}}]$ - segment IDs (1 marks tokens of the current utterance)
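A toy sketch of this context-window construction and the segment IDs; whitespace splitting stands in for the real tokenizer:

```python
def build_input(utterances, i, window=2):
    """Concatenate the current utterance with its two former and two latter
    neighbours; segment IDs mark tokens of the current utterance with 1."""
    lo, hi = max(0, i - window), min(len(utterances), i + window + 1)
    tokens, seg_ids = [], []
    for j in range(lo, hi):
        toks = utterances[j].split()          # stand-in for a real tokenizer
        tokens += toks
        seg_ids += [1 if j == i else 0] * len(toks)
    return tokens, seg_ids
```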
Raw acoustic input -> librosa => extract Mel-spectrogram as audio features
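A minimal sketch of this step with librosa; the sampling rate and n_mels are assumptions, not the paper's settings:

```python
import librosa
import numpy as np

def acoustic_features(wav_path: str, n_mels: int = 64) -> np.ndarray:
    """Load a waveform and return a log Mel-spectrogram (time x n_mels)."""
    y, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).T
```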
Video -> extract T frames from each segment -> EfficientNet (pre-trained on the VGGface and AFEW datasets) -> video features
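A rough sketch of per-frame visual feature extraction; the backbone below is a generic ImageNet-pretrained EfficientNet from timm, used only as a stand-in for the paper's VGGface/AFEW-pretrained model:

```python
import torch
import timm

# Stand-in backbone (ImageNet weights); the paper uses an EfficientNet
# pre-trained on VGGface and AFEW.
backbone = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0)
backbone.eval()

def visual_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, 224, 224) sampled from one segment -> (T, feat_dim)."""
    with torch.no_grad():
        return backbone(frames)
```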
Classify samples of MSA and ERC into positive, neutral, and negative according to their sentiment polarity
Calculate the similarity of two samples with the same sentiment polarity but belonging to different annotation schemes.
Similarity -> textual similarity with SimCSE (a sentence-embedding framework)
- Since previous works demonstrated that the textual modality is more indicative than the other modalities.
Evaluate performance of generated labels => manual evaluation, accuracy about 90%.
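As an illustration, the similarity matching could look roughly like this; the SimCSE checkpoint name and the [CLS] pooling are assumptions, the paper only specifies SimCSE-based textual similarity:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed SimCSE checkpoint; any sentence-embedding model would do here.
name = "princeton-nlp/sup-simcse-bert-base-uncased"
tok, enc = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)

@torch.no_grad()
def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    cls = enc(**batch).last_hidden_state[:, 0]           # [CLS] embeddings
    return torch.nn.functional.normalize(cls, dim=-1)

def most_similar(query: str, candidates: list) -> int:
    """Index of the same-polarity candidate (from the other annotation
    scheme) whose text is closest to the query under SimCSE similarity."""
    sims = embed([query]) @ embed(candidates).T           # cosine similarities
    return int(sims.argmax())
```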
3.3 Pre-trained Modality Fusion (PMF)
Embedding multimodal fusion layers into the pre-trained model.
Acoustic and visual signals used with multiple levels of textual information.
PMF unit in the Transformer layer of T5 receives a triplet $M_i = (X_i^t, X_i^a, X_i^v)$ -> maps the multimodal concatenation back to the layer's input size.
Multimodal fusion for the j-th PMF unit:
$F_i = [F_i^{(j-1)} \oplus X_i^{a,l_a} \oplus X_i^{v,l_v}]$
$F_i^d = \sigma(W^d F_i + b^d)$
$F_i^u = W^u F_i^d + b^u$
$F_i^{(j)} = W(F_i^u \odot F_i^{(j-1)})$
$X_i^{a,l_a}$: hidden state of the last time step of A-LSTM
$X_i^{v,l_v}$: hidden state of the last time step of V-LSTM
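A minimal PyTorch sketch of one PMF unit following these equations; the bottleneck size and the broadcast of the A/V vectors over the token sequence are assumptions:

```python
import torch
import torch.nn as nn

class PMF(nn.Module):
    """Sketch of one PMF unit; dimensions are illustrative."""
    def __init__(self, d_model: int, d_a: int, d_v: int, d_down: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model + d_a + d_v, d_down)   # W^d, b^d
        self.up   = nn.Linear(d_down, d_model)                # W^u, b^u
        self.out  = nn.Linear(d_model, d_model, bias=False)   # W

    def forward(self, f_prev, x_a, x_v):
        # f_prev: (B, L, d_model); x_a: (B, d_a); x_v: (B, d_v)
        L = f_prev.size(1)
        av = torch.cat([x_a, x_v], dim=-1).unsqueeze(1).expand(-1, L, -1)
        f = torch.cat([f_prev, av], dim=-1)     # F = [F^(j-1) ⊕ X^a ⊕ X^v]
        f_d = torch.sigmoid(self.down(f))       # F^d = σ(W^d F + b^d)
        f_u = self.up(f_d)                      # F^u = W^u F^d + b^u
        return self.out(f_u * f_prev)           # F^(j) = W (F^u ⊙ F^(j-1))
```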
Solution regarding two shortcomings of fusion
Can disturb the encoding of the text sequence
Can cause overfitting, since the multimodal fusion layers introduce extra parameters
=> Solution: use the first j Transformer layers to encode text only and inject the non-verbal signals into the remaining layers.
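A hypothetical wrapper illustrating this late-injection control flow; the real T5 layers also take attention masks etc., which are omitted here:

```python
import torch.nn as nn

class EncoderWithLateFusion(nn.Module):
    """Hypothetical wrapper: the first `j_text` Transformer layers encode
    text only; a PMF unit is applied after each of the remaining layers."""
    def __init__(self, layers: nn.ModuleList, pmf_units: nn.ModuleList, j_text: int):
        super().__init__()
        self.layers, self.pmfs, self.j_text = layers, pmf_units, j_text

    def forward(self, h, x_a, x_v):
        for idx, layer in enumerate(self.layers):
            h = layer(h)                          # text-only encoding
            if idx >= self.j_text:                # inject A/V only in late layers
                h = self.pmfs[idx - self.j_text](h, x_a, x_v)
        return h
```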
3.4 Inter-modality Contrastive Learning
Contrastive Learning (CL)
CL has advanced representation learning by viewing each sample from multiple views.
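For intuition, an InfoNCE-style sketch of such an inter-modality objective, treating the representations of the same sample in two modalities as a positive pair and other samples in the batch as negatives; the paper's exact loss construction may differ:

```python
import torch
import torch.nn.functional as F

def inter_modal_contrastive_loss(z_t, z_a, temperature: float = 0.1):
    """z_t, z_a: (B, d) representations of the same batch in two modalities.
    Matched rows (same sample) are pulled together; mismatched rows are
    pushed apart."""
    z_t, z_a = F.normalize(z_t, dim=-1), F.normalize(z_a, dim=-1)
    logits = z_t @ z_a.T / temperature                  # (B, B) similarities
    targets = torch.arange(z_t.size(0), device=z_t.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)
```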