Audio–Visual Fusion for Emotion Recognition in the Valence–Arousal Space Using Joint Cross-Attention  
Finally, mean and variance normalization is performed on the spectrogram.
Apart from mean and variance normalization, no other voice-specific processing, such as silence removal or noise filtering, is performed.
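The normalization step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the excerpt does not say whether the statistics are computed over the whole spectrogram or per frequency bin, so the function below exposes both as an assumption. The name `normalize_spectrogram` and the `eps` and `per_bin` parameters are hypothetical.

```python
import numpy as np


def normalize_spectrogram(spec, per_bin=False, eps=1e-8):
    """Mean and variance normalization of a (freq_bins, frames) spectrogram.

    Assumption: per_bin=False uses one global mean/std for the whole
    spectrogram; per_bin=True normalizes each frequency bin over time.
    The source excerpt does not specify which variant is used.
    """
    spec = np.asarray(spec, dtype=np.float64)
    if per_bin:
        mean = spec.mean(axis=1, keepdims=True)
        std = spec.std(axis=1, keepdims=True)
    else:
        mean = spec.mean()
        std = spec.std()
    # eps guards against division by zero for constant inputs.
    return (spec - mean) / (std + eps)
```

After this step the spectrogram has approximately zero mean and unit variance, which is the only audio-specific preprocessing the first paper reports.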
A Visual–Audio-Based Emotion Recognition System Integrating Dimensional Analysis
The lower part of the graph is the audio path. Here, the extracted audio stream is first preprocessed (for example, with noise reduction and silence removal), and the preprocessed stream is then sent to the feature extraction module.
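For contrast with the first paper, the silence-removal step mentioned here could look roughly like the sketch below. The excerpt does not describe the method the authors actually use, so this is only an assumed, simple energy-threshold approach; the function name `remove_silence`, the frame length, and the `threshold_db` value are all hypothetical choices for illustration.

```python
import numpy as np


def remove_silence(signal, frame_len=400, threshold_db=-40.0):
    """Drop low-energy frames from a mono signal (assumed approach).

    Frames whose RMS level is more than |threshold_db| dB below the
    loudest frame are treated as silence and discarded. Samples past
    the last full frame are dropped for simplicity.
    """
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return signal
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    peak = rms.max()
    if peak == 0:
        # The whole signal is silent.
        return signal[:0]
    # Level of each frame relative to the loudest frame, in dB.
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-12) / peak)
    keep = level_db > threshold_db
    return frames[keep].reshape(-1)
```

A relative threshold is used so the sketch does not depend on the absolute recording level; a real system would likely add smoothing so that short pauses inside speech are not cut.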