Paper in Workshop: 8th Workshop and Competition on Affective & Behavior Analysis in-the-wild

Multimodal Emotion Prediction in Interpersonal Videos Integrating Facial and Speech Cues

Hajer Guerdelli · Claudio Ferrari · Stefano Berretti · Alberto Del Bimbo


Abstract:

Emotion prediction is essential for affective computing applications, including human-computer interaction and social behavior analysis. In interpersonal settings, accurately predicting emotional states is crucial for modeling social dynamics. We propose a multimodal framework that integrates facial expressions and speech cues to improve emotion prediction in interpersonal video interactions. Facial features are extracted with a deep attention-based network, while speech is encoded using Wav2Vec 2.0. The resulting multimodal features are modeled temporally with a Long Short-Term Memory (LSTM) network. To adapt the IMEmo dataset for multimodal learning, we introduce a novel speech-feature alignment strategy that synchronizes facial and vocal expressions. Our approach investigates the impact of multimodal fusion on emotion prediction, demonstrating its effectiveness in capturing complex emotional dynamics. Experiments show that our framework improves sentiment classification accuracy by over 17% compared to facial-only baselines. While fine-grained emotion recognition remains challenging, our results highlight the robustness and generalizability of our method in real-world interpersonal scenarios.
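To make the pipeline described above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the feature dimensions (512 for the face network, 768 for Wav2Vec 2.0 base), the linear-interpolation alignment, the concatenation-based fusion, and the three-way sentiment head are all illustrative assumptions.

```python
# Hedged sketch of the multimodal pipeline: align Wav2Vec 2.0 speech features
# to the video frame rate, fuse them with per-frame face embeddings, and model
# the sequence with an LSTM. Dimensions and the alignment scheme are assumptions.
import torch
import torch.nn as nn


def align_speech_to_video(speech_feats: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Resample a (T_speech, D) speech-feature sequence to num_frames steps by
    linear interpolation, so each video frame has a matching speech vector."""
    # F.interpolate expects (N, C, L): add a batch dim and put time on the last axis
    x = speech_feats.t().unsqueeze(0)                            # (1, D, T_speech)
    x = nn.functional.interpolate(x, size=num_frames, mode="linear", align_corners=False)
    return x.squeeze(0).t()                                      # (num_frames, D)


class MultimodalLSTM(nn.Module):
    def __init__(self, face_dim=512, speech_dim=768, hidden=256, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(face_dim + speech_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, face_seq: torch.Tensor, speech_seq: torch.Tensor) -> torch.Tensor:
        # face_seq: (B, T, face_dim); speech_seq: (B, T, speech_dim), pre-aligned
        fused = torch.cat([face_seq, speech_seq], dim=-1)        # frame-level fusion
        _, (h_n, _) = self.lstm(fused)
        return self.head(h_n[-1])                                # sentiment logits


# Toy usage: 40 video frames paired with 80 Wav2Vec 2.0 hidden states
face = torch.randn(40, 512)                                      # stand-in face embeddings
speech = torch.randn(80, 768)                                    # stand-in speech features
speech_aligned = align_speech_to_video(speech, num_frames=40)
model = MultimodalLSTM()
logits = model(face.unsqueeze(0), speech_aligned.unsqueeze(0))
print(logits.shape)                                              # torch.Size([1, 3])
```

The interpolation step mirrors the role of the paper's speech-feature alignment (Wav2Vec 2.0 emits features at roughly 50 Hz, faster than typical video frame rates), though the authors' exact alignment strategy may differ.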
