A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
arXiv:2602.23300v1 Announce Type: new Abstract: Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide ablations highlighting the contribution of each component of the proposed approach.
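The core fusion step described in the abstract, a learned gate that dynamically weighs the outputs of the speech-only, text-only, and cross-modal experts, can be illustrated with a minimal sketch. All function names, the number of emotion classes, and the toy logits below are illustrative assumptions, not the paper's implementation; in the actual system the gate logits would come from a trained gating network.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_moe_fusion(expert_logits, gate_logits):
    """Combine per-expert class logits with a softmax gate.

    expert_logits: (num_experts, num_classes) raw scores from the
                   speech-only, text-only, and cross-modal experts.
    gate_logits:   (num_experts,) scores from the gating network.
    """
    gate = softmax(gate_logits)                     # expert weights, sum to 1
    expert_probs = softmax(expert_logits, axis=-1)  # per-expert distributions
    return gate @ expert_probs                      # weighted mixture

# Three experts, four emotion classes (toy numbers).
logits = np.array([[2.0, 0.5, 0.1, 0.0],
                   [1.0, 1.5, 0.2, 0.1],
                   [0.5, 2.5, 0.3, 0.2]])
gate = np.array([0.2, 0.3, 2.0])  # gate favors the third (cross-modal) expert
fused = gated_moe_fusion(logits, gate)
```

Because the gate weights and each expert's distribution both sum to one, the fused output is itself a valid probability distribution over emotion classes.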
Executive Summary
This article proposes a novel Mixture-of-Experts (MoE) framework, MiSTER-E, for Multimodal Emotion Recognition in Conversations (ERC). The framework decouples two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. Experiments on three benchmark datasets demonstrate MiSTER-E's superiority over several baseline speech-text ERC systems. The proposed approach has the potential to improve the accuracy of ERC systems and has important implications for applications such as human-computer interaction and sentiment analysis.
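The abstract also mentions a KL-divergence-based regularization across expert predictions. One plausible form of such a regularizer (an assumption; the paper's exact formulation may differ) is the mean pairwise KL divergence between the experts' predictive distributions, which penalizes experts that disagree sharply:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    # KL(p || q) with clipping to avoid log(0).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def expert_consistency_loss(expert_probs):
    """Mean pairwise KL divergence between expert predictive
    distributions -- one plausible form of the cross-expert
    regularizer described in the abstract."""
    n = len(expert_probs)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += kl_div(expert_probs[i], expert_probs[j])
                pairs += 1
    return total / pairs

# Toy predictions from three experts over three emotion classes.
probs = [np.array([0.7, 0.2, 0.1]),
         np.array([0.6, 0.3, 0.1]),
         np.array([0.2, 0.7, 0.1])]
reg = expert_consistency_loss(probs)
```

The regularizer is zero only when all experts agree exactly, so minimizing it alongside the classification loss nudges the experts toward consistent predictions without forcing them to be identical.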
Key Points
- ▸ MiSTER-E is a modular Mixture-of-Experts (MoE) framework for Multimodal Emotion Recognition in Conversations (ERC).
- ▸ The framework decouples modality-specific context modeling and multimodal information fusion.
- ▸ MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings.
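The supervised contrastive loss between paired speech-text representations mentioned in the abstract resembles a standard bidirectional InfoNCE-style objective. The sketch below is an assumed form (the function name, temperature, and exact loss may differ from the paper): rows of the two embedding matrices that share an index are treated as positive pairs, and all other rows in the batch act as negatives.

```python
import numpy as np

def paired_contrastive_loss(speech_emb, text_emb, temperature=0.1):
    """InfoNCE-style loss pulling paired speech/text embeddings together.

    speech_emb, text_emb: (batch, dim); row i of each encodes the same
    utterance. Assumed form, not the paper's exact loss.
    """
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = s @ t.T / temperature  # (batch, batch) scaled cosine similarities
    # Cross-entropy with the diagonal (true pairs) as targets, both directions.
    loss_s2t = np.mean(np.log(np.exp(sim).sum(axis=1)) - np.diag(sim))
    loss_t2s = np.mean(np.log(np.exp(sim.T).sum(axis=1)) - np.diag(sim))
    return 0.5 * (loss_s2t + loss_t2s)

rng = np.random.default_rng(0)
s = rng.normal(size=(8, 16))
aligned = paired_contrastive_loss(s, s)         # matching pairs: low loss
shuffled = paired_contrastive_loss(s, s[::-1])  # broken pairs: high loss
```

Minimizing this loss pushes each utterance's speech and text embeddings toward each other and away from other utterances in the batch, which is one way to realize the cross-modal alignment the framework targets.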
Merits
Strength in Multimodal Fusion
MiSTER-E effectively integrates predictions from multiple modalities, allowing for better understanding of emotional cues in conversations.
Decoupling Modality-Specific Context Modeling
The framework's modular design enables the separate modeling of modality-specific context, which improves the accuracy of ERC systems.
Demerits
Dependence on Large Language Models
The performance of MiSTER-E heavily relies on the quality of the fine-tuned large language models, which may limit its applicability in scenarios where such models are not readily available.
Complexity of the Framework
The proposed approach involves a complex combination of models and layers, which may make it challenging to implement and optimize.
Expert Commentary
The proposed MiSTER-E framework is a significant contribution to the field of Multimodal Emotion Recognition in Conversations. The framework's modular design and effective multimodal fusion mechanism demonstrate a clear understanding of the challenges in ERC. However, the dependence on large language models and the complexity of the framework may limit its applicability and adoption. Nevertheless, the proposed approach has the potential to improve the accuracy of ERC systems and has important implications for various applications. Future research should focus on addressing the limitations of the framework and exploring its applications in real-world scenarios.
Recommendations
- ✓ Further research should be conducted to investigate the robustness of MiSTER-E in scenarios with limited availability of fine-tuned large language models.
- ✓ The development of more efficient and scalable implementations of the proposed framework is essential for its widespread adoption in practical applications.