A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
arXiv:2602.23300v1 Announce Type: new Abstract: Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide ablations highlighting the contribution of each component of the proposed approach.
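The core fusion step described in the abstract, a learned gate that dynamically weighs the outputs of the speech-only, text-only, and cross-modal experts, can be illustrated with a minimal sketch. All function names, the number of emotion classes, and the toy logits below are illustrative assumptions, not the paper's implementation; in the actual system the gate logits would come from a trained gating network.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_moe_fusion(expert_logits, gate_logits):
    """Combine per-expert class logits with a softmax gate.

    expert_logits: (num_experts, num_classes) raw scores from the
                   speech-only, text-only, and cross-modal experts.
    gate_logits:   (num_experts,) scores from the gating network.
    """
    gate = softmax(gate_logits)                     # expert weights, sum to 1
    expert_probs = softmax(expert_logits, axis=-1)  # per-expert distributions
    return gate @ expert_probs                      # weighted mixture

# Three experts, four emotion classes (toy numbers).
logits = np.array([[2.0, 0.5, 0.1, 0.0],
                   [1.0, 1.5, 0.2, 0.1],
                   [0.5, 2.5, 0.3, 0.2]])
gate = np.array([0.2, 0.3, 2.0])  # gate favors the third (cross-modal) expert
fused = gated_moe_fusion(logits, gate)
```

Because the gate weights and each expert's distribution both sum to one, the fused output is itself a valid probability distribution over emotion classes.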
Executive Summary
This article proposes a novel Mixture-of-Experts (MoE) framework, MiSTER-E, for Multimodal Emotion Recognition in Conversations (ERC). The framework decouples two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. Experiments on three benchmark datasets demonstrate MiSTER-E's superiority over several baseline speech-text ERC systems. The proposed approach has the potential to improve the accuracy of ERC systems and has important implications for applications such as human-computer interaction and sentiment analysis.
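The abstract also mentions a KL-divergence-based regularization across expert predictions. One plausible form of such a regularizer (an assumption; the paper's exact formulation may differ) is the mean pairwise KL divergence between the experts' predictive distributions, which penalizes experts that disagree sharply:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    # KL(p || q) with clipping to avoid log(0).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def expert_consistency_loss(expert_probs):
    """Mean pairwise KL divergence between expert predictive
    distributions -- one plausible form of the cross-expert
    regularizer described in the abstract."""
    n = len(expert_probs)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += kl_div(expert_probs[i], expert_probs[j])
                pairs += 1
    return total / pairs

# Toy predictions from three experts over three emotion classes.
probs = [np.array([0.7, 0.2, 0.1]),
         np.array([0.6, 0.3, 0.1]),
         np.array([0.2, 0.7, 0.1])]
reg = expert_consistency_loss(probs)
```

The regularizer is zero only when all experts agree exactly, so minimizing it alongside the classification loss nudges the experts toward consistent predictions without forcing them to be identical.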
Key Points
- ▸ MiSTER-E is a modular Mixture-of-Experts (MoE) framework for Multimodal Emotion Recognition in Conversations (ERC).
- ▸ The framework decouples modality-specific context modeling and multimodal information fusion.
- ▸ MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings.
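The supervised contrastive loss between paired speech-text representations mentioned in the abstract resembles a standard bidirectional InfoNCE-style objective. The sketch below is an assumed form (the function name, temperature, and exact loss may differ from the paper): rows of the two embedding matrices that share an index are treated as positive pairs, and all other rows in the batch act as negatives.

```python
import numpy as np

def paired_contrastive_loss(speech_emb, text_emb, temperature=0.1):
    """InfoNCE-style loss pulling paired speech/text embeddings together.

    speech_emb, text_emb: (batch, dim); row i of each encodes the same
    utterance. Assumed form, not the paper's exact loss.
    """
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = s @ t.T / temperature  # (batch, batch) scaled cosine similarities
    # Cross-entropy with the diagonal (true pairs) as targets, both directions.
    loss_s2t = np.mean(np.log(np.exp(sim).sum(axis=1)) - np.diag(sim))
    loss_t2s = np.mean(np.log(np.exp(sim.T).sum(axis=1)) - np.diag(sim))
    return 0.5 * (loss_s2t + loss_t2s)

rng = np.random.default_rng(0)
s = rng.normal(size=(8, 16))
aligned = paired_contrastive_loss(s, s)         # matching pairs: low loss
shuffled = paired_contrastive_loss(s, s[::-1])  # broken pairs: high loss
```

Minimizing this loss pushes each utterance's speech and text embeddings toward each other and away from other utterances in the batch, which is one way to realize the cross-modal alignment the framework targets.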
Merits
Strength in Multimodal Fusion
MiSTER-E effectively integrates predictions from multiple modalities, allowing for better understanding of emotional cues in conversations.
Decoupling Modality-Specific Context Modeling
The framework's modular design enables the separate modeling of modality-specific context, which improves the accuracy of ERC systems.
Demerits
Dependence on Large Language Models
The performance of MiSTER-E heavily relies on the quality of the fine-tuned large language models, which may limit its applicability in scenarios where such models are not readily available.
Complexity of the Framework
The proposed approach involves a complex combination of models and layers, which may make it challenging to implement and optimize.
Expert Commentary
The proposed MiSTER-E framework is a significant contribution to the field of Multimodal Emotion Recognition in Conversations. The framework's modular design and effective multimodal fusion mechanism demonstrate a clear understanding of the challenges in ERC. However, the dependence on large language models and the complexity of the framework may limit its applicability and adoption. Nevertheless, the proposed approach has the potential to improve the accuracy of ERC systems and has important implications for various applications. Future research should focus on addressing the limitations of the framework and exploring its applications in real-world scenarios.
Recommendations
- ✓ Further research should be conducted to investigate the robustness of MiSTER-E in scenarios with limited availability of fine-tuned large language models.
- ✓ The development of more efficient and scalable implementations of the proposed framework is essential for its widespread adoption in practical applications.