MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents
arXiv:2602.21941v1 Abstract: Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions. However, existing studies still rely on purely textual benchmarks to evaluate the text responses of MRPAs, while delegating the assessment of their multimodal expressions solely to modality-synthesis metrics. This evaluation paradigm, on the one hand, entangles semantic assessment with modality generation, leading to ambiguous error attribution, and on the other hand remains constrained by a heavy reliance on human judgment. To this end, we propose MERRY, a semantically decoupled evaluation framework for assessing Multimodal Emotional and Role consistencies of Role-playing agents. The framework introduces five refined metrics for emotional consistency (EC) and three for role consistency (RC). Notably, we transform the traditional subjective scoring approach into a novel bidirectional evidence-finding task, significantly improving the human agreement of LLM-as-Judge evaluations. Based on MERRY, we conduct extensive evaluations. Our empirical results primarily reveal that: (1) Training on synthetic datasets tends to reduce emotional consistency, whereas training on real-world datasets improves it; (2) Existing models suffer from emotional templatization and simplification, exhibiting a positive bias and a performance bottleneck on fine-grained negative emotions; (3) Simple prompting methods strengthen weak models but constrain strong ones, while simple fine-tuning suffers from poor role generalization. Code and datasets are available.
Executive Summary
This article presents MERRY, a novel evaluation framework for assessing the multimodal emotional and role consistencies of role-playing agents. The framework introduces five refined metrics for emotional consistency and three for role consistency, and recasts subjective scoring as a bidirectional evidence-finding task that substantially improves the agreement between LLM-as-Judge evaluations and human judgments. The authors conduct extensive evaluations using MERRY, revealing that training on synthetic datasets tends to reduce emotional consistency, while training on real-world datasets improves it. They also identify weaknesses in existing models, including emotional templatization and simplification, as well as limitations of simple prompting and fine-tuning methods. The study contributes to the development of more accurate and reliable evaluation frameworks for multimodal role-playing agents.
Key Points
- ▸ MERRY introduces a semantically decoupled evaluation framework for multimodal role-playing agents
- ▸ The framework includes five refined metrics for emotional consistency and three for role consistency
- ▸ Training on synthetic datasets can reduce emotional consistency, while training on real-world datasets improves it
Merits
Strength in Evaluation Framework
MERRY decouples semantic assessment from modality generation, so that errors can be attributed unambiguously to either the content of a response or the quality of its synthesized modality; a sketch of this split follows.
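To make the decoupling concrete, here is a minimal sketch of how such a pipeline could be structured. All names and the placeholder scorers are illustrative assumptions, not the paper's actual metrics or API; the point is only that semantic judgments (what is expressed) and synthesis metrics (how well it is rendered) never mix.

```python
from dataclasses import dataclass

@dataclass
class MRPAResponse:
    text: str              # the agent's textual reply
    audio_transcript: str  # semantic content recovered from the speech channel
    speech_quality: float  # e.g. a MOS-style synthesis score from an external tool


def judge_semantics(content: str, persona: str) -> dict:
    # Placeholder for an LLM-as-Judge call; a real implementation would prompt
    # a judge model with the persona, dialogue context, and semantic content.
    ec = 1.0 if "sorry" in content.lower() else 0.5      # toy emotional-consistency proxy
    rc = 1.0 if persona.split()[0] in content else 0.5   # toy role-consistency proxy
    return {"ec": ec, "rc": rc}


def evaluate(resp: MRPAResponse, persona: str) -> dict:
    # The two score groups stay separate, so a low EC/RC score cannot be
    # blamed on synthesis noise, and a low synthesis score cannot be
    # blamed on off-persona content.
    semantic = judge_semantics(resp.text + " " + resp.audio_transcript, persona)
    return {"semantic": semantic, "synthesis": {"quality": resp.speech_quality}}


if __name__ == "__main__":
    resp = MRPAResponse("I am so sorry, my friend.", "I am so sorry...", 4.2)
    print(evaluate(resp, "Hamlet, Prince of Denmark"))
```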
Improvement in Human Agreement
By recasting subjective scoring as a bidirectional evidence-finding task, the framework significantly improves the agreement between LLM-as-Judge evaluations and human judgments; a minimal sketch of this idea appears below.
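The sketch below illustrates one plausible reading of the bidirectional evidence-finding idea: instead of asking the judge for a 1-5 score, ask it to extract quotes both supporting and contradicting the consistency claim, then derive a score from the evidence balance. The prompt wording, the aggregation rule, and `call_judge_llm` are assumptions for illustration; the paper's actual protocol may differ.

```python
import json

EVIDENCE_PROMPT = """You are evaluating a role-playing response.
Persona: {persona}
Response: {response}
List quotes from the response as evidence FOR persona consistency under
"supporting", and evidence AGAINST it under "contradicting".
Answer in JSON: {{"supporting": [...], "contradicting": [...]}}"""


def call_judge_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call (e.g. an OpenAI or local model client).
    return json.dumps({"supporting": ["'Alas, poor Yorick'"], "contradicting": []})


def evidence_based_score(persona: str, response: str) -> float:
    raw = call_judge_llm(EVIDENCE_PROMPT.format(persona=persona, response=response))
    evidence = json.loads(raw)
    pos, neg = len(evidence["supporting"]), len(evidence["contradicting"])
    # One possible aggregation: the fraction of extracted evidence that is
    # supporting. Grounding the score in quoted evidence is what makes the
    # judge's decision auditable and, per the paper, better aligned with humans.
    return pos / (pos + neg) if (pos + neg) else 0.5


if __name__ == "__main__":
    print(evidence_based_score("Hamlet", "Alas, poor Yorick! I knew him."))
```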
Demerits
Limitation in Dataset Generalizability
The study's findings may not generalize to other datasets or domains.
Bias in Existing Models
Existing models exhibit emotional templatization and simplification, showing a positive bias and a performance bottleneck on fine-grained negative emotions.
Expert Commentary
The article presents a comprehensive evaluation framework for multimodal role-playing agents, addressing a critical gap in the field. The framework's ability to decouple semantic assessment from modality generation is a significant improvement over existing approaches. However, further research is needed to address the limitations of the study, including the generalizability of the findings to other datasets and domains. The study's results have important implications for the development of more accurate and reliable multimodal role-playing agents, which can be applied in various fields such as education, customer service, and healthcare.
Recommendations
- ✓ Recommendation 1: Future studies should investigate the generalizability of the MERRY framework to other datasets and domains.
- ✓ Recommendation 2: Researchers should explore the development of more robust evaluation frameworks that can address the biases identified in existing models.