MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents
arXiv:2602.21941v1 Abstract: Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions. However, existing studies still rely on purely textual benchmarks to evaluate the text responses of MRPAs, while delegating the assessment of their multimodal expressions solely to modality-synthesis metrics. This evaluation paradigm, on the one hand, entangles semantic assessment with modality generation, leading to ambiguous error attribution, and on the other hand remains constrained by a heavy reliance on human judgment. To this end, we propose MERRY, a semantically decoupled evaluation framework for assessing Multimodal Emotional and Role consistencies of Role-playing agents. The framework introduces five refined metrics for emotional consistency (EC) and three for role consistency (RC). Notably, we transform the traditional subjective scoring approach into a novel bidirectional evidence-finding task, significantly improving the human agreement of LLM-as-Judge evaluations. Based on MERRY, we conduct extensive evaluations. Our empirical results primarily reveal that: (1) Training on synthetic datasets tends to reduce emotional consistency, whereas training on real-world datasets improves it; (2) Existing models suffer from emotional templatization and simplification, exhibiting a positive bias and a performance bottleneck on fine-grained negative emotions; (3) Simple prompting methods strengthen weak models but constrain strong ones, while simple fine-tuning suffers from poor role generalization. Code and datasets are available.
Executive Summary
This article presents MERRY, a novel evaluation framework for assessing the multimodal emotional and role consistencies of role-playing agents. The framework introduces five refined metrics for emotional consistency and three for role consistency, and recasts subjective scoring as a bidirectional evidence-finding task that substantially improves the agreement between LLM-as-Judge evaluations and human judgments. The authors conduct extensive evaluations using MERRY, revealing that training on synthetic datasets tends to reduce emotional consistency, while training on real-world datasets improves it. They also identify weaknesses in existing models, including emotional templatization and simplification, as well as limitations of simple prompting and fine-tuning methods. The study contributes to the development of more accurate and reliable evaluation frameworks for multimodal role-playing agents.
Key Points
- ▸ MERRY introduces a semantically decoupled evaluation framework for multimodal role-playing agents
- ▸ The framework includes five refined metrics for emotional consistency and three for role consistency
- ▸ Training on synthetic datasets can reduce emotional consistency, while training on real-world datasets improves it
Merits
Strength in Evaluation Framework
MERRY decouples semantic assessment from modality generation, so that errors can be attributed unambiguously to either the content of a response or the quality of its synthesized modality; a sketch of this split follows.
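To make the decoupling concrete, here is a minimal sketch of how such a pipeline could be structured. All names and the placeholder scorers are illustrative assumptions, not the paper's actual metrics or API; the point is only that semantic judgments (what is expressed) and synthesis metrics (how well it is rendered) never mix.

```python
from dataclasses import dataclass

@dataclass
class MRPAResponse:
    text: str              # the agent's textual reply
    audio_transcript: str  # semantic content recovered from the speech channel
    speech_quality: float  # e.g. a MOS-style synthesis score from an external tool


def judge_semantics(content: str, persona: str) -> dict:
    # Placeholder for an LLM-as-Judge call; a real implementation would prompt
    # a judge model with the persona, dialogue context, and semantic content.
    ec = 1.0 if "sorry" in content.lower() else 0.5      # toy emotional-consistency proxy
    rc = 1.0 if persona.split()[0] in content else 0.5   # toy role-consistency proxy
    return {"ec": ec, "rc": rc}


def evaluate(resp: MRPAResponse, persona: str) -> dict:
    # The two score groups stay separate, so a low EC/RC score cannot be
    # blamed on synthesis noise, and a low synthesis score cannot be
    # blamed on off-persona content.
    semantic = judge_semantics(resp.text + " " + resp.audio_transcript, persona)
    return {"semantic": semantic, "synthesis": {"quality": resp.speech_quality}}


if __name__ == "__main__":
    resp = MRPAResponse("I am so sorry, my friend.", "I am so sorry...", 4.2)
    print(evaluate(resp, "Hamlet, Prince of Denmark"))
```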
Improvement in Human Agreement
By recasting subjective scoring as a bidirectional evidence-finding task, the framework significantly improves the agreement between LLM-as-Judge evaluations and human judgments; a minimal sketch of this idea appears below.
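The sketch below illustrates one plausible reading of the bidirectional evidence-finding idea: instead of asking the judge for a 1-5 score, ask it to extract quotes both supporting and contradicting the consistency claim, then derive a score from the evidence balance. The prompt wording, the aggregation rule, and `call_judge_llm` are assumptions for illustration; the paper's actual protocol may differ.

```python
import json

EVIDENCE_PROMPT = """You are evaluating a role-playing response.
Persona: {persona}
Response: {response}
List quotes from the response as evidence FOR persona consistency under
"supporting", and evidence AGAINST it under "contradicting".
Answer in JSON: {{"supporting": [...], "contradicting": [...]}}"""


def call_judge_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call (e.g. an OpenAI or local model client).
    return json.dumps({"supporting": ["'Alas, poor Yorick'"], "contradicting": []})


def evidence_based_score(persona: str, response: str) -> float:
    raw = call_judge_llm(EVIDENCE_PROMPT.format(persona=persona, response=response))
    evidence = json.loads(raw)
    pos, neg = len(evidence["supporting"]), len(evidence["contradicting"])
    # One possible aggregation: the fraction of extracted evidence that is
    # supporting. Grounding the score in quoted evidence is what makes the
    # judge's decision auditable and, per the paper, better aligned with humans.
    return pos / (pos + neg) if (pos + neg) else 0.5


if __name__ == "__main__":
    print(evidence_based_score("Hamlet", "Alas, poor Yorick! I knew him."))
```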
Demerits
Limitation in Dataset Generalizability
The study's findings may not generalize to other datasets or domains.
Bias in Existing Models
Existing models exhibit emotional templatization and simplification, showing a positive bias and a performance bottleneck on fine-grained negative emotions.
Expert Commentary
The article presents a comprehensive evaluation framework for multimodal role-playing agents, addressing a critical gap in the field. The framework's ability to decouple semantic assessment from modality generation is a significant improvement over existing approaches. However, further research is needed to address the limitations of the study, including the generalizability of the findings to other datasets and domains. The study's results have important implications for the development of more accurate and reliable multimodal role-playing agents, which can be applied in various fields such as education, customer service, and healthcare.
Recommendations
- ✓ Recommendation 1: Future studies should investigate the generalizability of the MERRY framework to other datasets and domains.
- ✓ Recommendation 2: Researchers should explore the development of more robust evaluation frameworks that can address the biases identified in existing models.