
"Who Am I, and Who Else Is Here?" Behavioral Differentiation Without Role Assignment in Multi-Agent LLM Systems


Houssam EL Kandoussi

arXiv:2604.00026v1

Abstract: When multiple large language models interact in a shared conversation, do they develop differentiated social roles or converge toward uniform behavior? We present a controlled experimental platform that orchestrates simultaneous multi-agent discussions among 7 heterogeneous LLMs on a unified inference backend, systematically varying group composition, naming conventions, and prompt structure across 12 experimental series (208 runs, 13,786 coded messages). Each message is independently coded on six behavioral flags by two LLM judges from distinct model families (Gemini 3.1 Pro and Claude Sonnet 4.6), achieving mean Cohen's kappa = 0.78 with conservative intersection-based adjudication. Human validation on 609 randomly stratified messages confirmed coding reliability (mean kappa = 0.73 vs. Gemini). We find that (1) heterogeneous groups exhibit significantly richer behavioral differentiation than homogeneous groups (cosine similarity 0.56 vs. 0.85; p < 10^-5, r = 0.70); (2) groups spontaneously exhibit compensatory response patterns when an agent crashes; (3) revealing real model names significantly increases behavioral convergence (cosine 0.56 to 0.77, p = 0.001); and (4) removing all prompt scaffolding converges profiles to homogeneous-level similarity (p < 0.001). Critically, these behaviors are absent when agents operate in isolation, confirming that behavioral diversity is a structured, reproducible phenomenon driven by the interaction of architectural heterogeneity, group context, and prompt-level scaffolding.
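The abstract quantifies behavioral convergence as cosine similarity between agents' behavioral profiles (0.56 for heterogeneous groups vs. 0.85 for homogeneous ones). A minimal sketch of that metric, assuming each agent's profile is a vector of per-flag rates over the six coded behavioral flags (the numbers below are hypothetical, not data from the paper):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two behavioral-flag rate vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical per-flag rates for two agents across six behavioral flags
agent_a = [0.42, 0.10, 0.31, 0.05, 0.22, 0.18]
agent_b = [0.12, 0.38, 0.08, 0.29, 0.11, 0.33]
print(round(cosine_similarity(agent_a, agent_b), 2))
```

Higher similarity means more uniform behavior; the paper's finding is that heterogeneous groups sit much lower on this scale than homogeneous ones.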

Executive Summary

This study examines how multiple large language models (LLMs) behave when interacting in shared conversations, finding that heterogeneous groups develop richer behavioral differentiation than homogeneous ones. The authors introduce a controlled experimental platform to analyze the effects of group composition, naming conventions, and prompt structure on LLM behavior. Key findings include: behavioral differentiation is more pronounced in heterogeneous groups; compensatory response patterns emerge when an agent crashes; revealing real model names increases behavioral convergence; and removing prompt scaffolding collapses profiles to homogeneous-level similarity. These results have significant implications for the design of multi-agent systems and highlight the importance of architectural heterogeneity and group context in LLM interactions.

Key Points

  • Heterogeneous groups exhibit significantly richer behavioral differentiation than homogeneous groups (cosine similarity 0.56 vs. 0.85; p < 10^-5)
  • Compensatory response patterns emerge spontaneously when an agent crashes
  • Revealing real model names significantly increases behavioral convergence (cosine 0.56 to 0.77; p = 0.001)
  • Removing all prompt scaffolding converges profiles to homogeneous-level similarity (p < 0.001)

Merits

Strengths

The study employs a well-designed experimental platform, allowing for systematic variation of group composition and prompt structure. The independent coding of messages by two LLM judges from distinct model families enhances the reliability of the findings.
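The dual-judge scheme can be sketched in a few lines: Cohen's kappa measures chance-corrected agreement between the two judges, and intersection-based adjudication conservatively keeps a flag only when both judges set it. The binary judge vectors below are hypothetical illustrations, not data from the paper:

```python
def cohens_kappa(j1, j2):
    """Cohen's kappa for two binary label sequences of equal length."""
    n = len(j1)
    p_observed = sum(a == b for a, b in zip(j1, j2)) / n
    # Chance agreement from each judge's marginal flagging rate
    p1 = sum(j1) / n
    p2 = sum(j2) / n
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)
    return (p_observed - p_chance) / (1 - p_chance)

def intersection_adjudicate(j1, j2):
    """Conservative adjudication: keep a flag only if both judges set it."""
    return [a & b for a, b in zip(j1, j2)]

gemini = [1, 0, 1, 1, 0, 0, 1, 0]
claude = [1, 0, 1, 0, 0, 0, 1, 1]
print(cohens_kappa(gemini, claude))          # 0.5
print(intersection_adjudicate(gemini, claude))  # [1, 0, 1, 0, 0, 0, 1, 0]
```

Using judges from distinct model families, as the paper does, guards against a single family's shared biases inflating apparent agreement.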

Demerits

Limitations

The behavioral coding relies on judges from only two model families (Gemini 3.1 Pro and Claude Sonnet 4.6), which may not be representative of the broader LLM landscape. Additionally, the study rests on a relatively modest number of experimental runs (208 runs, 13,786 coded messages), which may limit statistical power in some conditions.

Expert Commentary

This study makes a significant contribution to the field of multi-agent systems by demonstrating that architectural heterogeneity and group context shape LLM interaction behavior. The findings carry important implications for the development of collaborative LLM applications and point to clear directions for further research. Although the coding pipeline draws judges from only two model families, the well-designed experimental platform and independent dual coding make the results robust and reliable. Overall, this study is a valuable addition to the literature on LLMs and multi-agent systems.

Recommendations

  • Future studies should investigate the effects of LLM heterogeneity on collaborative performance in a wider range of applications.
  • Policies governing the use of multi-agent LLM systems should take these findings into account, particularly the roles of architectural heterogeneity and group context.

Sources

Original: arXiv - cs.CL