
Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus


Dipkumar Patel

arXiv:2604.03809v1 Announce Type: new Abstract: Multi-agent LLM committees replicate the same model under different role prompts and aggregate outputs by majority vote, implicitly assuming that agents contribute complementary evidence. We embed each agent's chain-of-thought rationale and measure pairwise similarity: across 100 GSM8K questions with three Qwen2.5-14B agents, mean cosine similarity is 0.888 and effective rank is 2.17 out of 3.0, a failure mode we term representational collapse. DALC, a training-free consensus protocol that computes diversity weights from embedding geometry, reaches 87% on GSM8K versus 84% for self-consistency at 26% lower token cost. Ablation experiments reveal 1-3 point per-protocol run-to-run variance, confirm that hint sharing contributes more than diversity weighting alone, and show that encoder choice strongly modulates collapse severity (cosine 0.908 with mxbai versus 0.888 with nomic) and downstream accuracy. The more robust finding is that collapse is measurable, worsens on harder tasks, and that the choice of embedding proxy is a first-order design decision for any latent communication protocol.

Executive Summary

This article investigates the phenomenon of 'representational collapse' in multi-agent Large Language Model (LLM) committees, where agents' outputs converge despite being assigned different role prompts. The authors propose a diversity-aware consensus protocol (DALC) that improves accuracy and reduces token cost. Key findings include: (1) representational collapse can be measured using the pairwise similarity of embedded agent rationales; (2) hint sharing contributes more to performance than diversity weighting alone; and (3) the choice of embedding encoder strongly affects both measured collapse severity and downstream accuracy. The study highlights the importance of embedding proxy selection for latent communication protocols.
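The abstract reports two collapse metrics, mean pairwise cosine similarity and effective rank, without implementation details. A minimal sketch of how such metrics could be computed, assuming L2-normalized rationale embeddings and effective rank defined as the exponentiated entropy of the normalized singular values (a common definition; the paper's exact choice is not stated):

```python
import numpy as np

def collapse_metrics(embeddings: np.ndarray) -> tuple[float, float]:
    """Given one embedding row per agent rationale, return
    (mean pairwise cosine similarity, effective rank)."""
    # Normalize rows so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    # Average over the off-diagonal entries (distinct agent pairs).
    mean_cos = (sims.sum() - n) / (n * (n - 1))
    # Effective rank: exponentiated entropy of normalized singular values.
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / s.sum()
    eff_rank = float(np.exp(-(p * np.log(p + 1e-12)).sum()))
    return float(mean_cos), eff_rank

# Three orthogonal rationales: cosine 0.0, effective rank 3.0 (no collapse).
mean_cos, eff_rank = collapse_metrics(np.eye(3))
```

Under this definition, three fully redundant agents would score cosine 1.0 and effective rank 1.0, which makes the reported 0.888 / 2.17 figures read as severe but not total collapse.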

Key Points

  • Representational collapse in multi-agent LLM committees can be measured and characterized using pairwise similarity.
  • The proposed diversity-aware consensus protocol (DALC) improves performance and reduces token cost compared to self-consistency.
  • Encoder choice significantly affects collapse severity and downstream accuracy in LLM committees.
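The abstract says only that DALC "computes diversity weights from embedding geometry"; the exact weighting scheme is not described. A hypothetical sketch of one such scheme, where each agent is down-weighted by its mean cosine similarity to the other agents before a weighted vote (the function and weighting rule here are illustrative assumptions, not the paper's protocol):

```python
import numpy as np
from collections import defaultdict

def diversity_weighted_vote(answers: list[str], embeddings: np.ndarray) -> str:
    """Weight each agent by how dissimilar its rationale embedding is
    from the others', then take a weighted vote over final answers."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(answers)
    # An agent's redundancy is its mean similarity to the other agents.
    redundancy = (sims.sum(axis=1) - 1.0) / (n - 1)
    weights = 1.0 - redundancy  # more distinctive rationale, more weight
    tally = defaultdict(float)
    for ans, w in zip(answers, weights):
        tally[ans] += w
    return max(tally, key=tally.get)

# With orthogonal rationales, weights are equal and this reduces to
# plain majority vote.
winner = diversity_weighted_vote(["A", "A", "B"], np.eye(3))
```

Under collapse, redundant agents share one weight pool, so a single dissenting agent with a genuinely different rationale is no longer simply outvoted.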

Merits

Strength

The study provides a novel framework for measuring and characterizing representational collapse in multi-agent LLM committees, shedding light on a critical issue in the field.

Demerits

Limitation

The study relies on a single base model (Qwen2.5-14B) and two specific embedding encoders (mxbai and nomic), and may not generalize to other LLMs, encoders, or tasks.

Expert Commentary

The article presents a compelling analysis of representational collapse in multi-agent LLM committees and proposes a novel diversity-aware consensus protocol (DALC) that improves accuracy while reducing token cost. The findings have significant implications for the design of multi-agent LLM systems and underscore that the choice of embedding proxy is a first-order decision for any latent communication protocol. However, the evaluation's reliance on a single base model, a single benchmark family, and two encoders limits how far the results can be generalized. Overall, the study makes a valuable contribution to the field and demonstrates the potential of diversity-aware consensus protocols for multi-agent LLM systems.

Recommendations

  • Future research should investigate the applicability of DALC to other LLM models and tasks.
  • Developing more robust, encoder-agnostic methods for measuring representational collapse is necessary for the advancement of multi-agent LLM systems.

Sources

Original: arXiv - cs.LG