DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles
arXiv:2603.20975v1. Abstract: Multi-agent LLM systems, where multiple prompted instances of a language model independently answer questions, are increasingly used for complex reasoning tasks. However, existing methods for quantifying the uncertainty of their collective outputs rely on shallow voting statistics that discard the rich semantic information in agents' reasoning. We introduce DiscoUQ, a framework that extracts and leverages the structure of inter-agent disagreement -- both linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion) -- to produce well-calibrated confidence estimates. We propose three methods of increasing complexity: DiscoUQ-LLM (logistic regression on LLM-extracted structure features), DiscoUQ-Embed (logistic regression on embedding geometry), and DiscoUQ-Learn (a neural network combining all features). Evaluated on four diverse benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent system using Qwen3.5-27B, DiscoUQ-LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near-zero performance degradation and provide the largest improvements where they are most needed: in the ambiguous "weak disagreement" tier where simple vote counting fails.
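The abstract names three embedding-geometry quantities (cluster distances, dispersion, cohesion) without defining them. The sketch below shows one plausible way such features could be computed for a 5-agent ensemble. The function names, the use of cosine distance, and the exact definitions are assumptions made for illustration, not the paper's formulas, and the two-dimensional vectors are toy values rather than real embeddings.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_distance_to(vectors, point):
    """Average cosine distance from each vector to a reference point."""
    return sum(cosine_distance(v, point) for v in vectors) / len(vectors)


def geometry_features(embeddings, answers):
    """Hypothetical geometry features; the definitions are assumptions.

    dispersion:       mean distance of all agents to the overall centroid
    cohesion:         1 - mean distance of majority agents to their centroid
    cluster_distance: distance between majority and minority centroids
                      (0.0 when the agents are unanimous)
    """
    majority_answer, _ = Counter(answers).most_common(1)[0]
    majority = [e for e, a in zip(embeddings, answers) if a == majority_answer]
    minority = [e for e, a in zip(embeddings, answers) if a != majority_answer]
    majority_centroid = centroid(majority)
    if minority:
        cluster_distance = cosine_distance(majority_centroid, centroid(minority))
    else:
        cluster_distance = 0.0
    return {
        "dispersion": mean_distance_to(embeddings, centroid(embeddings)),
        "cohesion": 1.0 - mean_distance_to(majority, majority_centroid),
        "cluster_distance": cluster_distance,
    }


# Toy 2-D "embeddings" for five agents split 3-2 (illustrative values only).
embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
answers = ["yes", "yes", "yes", "no", "no"]
features = geometry_features(embeddings, answers)
print(features)
```

In a real pipeline the vectors would come from a sentence-embedding model applied to each agent's reasoning, and the resulting feature dictionary would feed a classifier such as the logistic regression the abstract describes for DiscoUQ-Embed.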
Executive Summary
The paper introduces DiscoUQ, a framework for uncertainty quantification in multi-agent LLM systems. Rather than relying on vote counts alone, DiscoUQ extracts the structure of inter-agent disagreement, covering both linguistic properties and embedding geometry, to produce calibrated confidence estimates. It proposes three methods of increasing complexity: DiscoUQ-LLM, DiscoUQ-Embed, and DiscoUQ-Learn. On four benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent Qwen3.5-27B system, DiscoUQ-LLM reaches an average AUROC of 0.802 against 0.791 for the best baseline (LLM Aggregator), with a lower ECE (0.036 vs. 0.098). The learned features transfer across benchmarks with near-zero degradation, and the largest gains appear in the "weak disagreement" tier, where simple vote counting fails.
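Two of the three methods are described in the abstract as logistic regression over disagreement features. As a rough illustration of that idea, the sketch below fits a logistic regression by plain gradient descent and maps a feature vector to a confidence in the majority answer. The feature columns (`vote_share`, `evidence_overlap`), the training rows, and the hyperparameters are all made up for the example; the paper's actual features, data, and training setup are not given in the abstract.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_confidence(w, b, x):
    """Estimated probability that the ensemble's majority answer is correct."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical rows: [vote_share, evidence_overlap]; label 1 means the
# majority answer was correct. Values are invented for illustration.
X = [[1.0, 0.9], [0.8, 0.7], [0.6, 0.2], [0.6, 0.6], [1.0, 0.8], [0.6, 0.1]]
y = [1, 1, 0, 1, 1, 0]

w, b = fit_logistic(X, y)
strong = predict_confidence(w, b, [1.0, 0.9])  # unanimous, shared evidence
weak = predict_confidence(w, b, [0.6, 0.1])    # 3-2 split, little overlap
print(round(strong, 3), round(weak, 3))
```

Note that the two 3-2 rows with different `evidence_overlap` values can receive different confidences, which is the kind of distinction a vote count alone cannot make.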
Key Points
- ▸ DiscoUQ extracts the structure of inter-agent disagreement (evidence overlap, argument strength, divergence depth, and embedding geometry) to improve uncertainty quantification
- ▸ Three methods of increasing complexity are proposed; DiscoUQ-LLM reaches an average AUROC of 0.802 vs. 0.791 for the best baseline, with ECE 0.036 vs. 0.098
- ▸ The largest improvements occur in the ambiguous "weak disagreement" tier, where simple vote counting fails
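To see why vote counting alone is coarse in ambiguous cases, consider a minimal majority-vote baseline. For a binary question and five agents, the majority share can only be 0.6, 0.8, or 1.0, so every 3-2 split receives the same confidence regardless of how the agents argued. The function name `vote_confidence` is hypothetical, and the abstract does not define where its "weak disagreement" tier begins.

```python
from collections import Counter


def vote_confidence(answers):
    """Return the plurality answer and its vote share as a confidence."""
    answer, top_count = Counter(answers).most_common(1)[0]
    return answer, top_count / len(answers)


# Two 3-2 splits that might rest on very different reasoning still get
# identical confidence from the vote count.
split_a = ["yes", "yes", "yes", "no", "no"]
split_b = ["no", "yes", "no", "yes", "yes"]
print(vote_confidence(split_a))
print(vote_confidence(split_b))
print(vote_confidence(["yes"] * 5))
```

Features drawn from the agents' reasoning are what would let a method separate such cases.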
Merits
Strength of DiscoUQ
DiscoUQ's primary merit is that it uses the semantic content of agents' reasoning, which vote statistics discard, to produce better-calibrated confidence estimates: DiscoUQ-LLM reports an ECE of 0.036 against 0.098 for the best baseline. It also improves average AUROC (0.802 vs. 0.791) across the four benchmarks.
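For readers less familiar with the two metrics behind these claims, the sketch below computes expected calibration error (ECE) with equal-width bins and AUROC by pairwise comparison. These are the standard textbook definitions; the bin count of 10 is an assumption, since the abstract does not state the paper's binning, and the input values are toy numbers.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


def auroc(scores, labels):
    """Probability that a correct item outscores an incorrect one (ties = 0.5)."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 80% confidence with 4 of 5 correct is perfectly calibrated.
calibrated = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
# Full confidence with half correct is off by 0.5.
overconfident = expected_calibration_error([1.0] * 4, [1, 1, 0, 0])
ranking = auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
print(calibrated, overconfident, ranking)
```

A method can rank answers well (high AUROC) while still being miscalibrated (high ECE), which is why the two numbers are reported separately.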
Demerits
Limited Generalizability
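The abstract reports near-zero performance degradation when the learned features are transferred across the four benchmarks. All four are question-answering tasks, however, and the evaluation uses a single model (Qwen3.5-27B) in a 5-agent configuration, so the reported results do not establish generalization to other task types, model families, or ensemble sizes. Further research is required to test the framework in those settings.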
Computational Complexity
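DiscoUQ adds processing on top of running the ensemble itself: DiscoUQ-LLM depends on LLM-extracted structure features, and DiscoUQ-Embed depends on embedding geometry. The abstract does not quantify this overhead, which may limit applicability in resource-constrained environments.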
Expert Commentary
DiscoUQ treats inter-agent disagreement as a signal to be analyzed rather than a count to be tallied, which is a useful shift for uncertainty quantification in LLM ensembles. The reported AUROC gain over the best baseline is modest (0.802 vs. 0.791); the more notable result is calibration, with an ECE of 0.036 against 0.098. Limitations remain, particularly around evaluation breadth and cost. As multi-agent LLM systems see wider use, the need for reliable confidence estimates will grow, and DiscoUQ is a valuable contribution toward meeting it.
Recommendations
- ✓ Further research should test DiscoUQ on other task types, model families, and ensemble sizes to establish its generalizability.
- ✓ The framework's computational cost and scalability should be measured to establish its practical applicability.
Sources
Original: arXiv - cs.CL