Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
arXiv:2604.00477v1
Abstract: LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed? Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation. We then identify a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law; both exhibit diminishing returns, but scores saturate roughly twice as fast as discoveries. We hypothesize this reflects a power-law distribution of the finding space: critical issues are discovered first by small panels, while corner cases require progressively larger panels, analogous to species accumulation curves in ecology. The mechanism traces to ensemble diversity: Big Five personality conditioning makes agents probe different quality dimensions, with expert judges acting as adversarial probes that push discovery into the tail of the finding distribution. A controlled ablation confirms that structured persona conditioning, not simple prompting, is required to produce these scaling properties.
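To make the two scaling forms concrete, here is a minimal curve-fitting sketch in Python. The panel sizes, scores, and discovery counts are illustrative placeholders, not the paper's data; only the functional forms (logarithmic for scores, sublinear power law for discoveries) come from the abstract.

```python
# Sketch: fitting the two scaling forms described in the abstract.
# All data values below are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

panel_sizes = np.array([1, 2, 4, 8, 16, 32])
scores      = np.array([3.1, 3.6, 3.9, 4.1, 4.2, 4.25])  # hypothetical mean quality scores
discoveries = np.array([4, 7, 12, 20, 33, 54])            # hypothetical unique issues found

def log_model(n, a, b):
    """Logarithmic saturation: score(n) = a + b * ln(n)."""
    return a + b * np.log(n)

def power_model(n, c, alpha):
    """Sublinear power law: discoveries(n) = c * n**alpha, with 0 < alpha < 1."""
    return c * n**alpha

(score_a, score_b), _ = curve_fit(log_model, panel_sizes, scores)
(disc_c, disc_alpha), _ = curve_fit(power_model, panel_sizes, discoveries)

print(f"score(n)       ~ {score_a:.2f} + {score_b:.2f} ln(n)")
print(f"discoveries(n) ~ {disc_c:.2f} * n^{disc_alpha:.2f}")
```

Under these forms, marginal score gains shrink quickly with each added judge, while discovery counts keep growing, which is the dissociation the paper reports.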
Executive Summary
This article presents an agent-based approach to evaluating conversational AI, using persona-conditioned large language models (LLMs) as judges in place of human raters. The central finding is a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law. The authors attribute this gap to a power-law distribution of the finding space, in which critical issues are discovered by small panels and corner cases require progressively larger ones. The study identifies ensemble diversity and structured persona conditioning as the mechanisms behind these scaling properties.
Key Points
- ▸ Agent-based evaluation using LLMs can produce evaluations indistinguishable from human raters.
- ▸ Quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law.
- ▸ A power-law distribution of the finding space is proposed to explain the score-coverage dissociation (see the simulation sketch after this list).
- ▸ Ensemble diversity and structured persona conditioning are crucial for producing these scaling properties.
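The species-accumulation analogy can be illustrated with a small simulation: if issue frequencies follow a power law, a panel's coverage is the union of what its judges surface, and that union grows sublinearly. Everything here is an assumption for illustration: the number of issues, the frequency exponent, and how many issues a single judge probes.

```python
# Sketch: coverage growth under a power-law finding space, analogous to
# species accumulation curves in ecology. All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

n_issues = 500
ranks = np.arange(1, n_issues + 1)
p = ranks ** -1.5          # assumed power-law issue frequencies
p /= p.sum()

def unique_discoveries(panel_size, probes_per_judge=20):
    """Each judge surfaces a handful of issues drawn from the tail-heavy
    finding distribution; the panel's coverage is the union of all draws."""
    draws = rng.choice(n_issues, size=panel_size * probes_per_judge, p=p)
    return len(np.unique(draws))

for n in [1, 2, 4, 8, 16, 32, 64]:
    found = np.mean([unique_discoveries(n) for _ in range(50)])
    print(f"panel of {n:2d} judges -> ~{found:5.1f} unique issues")
```

Small panels quickly hit the high-frequency (critical) issues, while each doubling of the panel adds mostly rare corner cases, matching the qualitative behavior the paper describes.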
Merits
Methodological Innovation
The article introduces a persona-conditioned agent-judge framework, validates it against human raters in a Turing-style test, and uses a controlled ablation to show that structured persona conditioning, not simple prompting, produces the observed scaling properties.
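As a concrete illustration of structured persona conditioning, here is a minimal sketch of a Big Five-conditioned judge prompt. The abstract does not specify the paper's actual template; the `BigFivePersona` fields and prompt wording below are assumptions.

```python
# Minimal sketch of structured persona conditioning for an agent judge.
# The trait encoding and prompt wording are illustrative assumptions,
# not the paper's actual template.
from dataclasses import dataclass

@dataclass
class BigFivePersona:
    openness: float           # each trait in [0.0, 1.0]
    conscientiousness: float
    extraversion: float
    agreeableness: float
    neuroticism: float

def judge_system_prompt(p: BigFivePersona) -> str:
    """Render a persona into a system prompt for an LLM judge."""
    level = lambda x: "high" if x >= 0.5 else "low"
    return (
        "You are evaluating a conversational AI session as a rater with "
        f"{level(p.openness)} openness, {level(p.conscientiousness)} "
        f"conscientiousness, {level(p.extraversion)} extraversion, "
        f"{level(p.agreeableness)} agreeableness, and "
        f"{level(p.neuroticism)} neuroticism. Score the session 1-5 and "
        "list any quality issues you notice, staying in character."
    )

print(judge_system_prompt(BigFivePersona(0.9, 0.3, 0.6, 0.2, 0.7)))
```

The design intent, per the abstract, is that differently conditioned judges probe different quality dimensions, so a panel built from varied personas covers more of the finding space than copies of a single prompt.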
Theoretical Insights
The study offers a nuanced understanding of the power-law distribution of the finding space and its implications for conversational AI evaluation.
Practical Applications
The findings have significant implications for the development and evaluation of conversational AI models in real-world applications.
Demerits
Limited Generalizability
The study is conducted with a specific set of models and tasks, which may not be representative of the broader conversational AI landscape.
Lack of Human Judgment
Beyond the initial Turing-style validation, the evaluation relies solely on LLM-based agent judges, which may not fully capture how real users experience conversational AI models.
Expert Commentary
The article presents a compelling case for agent-based evaluation in conversational AI research. However, its limitations should be acknowledged: the results come from a specific set of models and tasks, and further work should test whether they generalize. Ensemble diversity and structured persona conditioning emerge as the key levers behind the observed scaling behavior and deserve focused attention in future research.
Recommendations
- ✓ Future studies should investigate the generalizability of the results to a broader set of models and tasks.
- ✓ Practitioners should size judge panels according to their goal: small, persona-diverse panels suffice for reliable quality scores, while comprehensive issue discovery requires progressively larger panels to reach the tail of the finding distribution.
Sources
Original: arXiv - cs.AI