Consensus is Not Verification: Why Crowd Wisdom Strategies Fail for LLM Truthfulness
arXiv:2603.06612v1 Abstract: Pass@k and other methods of scaling inference compute can improve language model performance in domains with external verifiers, including mathematics and code, where incorrect candidates can be filtered reliably. This raises a natural question: can we similarly scale compute to elicit gains in truthfulness for domains without convenient verification? We show that across five benchmarks and models, surprisingly, it cannot. Even at 25x the inference cost of naive sampling, polling-style aggregation yields no consistent accuracy gains over single-sample baselines and often amplifies shared misconceptions. We find that under uncertainty, models are better at predicting what other models will say within model ensembles than at identifying what is true, revealing a separation between social prediction and truth verification. Across models and benchmarks, aggregation fails to provide a robust truth signal because language model errors are strongly correlated. The source of correlation goes beyond any individual benchmark: we show that even when conditioned on out-of-distribution random strings and asked to produce pseudo-random outputs, different models produce correlated outputs. Confidence-based weighting provides no benefit because self-reported confidence fails to reliably distinguish correct from incorrect answers. These results delineate a boundary for inference-time scaling: in verified domains, additional samples provide more candidates for a verifier to filter; in unverified domains, additional samples merely reinforce shared misconceptions.
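The polling-style aggregation the paper evaluates amounts to sampling several answers and taking the plurality vote. A minimal sketch of that idea (function and sample names here are illustrative, not taken from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among sampled candidates."""
    counts = Counter(answers)
    top_answer, _ = counts.most_common(1)[0]
    return top_answer

# With an external verifier (e.g. unit tests for generated code), wrong
# candidates can be filtered before voting; without one, the vote itself
# is the only signal, and a shared misconception wins the vote.
samples = ["Paris", "Paris", "Lyon", "Paris", "Lyon"]
print(majority_vote(samples))
```

The paper's finding is that in unverified domains this vote tracks what models agree on, not what is true.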
Executive Summary
This article examines whether crowd-wisdom strategies, specifically scaling inference compute and aggregating sampled answers, can improve language model truthfulness in domains lacking external verifiers. Contrary to expectations, the study finds that even at 25 times the inference cost of naive sampling, polling-style aggregation yields no consistent accuracy gains over single-sample baselines and often amplifies shared misconceptions. The findings reveal a separation between social prediction and truth verification: under uncertainty, models predict what other models will say more reliably than they identify what is true.
Key Points
- ▸ Crowd wisdom strategies fail to improve language model truthfulness in unverified domains.
- ▸ Polling-style aggregation often amplifies shared misconceptions.
- ▸ Language model errors are strongly correlated, leading to a failure of aggregation to provide a robust truth signal.
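The third point can be made concrete with a back-of-the-envelope calculation: when voter errors are independent, majority-vote accuracy climbs with the number of samples (the Condorcet jury effect), but when errors are perfectly correlated, every extra sample repeats the same mistake and accuracy stays at the single-sample level. A hedged sketch, assuming an illustrative single-sample accuracy of 0.6:

```python
from math import comb

def majority_accuracy_independent(p: float, k: int) -> float:
    """P(a majority of k independent samples is correct), each correct w.p. p."""
    needed = k // 2 + 1
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(needed, k + 1))

p = 0.6  # assumed single-sample accuracy (illustrative, not a paper result)
for k in (1, 5, 25):
    print(k, round(majority_accuracy_independent(p, k), 3))
# If model errors are perfectly correlated, all k samples make the same
# mistake together, so majority-vote accuracy remains p regardless of k.
```

The paper's evidence that different models err in correlated ways places LLM ensembles near the second regime, which is why scaling samples buys no truthfulness.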
Merits
Methodological Rigor
The study employs a robust methodology, utilizing five benchmarks and multiple models to investigate the efficacy of crowd wisdom strategies.
Thorough Analysis
The authors trace the failure to its mechanism: model errors are strongly correlated even on out-of-distribution inputs, and under uncertainty models predict what other models will say more reliably than they identify what is true, separating social prediction from truth verification.
Demerits
Limited Generalizability
The study evaluates a fixed set of models and benchmarks, so its conclusions may not extend to other models, aggregation schemes, or domains; further research is needed to establish the boundary's breadth.
Lack of Experimental Controls
The authors do not provide a detailed description of the experimental controls used, which may limit the interpretability of the results.
Expert Commentary
The study offers a nuanced account of why aggregation fails as a truth signal in unverified domains, underscoring the role of verification in building accurate and reliable AI systems. By delineating a practical boundary for inference-time scaling, it has implications both for deployment decisions and for the broader research agenda on eliciting truthfulness. The main caveats are the uncertain generalizability of the findings and the limited description of experimental controls, which constrain how far the results can be interpreted.
Recommendations
- ✓ Further research is needed to investigate the efficacy of alternative verification mechanisms in unverified domains.
- ✓ Developers and deployers of language models should prioritize the implementation of robust verification mechanisms to ensure the accuracy and reliability of AI systems.