The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

Sudipta Ghosh, Mrityunjoy Panday

arXiv:2603.09985v1 (Announce Type: cross)

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their ability to accurately assess their own confidence remains poorly understood. We present an empirical study investigating whether LLMs exhibit patterns reminiscent of the Dunning-Kruger effect -- a cognitive bias where individuals with limited competence tend to overestimate their abilities. We evaluate four state-of-the-art models (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2) across four benchmark datasets totaling 24,000 experimental trials. Our results reveal striking calibration differences: Kimi K2 exhibits severe overconfidence with an Expected Calibration Error (ECE) of 0.726 despite only 23.3% accuracy, while Claude Haiku 4.5 achieves the best calibration (ECE = 0.122) with 75.4% accuracy. These findings demonstrate that poorly performing models display markedly higher overconfidence -- a pattern analogous to the Dunning-Kruger effect in human cognition. We discuss implications for safe deployment of LLMs in high-stakes applications.

Executive Summary

This study empirically investigates the Dunning-Kruger effect in large language models (LLMs) by evaluating four prominent models (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2) on four benchmark datasets totaling 24,000 trials. The findings reveal a compelling analogy to the human cognitive bias: models with lower accuracy exhibit markedly higher overconfidence, with Kimi K2 showing an ECE of 0.726 at 23.3% accuracy, while Claude Haiku 4.5 demonstrates superior calibration (ECE = 0.122) at 75.4% accuracy. These results suggest a systematic misalignment between self-assessment and actual performance in LLMs, akin to the Dunning-Kruger effect, raising critical concerns for deployment in high-stakes domains.
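For readers unfamiliar with the metric, Expected Calibration Error bins predictions by reported confidence and takes the sample-weighted average gap between each bin's accuracy and its mean confidence. The sketch below uses a standard 10-bin equal-width scheme; the paper's exact binning protocol is not stated in this summary, so treat the bin count and edges as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |bin accuracy - bin mean confidence|.

    confidences: per-trial confidences in [0, 1] as reported by the model
    correct: per-trial 0/1 (or bool) correctness labels
    Note: 10 equal-width bins is a common convention, assumed here.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        acc = sum(ok for _, ok in bucket) / len(bucket)
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(acc - mean_conf)
    return ece

# Toy illustration of the overconfidence pattern the paper reports:
# a model that always claims 0.9 confidence but is right only 20% of the time.
conf = [0.9] * 10
right = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(round(expected_calibration_error(conf, right), 2))  # 0.7
```

With every prediction at 0.9 confidence and 20% accuracy, the single occupied bin contributes |0.2 - 0.9| = 0.7, which is the kind of gap behind Kimi K2's reported ECE of 0.726 at 23.3% accuracy.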

Key Points

  • LLMs exhibit Dunning-Kruger-like overconfidence patterns
  • Calibration error rises as accuracy falls: lower-performing models are more overconfident
  • Kimi K2 shows extreme overconfidence despite low accuracy

Merits

Empirical Rigor

The study uses standardized benchmarks and a sufficiently large sample size (24,000 trials) to establish robust correlations between confidence calibration and model performance.

Demerits

Generalizability Limitation

Results are specific to the four models tested; extrapolation to other LLMs or newer architectures may require additional validation.

Expert Commentary

The analogy between the Dunning-Kruger effect and LLM confidence calibration is both insightful and academically significant. This work adds a novel dimension to the discourse on AI cognition by applying a well-documented human bias to machine self-assessment. The empirical methodology is commendable, particularly the use of Expected Calibration Error as a quantitative proxy for bias. However, the study’s scope is constrained by its limited model diversity—incorporating more heterogeneous architectures, including open-source and proprietary variants, would strengthen the validity of the findings. Moreover, the implications extend beyond deployment: these findings may influence the design of next-generation LLMs with built-in self-assessment mechanisms. Overall, this is a pivotal contribution to the intersection of cognitive science and AI, offering a framework for future research on machine self-perception.

Recommendations

  • Integrate confidence calibration metrics into standard LLM evaluation protocols
  • Develop open-source tools to quantify and mitigate overconfidence in LLMs