Rescaling Confidence: What Scale Design Reveals About LLM Metacognition
arXiv:2603.09309v1 Announce Type: new Abstract: Verbalized confidence, in which LLMs report a numerical certainty score, is widely used to estimate uncertainty in black-box settings, yet the confidence scale itself (typically 0--100) is rarely examined. We show that this design choice is not neutral. Across six LLMs and three datasets, verbalized confidence is heavily discretized, with more than 78% of responses concentrating on just three round-number values. To investigate this phenomenon, we systematically manipulate confidence scales along three dimensions: granularity, boundary placement, and range regularity, and evaluate metacognitive sensitivity using meta-d'. We find that a 0--20 scale consistently improves metacognitive efficiency over the standard 0--100 format, while boundary compression degrades performance and round-number preferences persist even under irregular ranges. These results demonstrate that confidence scale design directly affects the quality of verbalized uncertainty and should be treated as a first-class experimental variable in LLM evaluation.
Executive Summary
The article 'Rescaling Confidence: What Scale Design Reveals About LLM Metacognition' critically examines the confidence scale used when Large Language Models (LLMs) verbalize uncertainty. By manipulating scale design along three dimensions (granularity, boundary placement, and range regularity), the authors demonstrate that the standard 0-100 scale has significant limitations. Their findings show that a 0-20 scale consistently improves metacognitive efficiency, while boundary compression degrades performance. The study argues that the confidence scale should be treated as a first-class experimental variable in LLM evaluation, with significant implications for the development and deployment of LLMs in high-stakes applications.
Key Points
- ▸ The standard 0-100 confidence scale is heavily discretized, with more than 78% of responses concentrating on just three round-number values.
- ▸ A 0-20 scale improves metacognitive efficiency over the standard 0-100 format.
- ▸ Boundary compression degrades metacognitive performance, and round-number preferences persist under irregular ranges.
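The discretization finding above is easy to quantify: count how much of the response mass falls on the few most frequent values. The sketch below is illustrative only; the scores are hypothetical, not the paper's data.

```python
from collections import Counter

def top_k_mass(confidences, k=3):
    """Fraction of responses concentrated on the k most frequent values."""
    counts = Counter(confidences)
    top = counts.most_common(k)
    return sum(c for _, c in top) / len(confidences)

# Hypothetical verbalized confidence scores on a 0-100 scale,
# clustered on round numbers as the paper describes.
sample = [90, 80, 90, 95, 80, 90, 70, 90, 80, 95]
share = top_k_mass(sample)  # mass on {90, 80, 95} -> 0.9
```

A share near 1.0 on a 101-point scale indicates that the model uses only a handful of round-number values, which is the discretization the authors report.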
Merits
Strength
The study's systematic approach to manipulating scale design and evaluating metacognitive sensitivity provides a robust examination of the confidence scale's limitations.
Demerits
Limitation
The study's focus on a limited set of LLMs and datasets may not generalize to all LLMs and application domains.
Expert Commentary
This study makes a significant contribution to AI research by showing that the design of confidence scales in LLMs is not a neutral choice. The systematic manipulation of granularity, boundary placement, and range regularity, combined with evaluation via meta-d', gives the conclusions a solid methodological footing. While the focus on six LLMs and three datasets may limit generalizability, the findings carry clear practical weight: alternative scales such as 0-20 may improve metacognitive efficiency and reduce the risk of misreading a model's uncertainty. As AI systems take on increasingly critical roles, confidence-scale design should be matched deliberately to the needs of the application rather than defaulting to 0-100.
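The paper measures metacognitive sensitivity with meta-d', which requires fitting a signal-detection model. As a much simpler proxy (an assumption of this sketch, not the authors' method), one can compute the AUROC of confidence against answer correctness: the probability that a correct answer received higher confidence than an incorrect one.

```python
def auroc(confidences, correct):
    """AUROC of confidence as a predictor of correctness.

    Probability that a randomly chosen correct answer received higher
    confidence than a randomly chosen incorrect one (ties count 0.5).
    This is a crude stand-in for meta-d', not the paper's metric.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical run: higher confidence on the two correct answers.
score = auroc([90, 60, 80, 70], [True, False, True, False])  # -> 1.0
```

An AUROC of 0.5 means confidence carries no information about correctness; values near 1.0 indicate well-ordered confidence. Comparing this score across scale formats (0-100 vs 0-20) would mirror, in a rough way, the comparison the authors run with meta-d'.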
Recommendations
- ✓ Future research should investigate the use of alternative confidence scales in a wider range of LLMs and application domains.
- ✓ Developers should consider using 0-20 or other alternative confidence scales in high-stakes applications to improve metacognitive efficiency and the fidelity of reported uncertainty.