On Calibration of Large Language Models: From Response To Capability

arXiv:2602.13540v1 Announce Type: new

Abstract: Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.
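
The abstract leaves the formal definitions to the paper itself; the sketch below is a reading aid under assumed notation (p(x), c_resp, and c_cap are our names, not necessarily the paper's), making the response/capability distinction concrete:

```latex
% Assumed formalization, not necessarily the paper's exact notation.
% Capability of model \pi on query x: expected accuracy over decoding.
\[
  p(x) \;=\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}
             \bigl[\mathbf{1}\{y \text{ is correct}\}\bigr]
\]
% Response calibration: per-response confidence matches the correctness
% rate among responses assigned confidence level t.
\[
  \mathbb{E}\bigl[\mathbf{1}\{y \text{ is correct}\} \,\big|\,
      c_{\mathrm{resp}}(x, y) = t\bigr] \;=\; t
\]
% Capability calibration: per-query confidence matches the expected
% accuracy among queries assigned confidence level t.
\[
  \mathbb{E}\bigl[p(x) \,\big|\, c_{\mathrm{cap}}(x) = t\bigr] \;=\; t
\]
% Under independent samples, capability determines pass@k directly:
\[
  \mathrm{pass@}k(x) \;=\; 1 - \bigl(1 - p(x)\bigr)^{k}
\]
```

A response-level score can be well calibrated on average while saying little about p(x) for any particular query, which is one way the two notions can diverge.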

Executive Summary

The article 'On Calibration of Large Language Models: From Response To Capability' addresses the critical issue of accurate confidence estimation in large language models (LLMs). The authors argue that prior research has focused on response-level confidence, which estimates the correctness of a single generated output, but this approach is misaligned with practical settings where the overall capability of the model to solve a query is more important. The stochastic nature of modern LLM decoding means that single-response correctness does not reflect the model's underlying capability. To address this, the authors introduce capability calibration, which targets the model's expected accuracy on a query. They formally distinguish capability calibration from response calibration, demonstrating both theoretical and empirical differences. The study evaluates various confidence estimation methods and shows that capability-calibrated confidence improves pass@k prediction and inference budget allocation, laying a foundation for diverse applications.
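
Concretely, the target of capability calibration, a query's expected accuracy, can be estimated by repeated sampling. A minimal sketch in Python (not the paper's code; `generate` and `is_correct` are hypothetical stand-ins for a stochastic model call and a task-specific correctness checker):

```python
def estimate_capability(query, generate, is_correct, n: int = 32) -> float:
    """Monte Carlo estimate of p(x): the fraction of n stochastically
    decoded responses to `query` that are judged correct."""
    hits = sum(bool(is_correct(query, generate(query))) for _ in range(n))
    return hits / n
```

A single sampled response yields only a 0/1 observation of this quantity, which is why response-level correctness is a noisy proxy for capability.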

Key Points

  • Prior work on LLM calibration focuses on response-level confidence, which may not reflect the model's overall capability.
  • The stochastic nature of LLM decoding makes single-response correctness an unreliable indicator of model capability.
  • Capability calibration is introduced to target the model's expected accuracy on a query.
  • Capability calibration is formally distinguished from response calibration, showing theoretical and empirical differences.
  • Capability-calibrated confidence improves pass@k prediction and inference budget allocation (a pass@k estimator sketch follows this list).
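
For reference, empirical pass@k, the quantity capability-calibrated confidence is shown to predict, is conventionally computed with the unbiased estimator of Chen et al. (2021) from n samples of which c are correct. The sketch below uses that standard formula and is not code from this paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of them correct:
    1 - C(n - c, k) / C(n, k), in numerically stable product form."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```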

Merits

Novel Approach

The introduction of capability calibration addresses a significant gap in the current understanding of LLM confidence estimation, providing a more accurate and practical measure of model performance.

Rigorous Analysis

The article provides a thorough theoretical and empirical analysis, distinguishing capability calibration from response calibration and demonstrating its advantages.

Practical Applications

The findings have immediate practical implications for improving pass@k prediction and inference budget allocation, which are critical for reliable LLM deployment.
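
The summary does not say how the paper operationalizes budget allocation; one natural illustration, assuming independent samples and a capability-calibrated per-sample accuracy estimate p_hat, is to spend just enough samples per query to reach a target solve probability:

```python
import math

def budget_for_query(p_hat: float, target: float = 0.95, k_max: int = 64) -> int:
    """Smallest k with predicted solve rate 1 - (1 - p_hat)**k >= target,
    capped at k_max. Assumes independent samples and a calibrated p_hat."""
    if p_hat <= 0.0:
        return k_max  # judged (near-)unsolvable: cap the spend
    if p_hat >= target:
        return 1      # a single sample already clears the target
    k = math.ceil(math.log(1.0 - target) / math.log(1.0 - p_hat))
    return min(k, k_max)
```

Under such a rule, easy queries consume a single sample while hard-but-solvable ones receive more, which is the intuition behind using calibrated capability for inference budgeting.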

Demerits

Limited Scope

The study focuses primarily on the theoretical and empirical differences between capability and response calibration, but does not extensively explore the practical implementation challenges or the scalability of the proposed methods.

Generalizability

While the results are promising, the generalizability of the findings to different types of LLMs and various practical settings remains to be fully established.

Expert Commentary

The article presents a significant advancement in the field of LLM calibration by introducing the concept of capability calibration. The authors effectively demonstrate the limitations of response-level confidence estimation and provide a robust framework for evaluating model capability. The empirical results are compelling, showing clear improvements in pass@k prediction and inference budget allocation. However, the study could benefit from a more detailed exploration of the practical challenges associated with implementing capability calibration in real-world scenarios. Additionally, further research is needed to assess the generalizability of the findings across different types of LLMs and applications. The implications of this work are far-reaching, impacting both the practical deployment of LLMs and the broader ethical and policy considerations surrounding AI. The article sets a strong foundation for future research in this critical area.

Recommendations

  • Further empirical studies should be conducted to evaluate the scalability and generalizability of capability calibration across different LLM architectures and applications.
  • Practical guidelines should be developed for implementing capability calibration in real-world settings, addressing potential challenges and best practices.

Sources

  • arXiv:2602.13540v1: 'On Calibration of Large Language Models: From Response To Capability'