Confidence Should Be Calibrated More Than One Turn Deep

arXiv:2604.05397v1 Announce Type: new

Abstract: Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality and consistency of the model response in multi-turn interactions. Extensive experiments demonstrate that MTCal achieves outstanding and consistent performance in multi-turn calibration, and ConfChat preserves and even enhances model performance in multi-turn interactions. Our results mark multi-turn calibration as one missing link for scaling LLM calibration toward safe, reliable, and real-world use.

Executive Summary

This paper addresses a critical gap in the reliability of Large Language Models (LLMs) by shifting the focus from single-turn confidence calibration to multi-turn calibration. The authors argue that existing calibration methods fail to account for the dynamic nature of real-world interactions, where user feedback and conversation history can distort model confidence over time. They introduce the ECE@T metric to quantify calibration drift and propose two novel contributions: MTCal, a framework to minimize calibration error across turns, and ConfChat, a decoding strategy leveraging calibrated confidence to enhance factuality and consistency. The study demonstrates that multi-turn calibration is essential for scaling LLMs to high-stakes applications, marking a significant advancement in trustworthy AI.

Key Points

  • Multi-turn calibration is essential for reliable LLM interactions in high-stakes domains, as single-turn approaches overlook dynamic user feedback effects.
  • The introduction of ECE@T (Expected Calibration Error at turn T) reveals that user persuasion can degrade calibration accuracy over successive turns, posing risks to model reliability.
  • MTCal and ConfChat address these challenges by dynamically adjusting confidence levels and leveraging calibrated responses to improve factuality and consistency in multi-turn conversations.
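
The abstract only names ECE@T, so the following is an illustrative sketch of one straightforward reading, not the paper's exact definition: compute standard Expected Calibration Error, but restricted to each conversation's prediction at turn t, so calibration drift can be plotted turn by turn.

```python
def ece(confidences, correct, n_bins=10):
    """Standard Expected Calibration Error: bin predictions by confidence,
    then average |accuracy - mean confidence| per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        i = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 -> last bin
        bins[i].append((c, y))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        err += (len(b) / total) * abs(accuracy - mean_conf)
    return err

def ece_at_turn(dialogues, t, n_bins=10):
    """Sketch of ECE@T: pool every dialogue's prediction at turn t
    (skipping shorter dialogues) and score calibration on that slice.
    `dialogues` is a list of per-turn (confidence, correct) sequences."""
    confs = [d[t][0] for d in dialogues if len(d) > t]
    corr = [d[t][1] for d in dialogues if len(d) > t]
    return ece(confs, corr, n_bins)
```

Plotting `ece_at_turn` for t = 0, 1, 2, ... is what makes feedback-induced degradation visible: a model that is well calibrated at turn 0 can show a rising curve once persuasion enters the history.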

Merits

Novelty of Problem Framing

The paper uniquely shifts the calibration paradigm from static single-turn to dynamic multi-turn settings, addressing a previously underexplored area critical for real-world LLM deployment.

Methodological Rigor

The introduction of ECE@T and the MTCal framework provides a rigorous, empirically validated approach to quantifying and mitigating calibration drift in multi-turn interactions.
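
The summary does not spell out MTCal's surrogate calibration target, so the sketch below is only an assumption about its general shape: because ECE's hard binning is non-differentiable, a common stand-in is a Brier-style penalty that pushes stated confidence toward empirical correctness, averaged over turns so that later, feedback-exposed turns are optimised as much as the first.

```python
def brier_surrogate(confidences, correct):
    """Mean squared gap between stated confidence and 0/1 correctness.
    A proper scoring rule: minimised in expectation when confidence
    equals the true probability of being correct."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

def multi_turn_loss(dialogues):
    """Hypothetical per-turn aggregation (not the paper's exact target):
    average the surrogate over turns so every turn counts equally.
    `dialogues`: list of per-turn (confidence, correct) sequences."""
    max_t = max(len(d) for d in dialogues)
    per_turn = []
    for t in range(max_t):
        pairs = [d[t] for d in dialogues if len(d) > t]
        per_turn.append(brier_surrogate([c for c, _ in pairs],
                                        [y for _, y in pairs]))
    return sum(per_turn) / len(per_turn)
```

The design point is that a smooth, decomposable loss like this can be minimised with ordinary gradient training while still driving the binned ECE@T metric down.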

Practical Impact

ConfChat demonstrates tangible improvements in factuality and consistency, offering actionable solutions for deploying LLMs in safety-critical applications such as healthcare, finance, and education.
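
ConfChat's mechanics are not detailed in this summary, so the following is a generic sketch of confidence-gated decoding under stated assumptions: `generate` and `estimate_confidence` are hypothetical callables standing in for the model and its calibrated confidence estimator. The idea it illustrates is that a calibrated score lets the system hedge rather than assert when confidence is low, which is one route to better factuality under user pushback.

```python
def confidence_gated_reply(generate, estimate_confidence, history, threshold=0.6):
    """Sketch of confidence-aware decoding (not the paper's ConfChat).
    generate(history) -> candidate answer string
    estimate_confidence(history, answer) -> calibrated confidence in [0, 1]
    Above the threshold the answer is asserted; below it, the reply
    is explicitly hedged instead of stated as fact."""
    answer = generate(history)
    conf = estimate_confidence(history, answer)
    if conf >= threshold:
        return answer, conf
    return f"I'm not certain, but my best guess is: {answer}", conf
```

A system like this also resists sycophantic flips: if the calibrated confidence stays high after a user objects, the gate keeps the original answer rather than capitulating.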

Demerits

Limited Generalizability of Findings

The study primarily evaluates calibration performance in controlled experimental settings; further validation across diverse domains and user populations is needed to confirm robustness.

Computational Overhead

Dynamic calibration (MTCal) and decoding (ConfChat) may introduce additional computational costs, potentially limiting scalability for resource-constrained deployments.

User Feedback Variability

The analysis assumes user feedback is measurable and predictable, but real-world interactions may involve noisy, ambiguous, or adversarial inputs that challenge calibration mechanisms.

Expert Commentary

This paper makes a seminal contribution to the field of trustworthy AI by exposing a critical flaw in existing calibration approaches: the assumption that confidence remains static across interactions. The introduction of ECE@T and the dynamic frameworks (MTCal and ConfChat) are particularly noteworthy, as they address a gap that has long undermined the reliability of LLMs in real-world scenarios. For instance, in healthcare, where user feedback may include subtle cues of doubt or disagreement, a model’s uncalibrated confidence could lead to misdiagnosis or inappropriate treatment recommendations. The authors’ emphasis on multi-turn dynamics aligns with broader trends in human-AI interaction research, where adaptability to user intent is paramount. However, the paper could benefit from deeper exploration of adversarial scenarios, where malicious users might intentionally manipulate calibration mechanisms. Additionally, while the empirical results are compelling, the scalability of these methods in large-scale deployments remains an open question. Overall, the work represents a paradigm shift in calibration research and sets a new benchmark for future studies in reliable conversational AI.

Recommendations

  • Develop standardized benchmarks for multi-turn calibration that include adversarial and edge-case scenarios to test robustness.
  • Integrate MTCal and ConfChat into open-source LLM frameworks to foster community adoption and accelerate real-world validation.
  • Collaborate with domain experts (e.g., clinicians, financial analysts) to tailor calibration mechanisms to specific high-stakes applications.

Sources

Original: arXiv - cs.CL