ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making

arXiv:2602.22771v1

Abstract: Clinical decisions are often required under incomplete information. Clinical experts must identify whether available information is sufficient for judgment, as both premature conclusion and unnecessary abstention can compromise patient safety. To evaluate this capability of large language models (LLMs), we developed ClinDet-Bench, a benchmark based on clinical scoring systems that decomposes incomplete-information scenarios into determinable and undeterminable conditions. Identifying determinability requires considering all hypotheses about missing information, including unlikely ones, and verifying whether the conclusion holds across them. We find that recent LLMs fail to identify determinability under incomplete information, producing both premature judgments and excessive abstention, despite correctly explaining the underlying scoring knowledge and performing well under complete information. These findings suggest that existing benchmarks are insufficient to evaluate the safety of LLMs in clinical settings. ClinDet-Bench provides a framework for evaluating determinability recognition, leading to appropriate abstention, with potential applicability to medicine and other high-stakes domains, and is publicly available.
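The determinability check the abstract describes can be made concrete with a small sketch. This is not the authors' implementation; the scoring system, criterion names, and threshold below are hypothetical stand-ins for a clinical score with binary criteria. The idea is exactly as stated: enumerate every hypothesis about the missing fields and test whether the decision is invariant across all of them.

```python
from itertools import product

# Hypothetical binary-criteria score (illustrative only, not from the paper).
CRITERIA = ["confusion", "urea_high", "resp_rate_high", "low_bp", "age_65_plus"]
THRESHOLD = 2  # assumed cutoff: total score >= 2 -> "high risk"

def decision(values):
    """Map a fully specified case to a binary clinical decision."""
    return sum(values) >= THRESHOLD

def determinability(case):
    """case maps each criterion to True, False, or None (missing).

    Returns ("determinable", decision) if the decision is the same under
    every completion of the missing fields, else ("undeterminable", None).
    """
    missing = [c for c in CRITERIA if case[c] is None]
    outcomes = set()
    # Consider all hypotheses about the missing information,
    # including unlikely ones, per the benchmark's definition.
    for completion in product([False, True], repeat=len(missing)):
        filled = dict(case, **dict(zip(missing, completion)))
        outcomes.add(decision([filled[c] for c in CRITERIA]))
    if len(outcomes) == 1:
        return "determinable", outcomes.pop()
    return "undeterminable", None

# Two known positives already reach the threshold, so the decision holds
# no matter how the two missing fields turn out.
case = {"confusion": True, "urea_high": True, "resp_rate_high": None,
        "low_bp": None, "age_65_plus": False}
print(determinability(case))  # -> ('determinable', True)
```

An appropriate model response would commit to the decision in the case above, but abstain on a case where different completions of the missing fields flip the outcome.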

Executive Summary

This article introduces ClinDet-Bench, a benchmark designed to evaluate the judgment determinability of large language models (LLMs) in clinical decision-making. By decomposing incomplete-information scenarios into determinable and undeterminable conditions, ClinDet-Bench tests whether an LLM can recognize when the available information suffices for a judgment and when it should abstain. The authors find that recent LLMs perform poorly at this task, producing both premature judgments and excessive abstention, even though the same models correctly explain the underlying scoring knowledge and perform well when information is complete. This gap highlights the insufficiency of existing benchmarks for evaluating LLM safety in clinical contexts. ClinDet-Bench offers a framework for determinability recognition, with potential applications in medicine and other high-stakes domains, and the findings underscore the need for more comprehensive evaluation methods before LLMs are relied upon in clinical decision-making.

Key Points

  • ClinDet-Bench is a benchmark designed to evaluate LLMs' judgment determinability in clinical decision-making.
  • Recent LLMs perform poorly in identifying determinable and undeterminable conditions under incomplete information.
  • Existing benchmarks are insufficient to evaluate LLM safety in clinical settings.

Merits

Novelty

The development of ClinDet-Bench offers a novel approach to evaluating LLMs' judgment determinability, addressing a critical gap in existing benchmarks.

Methodological innovation

The authors' use of clinical scoring systems and decomposed incomplete-information scenarios provides a rigorous framework for assessing LLM performance in clinical contexts.

Potential applicability

ClinDet-Bench may have implications for medicine and other high-stakes domains, underscoring its significance in the field of AI-assisted decision-making.

Demerits

Generalizability

The study's exclusive focus on LLMs may limit how well its conclusions transfer to other types of AI models or decision-support systems.

Scalability

The authors do not address how ClinDet-Bench scales beyond its clinical scoring systems, which may limit its practical application in real-world clinical settings.

Contextual dependence

The findings may be context-dependent, requiring further exploration of the benchmark's performance in various clinical scenarios and settings.

Expert Commentary

The article's contribution is significant: it addresses a critical gap in existing benchmarks by evaluating not whether an LLM answers correctly, but whether it can tell when an answer is warranted at all. However, the study's limitations, notably scalability and contextual dependence, should be addressed in future research. The findings have far-reaching implications for AI-assisted decision-making, emphasizing the need for more comprehensive evaluation methods and greater transparency in AI systems. As the use of LLMs in clinical settings grows, ClinDet-Bench provides a useful framework for assessing their safety, and applying it well will require collaboration between researchers, clinicians, and policymakers.

Recommendations

  • Future research should focus on addressing the limitations of ClinDet-Bench, including scalability and contextual dependence, to ensure its practical application in real-world clinical settings.
  • Developing more robust evaluation methods and benchmarks for LLMs in clinical decision-making is essential, requiring collaboration between researchers, clinicians, and policymakers.