
MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation

Artus Krohn-Grimberghe

arXiv:2603.02222v1 (Announce Type: new)

Abstract: MedCalc-Bench is a widely used benchmark for evaluating LLM performance on clinical calculator tasks, with state-of-the-art direct prompting scores plateauing around 35% on the Verified split (HELM MedHELM leaderboard) and the best published approach, RL with verifiable rewards, reaching 74%. We present three contributions that challenge the benchmark's current framing. First, we conduct a systematic audit of the benchmark's calculator implementations, identifying and fixing over 20 errors ranging from critical formula inaccuracies to runtime bugs in a NeurIPS-published dataset. Second, we show that a simple intervention, providing the model with the calculator specification at inference time ("open-book" prompting), raises accuracy from ~52% to 81-85% on GLM-4.6V and GLM-4.7, surpassing all published results including RL-trained systems, without any fine-tuning. Third, we establish an upper bound of 95-97% using GPT-5.2-Thinking, with residual errors attributable primarily to ground-truth issues and dataset ambiguities. Our findings suggest that MedCalc-Bench predominantly measures formula memorization and arithmetic precision rather than clinical reasoning, and would be better framed as a tool-use evaluation.

Executive Summary

This article presents a critique of MedCalc-Bench, a widely used benchmark for evaluating Large Language Model (LLM) performance on clinical calculator tasks. The authors conduct a systematic audit of the benchmark's calculator implementations, identifying and fixing over 20 errors ranging from formula inaccuracies to runtime bugs. They also show that providing the model with the calculator specification at inference time ("open-book" prompting) raises accuracy from roughly 52% to 81-85%, surpassing all published results, including RL-trained systems, without any fine-tuning. The findings suggest that MedCalc-Bench predominantly measures formula memorization and arithmetic precision rather than clinical reasoning, and the authors propose reframing it as a tool-use evaluation. The study has clear implications for how LLMs are evaluated in clinical domains, underscoring the need for benchmarks that actually probe clinical reasoning.
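To make the intervention concrete, here is a minimal sketch of open-book prompting, assuming a generic OpenAI-style chat client and an illustrative CKD-EPI 2021 specification. The specification text, prompt wording, and model name are stand-ins for illustration, not the paper's exact setup.

```python
# Minimal sketch of "open-book" prompting: the calculator specification is
# included in the prompt so the model applies it instead of recalling it.
# The spec, prompt wording, and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

CKD_EPI_2021_SPEC = """\
CKD-EPI 2021 creatinine equation (illustrative spec):
eGFR = 142 * min(Scr/kappa, 1)^alpha * max(Scr/kappa, 1)^(-1.200)
       * 0.9938^age * (1.012 if female else 1.000)
kappa = 0.7 (female) or 0.9 (male); alpha = -0.241 (female) or -0.302 (male)
Scr is serum creatinine in mg/dL; result is in mL/min/1.73 m^2.
"""

def open_book_messages(patient_note: str, question: str, spec: str) -> list[dict]:
    """Build a chat prompt that hands the model the relevant calculator spec."""
    return [
        {"role": "system",
         "content": "Apply the provided calculator specification exactly. "
                    "Show the substituted values, then state the final result."},
        {"role": "user",
         "content": f"Calculator specification:\n{spec}\n"
                    f"Patient note:\n{patient_note}\n\nQuestion: {question}"},
    ]

messages = open_book_messages(
    patient_note="62-year-old woman; serum creatinine 1.1 mg/dL.",
    question="What is this patient's eGFR by the CKD-EPI 2021 equation?",
    spec=CKD_EPI_2021_SPEC,
)
reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(reply.choices[0].message.content)
```

The contrast with direct (closed-book) prompting is simply the removal of the specification block; everything else can stay identical.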

Key Points

  • The benchmark's calculator implementations contain more than 20 errors, from critical formula inaccuracies to runtime bugs
  • Providing the model with the calculator specification at inference time ("open-book" prompting) lifts accuracy from ~52% to 81-85%, above all published results
  • As currently framed, MedCalc-Bench rewards formula memorization and arithmetic precision over clinical reasoning

Merits

Strength in methodology

The authors employ a rigorous and systematic approach to auditing the MedCalc-Bench benchmark, ensuring the accuracy and reliability of their findings.

Significance of results

The study's findings have far-reaching implications for how LLMs are developed and evaluated in clinical domains, since they show that a widely cited benchmark rewards memorization and arithmetic rather than the clinical reasoning it is assumed to measure.

Proposed framework for tool-use evaluation

The authors' proposal for reframing MedCalc-Bench as a tool-use evaluation is a novel and thought-provoking approach, which could lead to more comprehensive and accurate assessments of LLM performance.
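Under such a framing, the model under test would be judged on whether it extracts the right inputs and calls a verified calculator, while the arithmetic is delegated to the tool. The sketch below illustrates one way this could look, using an OpenAI-style function schema and a reference CKD-EPI implementation; the schema, function names, and scoring rule are assumptions for illustration, not the authors' harness.

```python
# Minimal sketch of a "tool-use" evaluation: the harness exposes a verified
# calculator as a callable tool and credits the model for extracting the
# correct arguments; the arithmetic itself is the tool's responsibility.
# Schema, names, and scoring rule are illustrative, not the paper's harness.
import json

def ckd_epi_2021(scr_mg_dl: float, age: int, female: bool) -> float:
    """Reference CKD-EPI 2021 creatinine equation (eGFR in mL/min/1.73 m^2)."""
    kappa, alpha = (0.7, -0.241) if female else (0.9, -0.302)
    egfr = (142.0
            * min(scr_mg_dl / kappa, 1.0) ** alpha
            * max(scr_mg_dl / kappa, 1.0) ** -1.200
            * 0.9938 ** age
            * (1.012 if female else 1.0))
    return round(egfr, 1)

# OpenAI-style function schema a harness might hand to the model under test.
EGFR_TOOL = {
    "type": "function",
    "function": {
        "name": "ckd_epi_2021",
        "description": "Compute eGFR from serum creatinine, age, and sex.",
        "parameters": {
            "type": "object",
            "properties": {
                "scr_mg_dl": {"type": "number"},
                "age": {"type": "integer"},
                "female": {"type": "boolean"},
            },
            "required": ["scr_mg_dl", "age", "female"],
        },
    },
}

def score_tool_call(model_arguments_json: str, gold_args: dict) -> bool:
    """Score input extraction rather than arithmetic: the tool computes the value."""
    try:
        args = json.loads(model_arguments_json)
    except json.JSONDecodeError:
        return False
    return args == gold_args

# Example: the model emitted these tool-call arguments for the patient note.
print(score_tool_call('{"scr_mg_dl": 1.1, "age": 62, "female": true}',
                      {"scr_mg_dl": 1.1, "age": 62, "female": True}))  # True
```

Whether the harness also checks the tool's final value against ground truth, or only the extracted arguments, is a design choice a reframed benchmark would need to make explicit.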

Demerits

Limitation in scope

The study's focus on MedCalc-Bench may limit how far its conclusions generalize; other benchmarks and evaluation frameworks could suffer from similar implementation errors, but they are not examined here.

Dependence on dataset quality

The study's own accuracy figures still depend on the quality of the partially corrected dataset; the residual ground-truth issues and ambiguities the authors acknowledge introduce bias and variability into the reported numbers.

Expert Commentary

This study is a significant contribution to the evaluation of LLMs, exposing how much a headline benchmark number can depend on implementation errors and on what the benchmark actually rewards. The proposal to reframe MedCalc-Bench as a tool-use evaluation is thought-provoking and could lead to more accurate assessments of LLM performance on clinical calculations. That said, the audit covers a single benchmark, so its conclusions do not automatically transfer to other evaluation suites, and the reported accuracy figures still rest on a dataset whose residual ground-truth issues the authors themselves acknowledge.

Recommendations

  • Develop more nuanced and accurate benchmarking methods that capture the complexities of clinical reasoning and decision-making, rather than rewarding formula recall.
  • Audit other benchmarks and evaluation frameworks for similar errors and biases, and develop strategies to mitigate them.

Sources

  • arXiv:2603.02222v1, "MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation"