MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation
arXiv:2603.02222v1 Announce Type: new Abstract: MedCalc-Bench is a widely used benchmark for evaluating LLM performance on clinical calculator tasks, with state-of-the-art direct prompting scores plateauing …
Artus Krohn-Grimberghe
3 views