The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition
arXiv:2603.18294v1 Announce Type: new Abstract: Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) …
Alvin Rajkomar, Pavan Sudarshan, Angela Lai, Lily Peng
12 views