Shattering the Shortcut: A Topology-Regularized Benchmark for Multi-hop Medical Reasoning in LLMs
arXiv:2603.12458v1 Announce Type: cross Abstract: While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they severely struggle with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is "shortcut learning", where models exploit highly connected, generic hub nodes (e.g., "inflammation") in knowledge graphs to bypass authentic micro-pathological cascades. To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel $k$-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination. Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA's structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: https://shattermed-qa-web.vercel.app/
Executive Summary
The article 'Shattering the Shortcut' introduces ShatterMed-QA, a benchmark addressing a critical gap in medical LLM capabilities: multi-hop diagnostic reasoning. While LLMs excel at single-hop factual recall, they consistently fail to navigate complex clinical pathways due to 'shortcut learning,' whereby models exploit generic hub nodes to bypass authentic pathological cascades. ShatterMed-QA counters this with a topology-regularized Knowledge Graph built via a $k$-Shattering algorithm that physically removes generic hubs, severing logical shortcuts. The benchmark comprises 10,558 bilingual multi-hop questions and incorporates implicit bridge entity masking and topology-driven hard negative sampling to enforce biologically plausible reasoning. Evaluations across 21 LLMs show substantial performance declines on multi-hop tasks, particularly for domain-specific models, demonstrating the benchmark's discriminative power. Importantly, restoring the masked evidence via RAG reverses this degradation, substantiating the structural integrity of the benchmark. This work represents a notable advance in diagnosing and mitigating reasoning deficits in medical AI.
Key Points
- Introduction of ShatterMed-QA, a bilingual multi-hop clinical reasoning benchmark of 10,558 questions
- Use of a $k$-Shattering algorithm to prune generic hubs and disrupt shortcut learning
- Validation via performance degradation across 21 LLMs and the RAG recovery effect
Merits
Structural Innovation
The $k$-Shattering algorithm introduces a novel, topology-driven mechanism to systematically dismantle shortcut learning without compromising general knowledge integrity.
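The paper does not publish the algorithm's code, but the core idea it describes, pruning highly connected generic hubs so that shortcut paths through them disappear, can be illustrated with a minimal sketch. Everything below (the function name `shatter_hubs`, the degree-threshold criterion, and the toy edge list) is an assumption for illustration, not the authors' implementation:

```python
# Hypothetical sketch of hub pruning in the spirit of k-Shattering:
# drop every node whose degree exceeds a threshold k, severing shortcut
# paths that route through generic hubs such as "inflammation".
from collections import defaultdict

def shatter_hubs(edges, k):
    """Return the edge list with all nodes of degree > k removed."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    hubs = {n for n, d in degree.items() if d > k}
    return [(u, v) for u, v in edges if u not in hubs and v not in hubs]

edges = [
    ("appendicitis", "inflammation"), ("inflammation", "fever"),
    ("inflammation", "sepsis"), ("inflammation", "arthritis"),
    ("appendicitis", "periumbilical pain"),
    ("periumbilical pain", "RLQ migration"),
]
pruned = shatter_hubs(edges, k=3)
# "inflammation" (degree 4) is removed, while the disease-specific chain
# appendicitis -> periumbilical pain -> RLQ migration survives.
```

The point of the sketch is the asymmetry: a model can no longer hop from any disease to any symptom via the generic hub, so only the specific pathological cascade remains traversable.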
Empirical Validation
The benchmark’s effectiveness is empirically substantiated through controlled evaluations across 21 LLMs, demonstrating measurable and significant performance degradation on multi-hop tasks.
Demerits
Scalability Concern
The $k$-Shattering algorithm’s computational overhead may limit scalability for real-time clinical AI applications or large-scale knowledge graph updates.
Generalizability Limitation
While the benchmark excels in diagnosing reasoning deficits, its focus on bilingual medical data may restrict applicability to non-English or region-specific clinical contexts without adaptation.
Expert Commentary
This paper represents a landmark contribution to the intersection of medical informatics and AI. The conceptualization of 'shortcut learning' as a systemic cognitive bias in LLMs—rather than a minor performance anomaly—is a paradigm shift. The $k$-Shattering algorithm’s design reflects a deep understanding of both graph theory and clinical logic: by removing generic hubs, it does not merely filter noise but reconstructs the ontological integrity of diagnostic pathways. The use of implicit bridge masking and hard negative sampling is particularly elegant, as it transforms evaluation from a content-based test into a cognitive simulation of real-world diagnostic decision-making. Crucially, the RAG recovery effect is not a trivial observation; it is a confirmation that the deficit is structural, not semantic—meaning that with appropriate retrieval augmentation, even current models can be recalibrated. This dual insight—diagnosing the problem and offering a viable remediation—elevates ShatterMed-QA from a benchmark to a diagnostic tool for AI medicine itself. The implications extend beyond evaluation: this framework may inspire analogous structural interventions in other complex domains, from legal reasoning to financial forecasting.
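The commentary's praise of topology-driven hard negative sampling can be made concrete with a small sketch: distractors are drawn from nodes that sit close to the gold answer in the graph (hence biologically plausible) but lie off the reasoning path. The function names, the breadth-first-search criterion, and the toy adjacency map are illustrative assumptions, not the paper's actual sampling procedure:

```python
# Hypothetical sketch of topology-driven hard negative sampling:
# candidate distractors are nodes within max_hops of the gold answer
# (plausible confusers) that do not appear on the reasoning path.
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source over an adjacency-dict graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def hard_negatives(adj, answer, path, max_hops=2):
    """Off-path nodes within max_hops of the answer, as distractors."""
    dist = bfs_distances(adj, answer)
    return sorted(n for n, d in dist.items()
                  if 0 < d <= max_hops and n not in path)

adj = {
    "ulcer": ["H. pylori", "gastritis"],
    "H. pylori": ["ulcer"],
    "gastritis": ["ulcer", "GERD"],
    "GERD": ["gastritis"],
}
path = ["H. pylori", "ulcer"]  # gold reasoning chain ending at "ulcer"
negs = hard_negatives(adj, "ulcer", path)
# "gastritis" (1 hop) and "GERD" (2 hops): near the answer, off the path.
```

Because such distractors share the answer's local neighborhood, they cannot be eliminated superficially, which is exactly the property the commentary highlights.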
Recommendations
- Extend the $k$-Shattering methodology to other high-stakes domains (e.g., legal, financial) where shortcut learning poses analogous risks to decision integrity.