
Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination

arXiv:2602.16050v1 Announce Type: new Abstract: Background: Large language models have demonstrated strong performance on general medical examinations, but subspecialty clinical reasoning remains challenging due to rapidly evolving guidelines and nuanced evidence hierarchies. Methods: We evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. Mirror operated under a closed-evidence constraint without external retrieval. Comparator LLMs had real-time web access to guidelines and primary literature. Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy less than 50%), Mirror achieved 76.7% accuracy. Top-2 accuracy was 92.5% for Mirror versus 85.25% for GPT-5.2. Conclusions: Mirror provided evidence traceability: 74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification. Curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment.
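The reported 95% CI of 80.4-92.3% for 105/120 correct is consistent with a Wilson score interval. The abstract does not state which interval method the authors used, so Wilson is an assumption here; a minimal sketch reproducing the figures:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% CI at z=1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Mirror: 105 correct out of 120 questions
low, high = wilson_ci(105, 120)
print(f"{low:.1%} - {high:.1%}")  # 80.4% - 92.3%
```

The match with the published interval suggests the Wilson method (which avoids the overshoot of the simpler Wald interval near extreme proportions) was used, though this remains an inference from the numbers alone.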

Executive Summary

This study evaluates January Mirror, an evidence-grounded clinical intelligence layer for subspecialty clinical reasoning. Mirror integrates a curated clinical evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. In a head-to-head comparison with frontier large language models (LLMs) on a 2025 endocrinology board-style examination, Mirror achieved 87.5% accuracy, outperforming both the human reference (62.3%) and all comparator LLMs. Its performance was particularly strong on the 30 most difficult questions (those with human accuracy below 50%), where it scored 76.7%. The study highlights the potential of curated evidence with explicit provenance to support auditability and improve clinical decision-making.

Key Points

  • January Mirror, an evidence-grounded clinical intelligence layer, achieved 87.5% accuracy on a 2025 endocrinology board-style examination, outperforming the human reference (62.3%) and frontier LLMs with real-time web access (GPT-5.2, 74.6%; GPT-5, 74.0%; Gemini-3-Pro, 69.8%).
  • Mirror scored 76.7% on the 30 questions where human accuracy fell below 50%, and 74.2% of its outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification.
  • The results suggest that curated evidence with explicit provenance can outperform unconstrained web retrieval and support auditability for clinical deployment.

Merits

Strengths in Subspecialty Clinical Reasoning

January Mirror's results demonstrate that an evidence-grounded clinical intelligence layer can improve subspecialty clinical decision-making. Notably, Mirror outperformed the comparator LLMs and the human reference despite operating under a closed-evidence constraint, while the comparators had real-time web access, underscoring the value of curated evidence with explicit provenance.

Evidence Traceability

Mirror provides evidence traceability: 74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification. This supports auditability and improves the transparency of clinical decision-making.

Demerits

Limited Generalizability

The study's focus on a single endocrinology board-style examination, together with a corpus limited to endocrinology and cardiometabolic evidence, may limit the generalizability of the findings to other medical specialties and domains.

Dependence on Curated Evidence

Mirror's performance depends on the quality, coverage, and currency of its curated evidence corpus; comparable corpora may not be readily available, or kept up to date, for all medical specialties.

Expert Commentary

The study presents a novel approach to subspecialty clinical reasoning, showing that curated evidence with explicit provenance can support auditability while also improving accuracy. The findings are significant for medical specialties with rapidly evolving guidelines and nuanced evidence hierarchies. However, the study's limitations, notably its dependence on a hand-curated corpus and uncertain generalizability beyond endocrinology, highlight the need for ongoing research and development. The implications are nonetheless far-reaching, with potential benefits for clinical decision-making, patient outcomes, and policy development.

Recommendations

  • Future research should focus on developing and testing evidence-grounded clinical intelligence layers in other medical specialties and domains.
  • Developing standards for clinical evidence synthesis and knowledge translation is critical to support the development and deployment of evidence-grounded clinical intelligence layers.
