From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs
arXiv:2604.05348v1. Abstract: Hallucinations in medical large language models (LLMs) remain a safety-critical issue, particularly when available evidence is insufficient or conflicting. We study this problem in diabetic retinopathy (DR) decision settings and introduce RETINA-SAFE, an evidence-grounded benchmark aligned with retinal grading records, comprising 12,522 samples. RETINA-SAFE is organized into three evidence-relation tasks: E-Align (evidence-consistent), E-Conflict (evidence-conflicting), and E-Gap (evidence-insufficient). We further propose ECRT (Evidence-Conditioned Risk Triage), a two-stage white-box detection framework: Stage 1 performs Safe/Unsafe risk triage, and Stage 2 refines unsafe cases into contradiction-driven versus evidence-gap risks. ECRT leverages internal representation and logit shifts under CTX/NOCTX conditions, with class-balanced training for robust learning. Under evidence-grouped (not patient-disjoint) splits across multiple backbones, ECRT provides strong Stage-1 risk triage and explicit subtype attribution, improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and by +0.02 to +0.07 over the strongest adapted supervised baseline, and consistently exceeds a single-stage white-box ablation on Stage-1 balanced accuracy. These findings support white-box internal signals grounded in retinal evidence as a practical route to interpretable medical LLM risk triage.
Executive Summary
The article presents RETINA-SAFE, an evidence-grounded benchmark of 12,522 samples, aligned with retinal grading records, for evaluating hallucination risk in medical LLMs in diabetic retinopathy (DR) decision settings. The benchmark defines three evidence-relation tasks: E-Align (evidence-consistent), E-Conflict (evidence-conflicting), and E-Gap (evidence-insufficient). The authors also propose ECRT (Evidence-Conditioned Risk Triage), a two-stage white-box detection framework: Stage 1 triages cases as Safe or Unsafe, and Stage 2 attributes unsafe cases to contradiction-driven or evidence-gap risk. ECRT uses shifts in internal representations and logits between the CTX and NOCTX conditions (presumably with and without the evidence in context), with class-balanced training. Under evidence-grouped splits across multiple backbones, it improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and by +0.02 to +0.07 over the strongest adapted supervised baseline. The authors conclude that white-box internal signals grounded in retinal evidence are a practical route to interpretable risk triage for medical LLMs.
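The two-stage routing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the score names, the 0.5 thresholds, and the mapping from benchmark task to triage target are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from RETINA-SAFE task to (Stage-1 label, Stage-2 subtype).
# The abstract does not spell this mapping out; it is inferred from the task names.
TASK_TO_TARGET = {
    "E-Align": ("Safe", None),
    "E-Conflict": ("Unsafe", "contradiction"),
    "E-Gap": ("Unsafe", "evidence_gap"),
}


@dataclass
class TriageResult:
    stage1: str             # "Safe" or "Unsafe"
    subtype: Optional[str]  # None when the case is Safe


def triage(
    p_unsafe: float,
    p_contradiction: float,
    stage1_threshold: float = 0.5,  # hypothetical operating point
    stage2_threshold: float = 0.5,  # hypothetical operating point
) -> TriageResult:
    """Route one case through Stage 1 (Safe/Unsafe) and, if Unsafe, Stage 2.

    p_unsafe: hypothetical Stage-1 detector score for the Unsafe class.
    p_contradiction: hypothetical Stage-2 score, consulted only for Unsafe cases.
    """
    if p_unsafe < stage1_threshold:
        return TriageResult("Safe", None)
    if p_contradiction >= stage2_threshold:
        return TriageResult("Unsafe", "contradiction")
    return TriageResult("Unsafe", "evidence_gap")
```

Keeping Stage 2 behind the Stage-1 decision means the subtype classifier only ever has to separate the two unsafe categories, which is the separation of concerns the two-stage design implies.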
Key Points
- ▸ RETINA-SAFE introduces a novel, evidence-grounded benchmark for evaluating hallucination risks in medical LLMs, specifically tailored to diabetic retinopathy (DR) decision settings.
- ▸ The benchmark organizes tasks into three evidence-relation categories: E-Align (consistent evidence), E-Conflict (conflicting evidence), and E-Gap (insufficient evidence), providing a nuanced framework for assessing LLM performance.
- ▸ ECRT, a two-stage white-box detection framework, improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and by +0.02 to +0.07 over the strongest adapted supervised baseline, which suggests that internal representation signals carry useful information for medical LLM risk triage.
Merits
Novel Benchmark Design
RETINA-SAFE offers a rigorously structured benchmark that systematically evaluates hallucination risks in medical LLMs across three evidence-relation tasks, addressing a critical gap in current evaluation methodologies.
White-Box Detection Framework
ECRT uses shifts in internal representations and logits between the CTX and NOCTX conditions, which makes its risk triage easier to interpret than black-box or external uncertainty baselines and lets it attribute unsafe cases to a specific subtype.
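To make the idea of a CTX/NOCTX shift concrete, here is a minimal sketch of features one could compute from two forward passes of the same model. The particular features (KL divergence, L2 distance, top-1 change) and the function names are assumptions chosen for illustration; the abstract does not specify which shift statistics ECRT actually uses.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_shift(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shift_features(
    logits_ctx: Sequence[float],
    logits_noctx: Sequence[float],
    hidden_ctx: Sequence[float],
    hidden_noctx: Sequence[float],
) -> Dict[str, float]:
    """Hypothetical shift features between the CTX and NOCTX passes."""
    logit_l2 = l2_shift(logits_ctx, logits_noctx)  # also validates lengths
    p_ctx = softmax(logits_ctx)
    p_noctx = softmax(logits_noctx)
    eps = 1e-12  # guards against division by an underflowed probability
    kl = sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(p_ctx, p_noctx)
        if p > 0.0
    )
    top1_changed = float(
        p_ctx.index(max(p_ctx)) != p_noctx.index(max(p_noctx))
    )
    return {
        "logit_kl": kl,                                 # KL(CTX || NOCTX)
        "logit_l2": logit_l2,
        "hidden_l2": l2_shift(hidden_ctx, hidden_noctx),
        "top1_changed": top1_changed,
    }
```

A feature vector like this could then be fed to a small classifier for Stage 1. In practice the logits and hidden states would come from the backbone LLM; here they are plain lists so the sketch stays self-contained.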
Empirical Robustness
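The study reports consistent improvements in Stage-1 balanced accuracy: +0.15 to +0.19 over external uncertainty and self-consistency baselines and +0.02 to +0.07 over the strongest adapted supervised baseline. ECRT also exceeds a single-stage white-box ablation and provides explicit subtype attribution, which suggests practical applicability in clinical decision-support scenarios.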
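Balanced accuracy is the mean of per-class recall, which is why it suits a Safe/Unsafe task where the classes may be unevenly represented. A minimal sketch of the metric, with made-up labels for illustration:

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Illustrative labels only: 8 Safe cases and 2 Unsafe cases.
y_true = ["Safe"] * 8 + ["Unsafe"] * 2
always_safe = ["Safe"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, always_safe)) / len(y_true)
print(plain_accuracy)                          # 0.8
print(balanced_accuracy(y_true, always_safe))  # 0.5
```

A detector that never flags anything scores 0.8 on plain accuracy here but only 0.5 on balanced accuracy, the same as chance on a two-class task.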
Demerits
Limited Generalizability
The benchmark and framework are tailored to diabetic retinopathy (DR) decision settings, raising questions about their applicability to other medical domains or conditions without significant adaptation.
Evidence-Grouped Splits Constraint
The study uses evidence-grouped (not patient-disjoint) splits, so records from the same patient may appear in both training and test sets. This could inflate the reported scores, because patient-level generalization is not explicitly evaluated.
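The difference between the two split strategies can be shown with a small sketch. The field names `evidence_id` and `patient_id` are hypothetical, as is the toy data; the benchmark's actual schema is not given in the abstract.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple

Sample = Dict[str, str]


def grouped_split(
    samples: Sequence[Sample],
    group_key: str,
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Sample], List[Sample]]:
    """Split so that no value of `group_key` appears in both train and test."""
    groups = sorted({s[group_key] for s in samples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [s for s in samples if s[group_key] not in test_groups]
    test = [s for s in samples if s[group_key] in test_groups]
    return train, test


def shared_values(train: Sequence[Sample], test: Sequence[Sample], key: str) -> Set[str]:
    """Values of `key` that occur on both sides of the split."""
    return {s[key] for s in train} & {s[key] for s in test}


# Toy data: one hypothetical patient contributes five evidence groups.
samples = [{"evidence_id": f"e{i}", "patient_id": "p1"} for i in range(5)]

train, test = grouped_split(samples, group_key="evidence_id")
print(shared_values(train, test, "evidence_id"))  # set(): evidence is disjoint
print(shared_values(train, test, "patient_id"))   # {'p1'}: the patient leaks
```

Passing `group_key="patient_id"` instead would give a patient-disjoint split, which is the stricter evaluation the review recommends.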
Computational Overhead
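The white-box approach, while interpretable, requires access to the model's internal representations and logits, which closed API-only models do not expose. Comparing CTX and NOCTX conditions also plausibly means running the model under both, which may add latency in real-time clinical settings, particularly with large LLMs.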
Expert Commentary
This article advances the evaluation and detection of hallucination risk in medical LLMs in the high-stakes setting of diabetic retinopathy (DR) decision support. RETINA-SAFE categorizes evidence relations as consistent, conflicting, or insufficient, which allows a more granular assessment of LLM behavior than a single hallucination label. ECRT's two-stage approach, built on internal representation and logit signals, improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and adds explicit subtype attribution; its margin over the strongest adapted supervised baseline is smaller, at +0.02 to +0.07. However, the reliance on evidence-grouped rather than patient-disjoint splits and the DR-specific design limit immediate generalizability. Future work should explore cross-domain validation and patient-disjoint evaluation to establish broader applicability. Even so, the emphasis on interpretability and clinical relevance makes the work a useful contribution at the intersection of AI safety and healthcare, with potential implications for regulatory frameworks and clinical practice.
Recommendations
- ✓ Expand RETINA-SAFE to include multi-domain datasets to validate the generalizability of the benchmark and ECRT framework beyond diabetic retinopathy.
- ✓ Conduct patient-disjoint evaluations to assess the framework’s robustness in real-world clinical scenarios, where patient-level generalization is essential.
- ✓ Explore hybrid approaches that combine white-box internal signals with external validation mechanisms to enhance both interpretability and reliability in clinical deployments.
Sources
Original: arXiv - cs.AI