
From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs

Zhe Yu, Wenpeng Xing, Meng Han · April 8, 2026

#cs.AI

arXiv:2604.05348v1

Abstract: Hallucinations in medical large language models (LLMs) remain a safety-critical issue, particularly when available evidence is insufficient or conflicting. We study this problem in diabetic retinopathy (DR) decision settings and introduce RETINA-SAFE, an evidence-grounded benchmark aligned with retinal grading records, comprising 12,522 samples. RETINA-SAFE is organized into three evidence-relation tasks: E-Align (evidence-consistent), E-Conflict (evidence-conflicting), and E-Gap (evidence-insufficient). We further propose ECRT (Evidence-Conditioned Risk Triage), a two-stage white-box detection framework: Stage 1 performs Safe/Unsafe risk triage, and Stage 2 refines unsafe cases into contradiction-driven versus evidence-gap risks. ECRT leverages internal representation and logit shifts under CTX/NOCTX conditions, with class-balanced training for robust learning. Under evidence-grouped (not patient-disjoint) splits across multiple backbones, ECRT provides strong Stage-1 risk triage and explicit subtype attribution, improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and by +0.02 to +0.07 over the strongest adapted supervised baseline, and consistently exceeds a single-stage white-box ablation on Stage-1 balanced accuracy. These findings support white-box internal signals grounded in retinal evidence as a practical route to interpretable medical LLM risk triage.

Executive Summary

The article presents RETINA-SAFE, an evidence-grounded benchmark of 12,522 samples, aligned with retinal grading records, for evaluating hallucination risk in medical LLMs in diabetic retinopathy (DR) decision settings. The benchmark comprises three evidence-relation tasks: E-Align (evidence-consistent), E-Conflict (evidence-conflicting), and E-Gap (evidence-insufficient). The authors also propose ECRT (Evidence-Conditioned Risk Triage), a two-stage white-box detection framework: Stage 1 performs Safe/Unsafe triage, and Stage 2 refines unsafe cases into contradiction-driven or evidence-gap risks. ECRT uses internal representation and logit shifts under CTX/NOCTX (contextualized and non-contextualized) conditions, with class-balanced training. Under evidence-grouped (not patient-disjoint) splits across multiple backbones, it improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and by +0.02 to +0.07 over the strongest adapted supervised baseline. The authors present these results as support for white-box internal signals grounded in retinal evidence as a practical route to interpretable medical LLM risk triage.

Key Points

▸ RETINA-SAFE introduces a novel, evidence-grounded benchmark for evaluating hallucination risks in medical LLMs, specifically tailored to diabetic retinopathy (DR) decision settings.
▸ The benchmark organizes tasks into three evidence-relation categories: E-Align (consistent evidence), E-Conflict (conflicting evidence), and E-Gap (insufficient evidence), providing a nuanced framework for assessing LLM performance.
▸ ECRT, a two-stage white-box detection framework, improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and by +0.02 to +0.07 over the strongest adapted supervised baseline, which supports the use of internal representation signals for medical LLM risk triage.
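The two-stage routing described in the key points can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the scorer callables, the thresholds, and the label strings are all hypothetical, and the abstract does not say how either stage is parameterized.

```python
from typing import Callable, Sequence

Scorer = Callable[[Sequence[float]], float]


def triage(features: Sequence[float],
           stage1_unsafe_prob: Scorer,
           stage2_conflict_prob: Scorer,
           unsafe_threshold: float = 0.5,
           conflict_threshold: float = 0.5) -> str:
    """Route one sample through a two-stage triage.

    Stage 1 decides Safe vs. Unsafe. Stage 2 runs only on unsafe cases and
    attributes them to a contradiction-driven or an evidence-gap risk.
    Both scorers and both thresholds are assumptions made for this sketch.
    """
    if stage1_unsafe_prob(features) < unsafe_threshold:
        return "safe"
    if stage2_conflict_prob(features) >= conflict_threshold:
        return "unsafe:contradiction"
    return "unsafe:evidence-gap"


# Toy scorers: read the probabilities straight out of the feature tuple.
def toy_stage1(f: Sequence[float]) -> float:
    return f[0]


def toy_stage2(f: Sequence[float]) -> float:
    return f[1]


samples = [(0.1, 0.9), (0.8, 0.9), (0.8, 0.2)]
labels = [triage(f, toy_stage1, toy_stage2) for f in samples]
print(labels)
```

Keeping Stage 2 behind the Stage-1 decision means the subtype classifier only ever sees cases already flagged as unsafe, which is one plausible reading of "refines unsafe cases" in the abstract.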

Merits

Novel Benchmark Design

RETINA-SAFE offers a rigorously structured benchmark that systematically evaluates hallucination risks in medical LLMs across three evidence-relation tasks, addressing a critical gap in current evaluation methodologies.

White-Box Detection Framework

ECRT uses internal representation and logit shifts under CTX/NOCTX conditions, which makes its risk triage more interpretable than black-box or external uncertainty baselines and allows unsafe cases to be attributed explicitly to contradiction-driven or evidence-gap risk.
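To make the idea of a "logit shift" concrete, the sketch below compares the answer logits a model might produce with the evidence in the prompt (CTX) and without it (NOCTX). The specific features (KL divergence, largest absolute shift, change of the top-ranked answer) are assumptions chosen for illustration; the abstract does not specify which shift statistics ECRT uses.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_shift_features(logits_ctx: Sequence[float],
                         logits_noctx: Sequence[float]) -> Dict[str, object]:
    """Summarize how the answer distribution moves when evidence is added.

    Hypothetical feature set for illustration only.
    """
    if len(logits_ctx) != len(logits_noctx):
        raise ValueError("both conditions must score the same answer options")
    p = softmax(logits_ctx)
    q = softmax(logits_noctx)
    eps = 1e-12  # guards against log(0) if a probability underflows
    kl = sum(pi * (math.log(pi) - math.log(max(qi, eps)))
             for pi, qi in zip(p, q) if pi > 0.0)
    max_abs_shift = max(abs(a - b) for a, b in zip(logits_ctx, logits_noctx))
    top1_changed = p.index(max(p)) != q.index(max(q))
    return {"kl_ctx_vs_noctx": kl,
            "max_abs_shift": max_abs_shift,
            "top1_changed": top1_changed}


# Evidence flips the preferred answer from option 1 to option 0.
feats = logit_shift_features([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
print(feats)
```

A feature vector like this could be fed to the Stage-1 and Stage-2 classifiers; a large shift suggests the evidence changed the model's answer, while a near-zero shift suggests the model is answering from its prior.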

Empirical Robustness

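The study reports Stage-1 balanced accuracy gains of +0.15 to +0.19 over external uncertainty and self-consistency baselines and +0.02 to +0.07 over the strongest adapted supervised baseline, across multiple backbones, together with explicit subtype attribution. ECRT also consistently exceeds a single-stage white-box ablation on Stage-1 balanced accuracy, suggesting practical applicability in clinical decision-support scenarios.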
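Balanced accuracy, the metric behind these gains, is the mean of per-class recall, so a detector cannot score well by always predicting the majority class. The sketch below uses made-up labels to show the difference from plain accuracy; the data are illustrative and not taken from the paper.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true: Sequence[Hashable],
                      y_pred: Sequence[Hashable]) -> float:
    """Mean recall over the classes that occur in y_true."""
    recalls = []
    for cls in sorted(set(y_true), key=str):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Illustrative imbalanced case: 8 safe samples, 2 unsafe samples,
# and a degenerate detector that labels everything "safe".
y_true = ["safe"] * 8 + ["unsafe"] * 2
y_pred = ["safe"] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

On this toy data the degenerate detector reaches 0.8 accuracy but only 0.5 balanced accuracy, which is why a gain of a few hundredths in balanced accuracy can be meaningful for a triage task with uneven class sizes.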

Demerits

Limited Generalizability

The benchmark and framework are tailored to diabetic retinopathy (DR) decision settings, raising questions about their applicability to other medical domains or conditions without significant adaptation.

Evidence-Grouped Splits Constraint

The study uses evidence-grouped (not patient-disjoint) splits, so records from the same patient may appear in both training and test sets. Patient-level generalization is therefore not explicitly evaluated, and the reported scores may be optimistic for unseen patients.
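The difference between the two split strategies can be illustrated with a small grouped-split helper. The record fields `evidence_id` and `patient_id` are hypothetical, and the toy data are constructed so that every patient contributes to every evidence group; the paper's actual data schema is not described in the abstract.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]


def grouped_split(records: List[Record], group_key: str,
                  test_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Split so that no value of `group_key` appears on both sides."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


def shared_values(train: List[Record], test: List[Record], key: str) -> Set[str]:
    """Values of `key` that occur in both the train and the test set."""
    return {r[key] for r in train} & {r[key] for r in test}


# Toy data: 4 patients x 5 evidence groups, one record per pair.
records = [{"patient_id": f"p{p}", "evidence_id": f"e{e}"}
           for p in range(4) for e in range(5)]

ev_train, ev_test = grouped_split(records, "evidence_id")
pt_train, pt_test = grouped_split(records, "patient_id")

print(len(shared_values(ev_train, ev_test, "patient_id")))  # patients on both sides
print(len(shared_values(pt_train, pt_test, "patient_id")))  # none on both sides
```

In this constructed example the evidence-grouped split keeps evidence groups disjoint but leaves every patient on both sides, while grouping by patient removes that overlap. Whether and how much overlap occurs in RETINA-SAFE itself is not stated in the abstract.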

Computational Overhead

The white-box approach, while interpretable, may impose computational overhead in real-time clinical settings, particularly when integrating with large-scale LLMs.

Expert Commentary

This article advances the evaluation and detection of hallucination risk in medical LLMs in the high-stakes setting of diabetic retinopathy (DR) decisions. RETINA-SAFE fills a gap in current benchmarks by categorizing evidence relations as consistent, conflicting, or insufficient, which allows a more granular assessment of LLM behavior. ECRT's two-stage approach, built on internal representation signals, offers a compelling alternative to black-box or external uncertainty baselines: it achieves higher Stage-1 balanced accuracy and attributes unsafe cases to an explicit subtype. However, the study's reliance on evidence-grouped splits and its DR-specific design may limit its immediate generalizability. Future work should explore cross-domain validation and patient-disjoint evaluations to establish broader applicability. Nonetheless, the work's emphasis on interpretability and clinical relevance makes it a notable contribution at the intersection of AI safety and healthcare, with potential implications for regulatory frameworks and clinical practice.

Recommendations

✓ Expand RETINA-SAFE to include multi-domain datasets to validate the generalizability of the benchmark and ECRT framework beyond diabetic retinopathy.
✓ Conduct patient-disjoint evaluations to assess the framework’s robustness in real-world clinical scenarios, where patient-level generalization is essential.
✓ Explore hybrid approaches that combine white-box internal signals with external validation mechanisms to enhance both interpretability and reliability in clinical deployments.

Sources

Original: arXiv - cs.AI
