EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models
arXiv:2603.19532v1 · Announce Type: new

Abstract: Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce \textbf{EvidenceRL}, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding ($G_{\max}@3$) rises from 47.6 to 78.2; hallucinations drop nearly 5$\times$ and evidence-supported diagnoses increase from 31.8\% to 61.6\%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8\% to 67.6\% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at https://github.com/Wizaaard/EvidenceRL.git.
Executive Summary
EvidenceRL is a reinforcement learning framework that addresses hallucinations in large language models by enforcing evidence adherence during training. It scores candidate responses for grounding (entailment with retrieved evidence) and correctness (agreement with reference answers), and optimizes the generator using Group Relative Policy Optimization (GRPO). The framework improves evidence grounding and faithfulness without sacrificing task accuracy in high-stakes domains such as cardiac diagnosis and legal reasoning: F1@3 rises from 37.0 to 54.5 on cardiac diagnosis (Llama-3.2-3B), Faithfulness rises from 32.8% to 67.6% on legal reasoning (Llama-3.1-8B), and hallucinations drop nearly 5×. The open-sourced code allows for further development and adaptation. This framework has the potential to enhance the reliability and trustworthiness of language models in critical applications.
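The training signal described above can be illustrated with a minimal sketch: each sampled response receives a scalar reward blending a grounding score and a correctness score, and GRPO derives advantages by normalizing rewards within the sampled group rather than against a learned value baseline. The weights, function names, and scores below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a GRPO-style update signal with a combined
# grounding + correctness reward. All weights and scores here are
# illustrative; the paper's exact reward design may differ.
from statistics import mean, stdev


def combined_reward(grounding: float, correctness: float,
                    w_ground: float = 0.5, w_correct: float = 0.5) -> float:
    """Blend grounding (entailment with retrieved evidence) and
    correctness (agreement with the reference answer)."""
    return w_ground * grounding + w_correct * correctness


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward against the
    mean and std of its own sampled group (no value network)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers scored as (grounding, correctness).
scores = [(0.9, 1.0), (0.4, 1.0), (0.8, 0.0), (0.2, 0.0)]
rewards = [combined_reward(g, c) for g, c in scores]
advantages = group_relative_advantages(rewards)
```

The response that is both well grounded and correct receives the largest positive advantage, so the policy is pushed toward answers that satisfy both criteria at once, rather than correctness alone.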
Key Points
- ▸ EvidenceRL is a reinforcement learning framework that enforces evidence adherence in large language models.
- ▸ It improves evidence grounding and faithfulness without sacrificing task accuracy in high-stakes domains.
- ▸ On cardiac diagnosis it raises F1@3 from 37.0 to 54.5 and cuts hallucinations nearly 5×; on legal reasoning it raises Faithfulness from 32.8% to 67.6%.
Merits
Strength in Addressing Hallucinations
EvidenceRL effectively reduces hallucinations in large language models by enforcing evidence adherence, making it a significant advancement in addressing this critical issue.
Improved Performance in High-Stakes Domains
The framework demonstrates substantial improvement in F1 scores, grounding, and faithfulness metrics in high-stakes domains such as cardiac diagnosis and legal reasoning.
Open-Source Code for Further Development
The open-sourced code allows for further development, adaptation, and integration of EvidenceRL into various applications.
Demerits
Limited Evaluation Across Domains
While the framework is evaluated in two high-stakes domains, its performance and effectiveness may vary in other domains, and further evaluation is necessary to establish its generalizability.
Potential Overreliance on Retrieved Evidence
Because the framework's rewards are tied to retrieved evidence, the model may anchor on whatever documents the retriever returns, potentially neglecting critical information absent from the retrieved set.
Expert Commentary
EvidenceRL is a significant advancement in addressing the issue of hallucinations in large language models. By enforcing evidence adherence during training, the framework demonstrates substantial improvement in F1 scores, grounding, and faithfulness metrics. However, its limited evaluation across domains and potential overreliance on retrieved evidence are notable limitations. As the framework continues to evolve, it is essential to address these limitations and consider its broader implications for the development and deployment of AI systems in critical applications.
Recommendations
- ✓ Further evaluation of EvidenceRL across various domains and applications is necessary to establish its generalizability and effectiveness.
- ✓ Developers should consider incorporating additional measures to mitigate overreliance on retrieved evidence and to handle cases where relevant information is missing from the evidence pool.
Sources
Original: arXiv - cs.CL