
A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing


arXiv:2602.14158v1 Announce Type: new Abstract: Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 ± 0.04; ROUGE-2 0.226 ± 0.03; BLEU 0.098 ± 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a Clinical Reasoning agent (fine-tuned LLaMA) produces structured explanations, an Evidence Retrieval agent queries PubMed to ground responses in recent literature, and a Refinement agent (DeepSeek R1) improves clarity and factual consistency; an optional human validation path is triggered for high-risk or high-uncertainty cases. Safety mechanisms include Monte Carlo dropout and perplexity-based uncertainty scoring, plus lexical and sentiment-based bias detection supported by LIME/SHAP-based analyses. In evaluation, the full system achieves 87% accuracy with relevance around 0.80, and evidence augmentation reduces uncertainty (perplexity 4.13) compared to base responses, with mean end-to-end latency of 36.5 seconds under the reported configuration. Overall, the results indicate that agent specialisation and verification layers can mitigate key single-model limitations and provide a practical, extensible design for evidence-based and bias-aware medical AI.
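ROUGE-1, the headline metric in the abstract, is an F1 score over unigram overlap between a generated answer and a reference answer. The toy implementation below illustrates the computation; it ignores stemming, stopword handling, and the other options a full ROUGE package would provide, and the example sentences are illustrative only.

```python
# Toy ROUGE-1 F1: unigram overlap between candidate and reference,
# counted with multiplicity via Counter intersection.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # matched unigrams (with counts)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("anemia is low red blood cell count",
                  "anemia means a low red blood cell count")
print(round(score, 3))
```

Scores near 0.5, as reported for DeepSeek R1, indicate substantial but far from verbatim lexical agreement with reference answers.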

Executive Summary

The article presents a multi-agent framework for medical AI that combines fine-tuned GPT, LLaMA, and DeepSeek R1 models for evidence-based and bias-aware clinical query processing. The framework comprises a Clinical Reasoning agent, an Evidence Retrieval agent, and a Refinement agent, each specialised for one stage of medical question answering. In evaluation, the full system achieves 87% accuracy with relevance scores around 0.80 and reduced uncertainty, addressing key limitations of single-model approaches. The framework also incorporates safety mechanisms such as uncertainty estimation and bias detection, making it a practical and extensible design for medical AI applications.
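The three-stage pipeline just described can be sketched as a simple chain with a human-review branch. The agent bodies below are placeholders (the paper uses fine-tuned LLaMA, a PubMed retriever, and DeepSeek R1); the function names, scores, and the uncertainty threshold are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineResult:
    answer: str
    evidence: list = field(default_factory=list)
    uncertainty: float = 1.0
    needs_human_review: bool = False

def clinical_reasoning_agent(question: str) -> str:
    # Placeholder for the fine-tuned LLaMA structured-explanation step.
    return f"Structured explanation for: {question}"

def evidence_retrieval_agent(draft: str) -> list:
    # Placeholder for the PubMed grounding step.
    return ["PMID:00000000 (illustrative citation)"]

def refinement_agent(draft: str, evidence: list) -> tuple:
    # Placeholder for the DeepSeek R1 refinement step; returns the
    # refined answer plus an uncertainty score in [0, 1].
    return draft + " [refined with evidence]", 0.25

UNCERTAINTY_THRESHOLD = 0.5  # assumed value; the paper does not report one

def answer_query(question: str) -> PipelineResult:
    draft = clinical_reasoning_agent(question)
    evidence = evidence_retrieval_agent(draft)
    answer, uncertainty = refinement_agent(draft, evidence)
    # High-uncertainty (or high-risk) cases fall through to human validation.
    return PipelineResult(answer, evidence, uncertainty,
                          needs_human_review=uncertainty > UNCERTAINTY_THRESHOLD)
```

Keeping the stages as separate functions is what makes the design "modular": any single agent can be swapped or retrained without touching the others.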

Key Points

  • Fine-tuning of GPT, LLaMA, and DeepSeek R1 models on MedQuAD-derived medical QA data.
  • Implementation of a modular multi-agent pipeline for clinical reasoning, evidence retrieval, and response refinement.
  • Incorporation of safety mechanisms like Monte Carlo dropout, perplexity-based uncertainty scoring, and bias detection.
  • Achievement of 87% accuracy with evidence augmentation reducing uncertainty and improving response reliability.
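The two uncertainty signals named above can be illustrated in a few lines. Perplexity is the exponential of the negative mean token log-probability; Monte Carlo dropout keeps dropout active at inference and treats the spread across repeated stochastic passes as an uncertainty estimate. The numeric inputs below are illustrative, not values from the paper.

```python
import math
import statistics

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def mc_dropout_uncertainty(pass_scores):
    """Spread (sample std. dev.) of per-pass confidence scores from
    N forward passes with dropout left enabled at inference time."""
    return statistics.stdev(pass_scores)

logprobs = [-0.8, -1.2, -2.0, -1.5]        # assumed token log-probs
passes = [0.81, 0.78, 0.84, 0.80, 0.79]    # assumed per-pass scores
print(round(perplexity(logprobs), 2))
print(round(mc_dropout_uncertainty(passes), 3))
```

Lower perplexity and lower per-pass spread both indicate a more confident response, which is the sense in which the reported perplexity of 4.13 after evidence augmentation signals reduced uncertainty.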

Merits

Comprehensive Framework

The multi-agent framework effectively addresses the limitations of single-model approaches by combining specialized agents for different tasks, enhancing the overall reliability and accuracy of medical AI systems.

Evidence-Based Responses

The integration of an Evidence Retrieval agent ensures that responses are grounded in recent literature, improving the factual consistency and reliability of the system's outputs.
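One concrete way to query PubMed programmatically, as the Evidence Retrieval agent does, is the NCBI E-utilities `esearch` endpoint. Whether the authors use this client is an assumption; the sketch below only builds the request URL, so it runs without network access.

```python
# Build an NCBI E-utilities esearch URL for a PubMed query.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(query: str, max_results: int = 5) -> str:
    params = {
        "db": "pubmed",
        "term": query,
        "retmax": max_results,
        "retmode": "json",
        "sort": "pub_date",  # favour recent literature, per the paper's goal
    }
    return f"{EUTILS}?{urlencode(params)}"

url = pubmed_search_url("metformin type 2 diabetes first-line therapy")
print(url)
```

Fetching this URL returns a JSON list of PMIDs, which a retrieval agent would then resolve to abstracts (e.g. via `efetch`) before attaching them as grounding evidence.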

Bias and Uncertainty Mitigation

The inclusion of bias detection and uncertainty estimation mechanisms makes the framework more robust and trustworthy for clinical applications.
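A minimal lexical check in the spirit of the paper's lexical and sentiment-based bias detection might flag overgeneralising language in responses. The term list and scoring below are illustrative assumptions, not the authors' lexicon.

```python
# Flag overgeneralising terms in a generated response (illustrative lexicon).
BIAS_TERMS = {"always", "never", "all patients", "no patient", "obviously"}

def lexical_bias_flags(text: str):
    """Return the overgeneralising terms found in a response, sorted."""
    lowered = text.lower()
    return sorted(t for t in BIAS_TERMS if t in lowered)

flags = lexical_bias_flags("All patients always respond to this drug.")
print(flags)
```

Flagged responses would then be routed to the refinement or human-validation stage rather than returned directly; model-agnostic explainers like LIME or SHAP, as used in the paper, can complement this by showing which input features drove a given output.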

Demerits

High Latency

The mean end-to-end latency of 36.5 seconds, while acceptable for some applications, may be too slow for real-time clinical decision-making scenarios.

Complexity

The complexity of the multi-agent framework may pose challenges for implementation and maintenance, requiring significant computational resources and expertise.

Limited Evaluation Scope

The evaluation is based on a specific dataset and configuration, which may not fully capture the performance of the system in diverse real-world clinical settings.

Expert Commentary

The proposed multi-agent framework for medical AI represents a significant advancement in the field, addressing critical limitations of single-model approaches. By combining specialized agents for clinical reasoning, evidence retrieval, and response refinement, the framework enhances the reliability and accuracy of medical question-answering systems. The inclusion of safety mechanisms such as bias detection and uncertainty estimation further strengthens the system's robustness, making it more trustworthy for clinical applications. However, the high latency and complexity of the framework pose challenges for real-time implementation and maintenance. Future research should focus on optimizing the system for faster response times and reducing the computational overhead. Additionally, broader evaluations in diverse clinical settings would provide a more comprehensive understanding of the system's performance. Overall, the framework sets a strong foundation for evidence-based and bias-aware medical AI, paving the way for more reliable and ethical AI applications in healthcare.

Recommendations

  • Further optimization of the multi-agent framework to reduce latency and improve real-time performance.
  • Conducting broader evaluations in diverse clinical settings to assess the system's performance and reliability in real-world scenarios.
  • Exploring the integration of additional safety mechanisms and ethical considerations to ensure the system's compliance with healthcare regulations and standards.
