Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?

arXiv:2603.04421v1 Announce Type: new Abstract: Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.

Executive Summary

This study investigates the impact of vendor diversity on the performance of multi-agent large language model (LLM) systems in clinical diagnosis. The authors compare Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks, instantiating three doctor agents with models from distinct vendors (o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet) and evaluating them on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform their single-vendor counterparts, achieving state-of-the-art recall and accuracy. An overlap analysis traces the gain to pooled complementary inductive biases: heterogeneous teams surface correct diagnoses that individual models, or homogeneous teams, collectively miss. The findings position vendor diversity as a key design principle for robust clinical diagnostic systems.

Key Points

  • Mixed-vendor MAC configurations outperform single-LLM and single-vendor counterparts on RareBench and DiagnosisArena
  • Vendor diversity pools complementary inductive biases across model families
  • Overlap analysis shows heterogeneous teams surface correct diagnoses that homogeneous teams collectively miss
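The pooling mechanism the paper credits for the gain can be illustrated with a minimal sketch. Everything below is hypothetical: the agent functions stand in for calls to the three vendor models, the case ID and diagnosis lists are invented, and the scoring rule (vote count, then best individual rank) is one plausible way to aggregate ranked differentials, not the paper's actual protocol.

```python
from collections import Counter

# Hypothetical stand-ins for vendor API calls; in practice each would
# query a different model (e.g. o4-mini, Gemini-2.5-Pro, Claude-4.5-Sonnet)
# and parse a ranked differential from its response.
def agent_a(case): return ["lupus", "lyme disease", "sarcoidosis"]
def agent_b(case): return ["lupus", "sarcoidosis", "behcet disease"]
def agent_c(case): return ["fabry disease", "lupus", "sarcoidosis"]

def pooled_differential(case, agents):
    """Pool ranked differentials from heterogeneous agents.

    Each agent proposes a ranked list; candidates are scored first by how
    many agents propose them, with ties broken by the best individual
    rank. A diagnosis one model misses can still surface via the others,
    which is the complementary-bias effect the study describes.
    """
    votes = Counter()          # how many agents proposed each diagnosis
    best_rank = {}             # best (lowest) rank any agent gave it
    for agent in agents:
        for rank, dx in enumerate(agent(case)):
            votes[dx] += 1
            best_rank[dx] = min(best_rank.get(dx, rank), rank)
    return sorted(votes, key=lambda dx: (-votes[dx], best_rank[dx]))

ranking = pooled_differential("case-001", [agent_a, agent_b, agent_c])
# → ["lupus", "sarcoidosis", ...]: diagnoses proposed by all three agents
#   outrank those proposed by only one.
```

A single-vendor team would correspond to calling three copies of the same agent, which leaves the pooled ranking identical to the individual one; the benefit only appears when the agents' lists genuinely differ.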

Merits

Strength in Experimental Design

The study employs a robust experimental design, comparing Single-LLM, Single-Vendor, and Mixed-Vendor MAC frameworks, and using three distinct LLMs from different vendors.

Insightful Analysis of Results

The authors provide a thorough analysis of the results, highlighting the importance of vendor diversity and the underlying mechanism of complementary inductive biases.

Demerits

Limited Generalizability

The study's findings may not be generalizable to other domains or applications, as the focus is on clinical diagnosis and the specific LLMs used.

Lack of Real-World Validation

The study's results are based on benchmark datasets rather than live clinical workflows, so they may not reflect real-world performance; prospective validation in clinical settings is still needed.

Expert Commentary

The study makes a valuable contribution to AI-assisted clinical diagnosis. Its central claim, that complementary inductive biases across vendors drive the improvement, is well supported by the overlap analysis. That said, the limitations noted above, restricted domain scope and the absence of real-world validation, mean the results should be read as benchmark evidence rather than clinical proof. The broader insight, that model diversity mitigates correlated failure modes in multi-agent systems, is relevant well beyond the medical setting.

Recommendations

  • Future studies should investigate the generalizability of the study's findings to other domains and applications.
  • Researchers should prioritize the validation and testing of AI-powered clinical diagnostic systems in real-world settings to better understand their performance and limitations.
