MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models

arXiv:2602.21950v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity. We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case. Mirroring clinical workflow, we evaluate 18 MLLMs on differential diagnosis (DDx) generation and final diagnosis (FDx) selection. While top models often match or even outperform human experts on DDx generation, all MLLMs exhibit a much larger DDx–FDx performance gap compared to expert clinicians, indicating a failure mode in synthesis of heterogeneous CE types. Ablations attribute this failure to (i) overreliance on less discriminative textual CE (e.g., medical history) and (ii) a cross-modal CE utilization gap. We introduce Evidence Sensitivity to quantify the latter and show that a smaller gap correlates with higher diagnostic accuracy. Finally, we demonstrate how it can be used to guide interventions to improve model performance. We will open-source our benchmark and code.

Executive Summary

This article introduces MEDSYN, a novel benchmark for multimodal large language models (MLLMs) in complex clinical cases. The benchmark evaluates 18 MLLMs on differential diagnosis (DDx) generation and final diagnosis (FDx) selection, revealing that models show a much larger DDx–FDx performance gap than expert clinicians. Ablations attribute this gap to overreliance on less discriminative textual evidence and a cross-modal evidence utilization gap. The article introduces Evidence Sensitivity to quantify the latter and shows that a smaller cross-modal gap correlates with higher diagnostic accuracy. The findings have significant implications for improving MLLM performance in clinical settings, with potential applications in medical education, clinical decision support, and healthcare policy.

Key Points

  • MEDSYN is a multilingual, multimodal benchmark for complex clinical cases
  • MLLMs exhibit a much larger DDx–FDx performance gap than expert clinicians
  • Overreliance on less discriminative textual evidence and a cross-modal utilization gap drive this performance gap
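The DDx–FDx gap can be made concrete with a minimal sketch. The scoring rules below are assumptions for illustration, not the paper's exact protocol: DDx accuracy counts a case as correct if the gold diagnosis appears anywhere in the model's differential list, while FDx accuracy requires the single selected final diagnosis to match.

```python
# Hypothetical sketch of the DDx-FDx gap; field names and scoring
# rules are assumptions, not the benchmark's exact protocol.

def ddx_accuracy(cases):
    """Fraction of cases whose gold diagnosis appears in the model's DDx list."""
    return sum(c["gold"] in c["ddx"] for c in cases) / len(cases)

def fdx_accuracy(cases):
    """Fraction of cases where the model's single final diagnosis is correct."""
    return sum(c["fdx"] == c["gold"] for c in cases) / len(cases)

def ddx_fdx_gap(cases):
    """A larger gap means the model surfaces the right answer in its
    differential but fails to commit to it as the final diagnosis."""
    return ddx_accuracy(cases) - fdx_accuracy(cases)

# Toy data: two cases, both with the gold diagnosis in the DDx list,
# but only one with the correct final selection.
cases = [
    {"gold": "sarcoidosis", "ddx": ["sarcoidosis", "lymphoma"], "fdx": "lymphoma"},
    {"gold": "lupus",       "ddx": ["lupus", "RA"],             "fdx": "lupus"},
]
print(ddx_fdx_gap(cases))  # 1.0 - 0.5 = 0.5
```

Under this reading, a clinician-like profile would show FDx accuracy tracking DDx accuracy closely, while the paper's finding is that MLLMs lose far more between the two stages.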

Merits

Comprehensive evaluation of MLLMs in clinical settings

The article provides a thorough assessment of MLLMs' capabilities and limitations in complex clinical cases, highlighting areas for improvement.

Introduction of Evidence Sensitivity

The concept of Evidence Sensitivity offers a valuable tool for quantifying the utilization gap and guiding interventions to improve model performance.
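One plausible reading of such a metric, sketched below under stated assumptions (the paper's exact definition is not reproduced here), is the accuracy drop observed when a single clinical-evidence (CE) type is withheld: a near-zero drop for a visual CE type would suggest the model is effectively ignoring that modality.

```python
# Hypothetical sketch of an Evidence Sensitivity measure: the accuracy
# drop when one clinical-evidence (CE) type is ablated. This is an
# illustrative assumption, not the paper's exact formulation.

def accuracy(predictions, golds):
    """Exact-match diagnostic accuracy."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

def evidence_sensitivity(full_preds, ablated_preds, golds):
    """Accuracy with all CE types minus accuracy with one CE type withheld.
    Low sensitivity to a visual CE type indicates under-utilization."""
    return accuracy(full_preds, golds) - accuracy(ablated_preds, golds)

# Toy predictions over four cases ("X" marks a wrong diagnosis).
golds      = ["A", "B", "C", "D"]
full_preds = ["A", "B", "C", "D"]   # all CE types available
no_imaging = ["A", "B", "X", "X"]   # imaging CE withheld
no_history = ["A", "B", "C", "X"]   # textual history withheld

print(evidence_sensitivity(full_preds, no_imaging, golds))  # 1.0 - 0.5 = 0.5
print(evidence_sensitivity(full_preds, no_history, golds))  # 1.0 - 0.75 = 0.25
```

Comparing sensitivities across modalities in this way gives a per-model profile of which evidence types actually influence the diagnosis, which is how the article suggests the metric can guide interventions.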

Demerits

Limited generalizability to real-world clinical scenarios

The benchmark's complexity and specificity may limit its applicability to everyday clinical practice, potentially reducing its utility in real-world settings.

Need for larger, more diverse datasets

The study's reliance on a fixed set of cases and evidence types may hinder the development of more robust and generalizable models, underscoring the importance of larger, more diverse datasets.

Expert Commentary

The article represents a significant step forward in the development and evaluation of multimodal large language models for clinical applications. By highlighting the performance gap between DDx and FDx, the study underscores the importance of careful design and training of MLLMs to ensure their effective integration into clinical workflows. The introduction of Evidence Sensitivity offers a valuable tool for quantifying the utilization gap and guiding interventions to improve model performance. However, the study's limitations, including the need for larger, more diverse datasets, highlight the importance of continued research and development in this area. As MLLMs continue to play an increasingly prominent role in clinical decision-making, it is essential that we prioritize the development of robust, generalizable models that can be trusted to support high-quality patient care.

Recommendations

  • Future studies should aim to develop more diverse and comprehensive benchmarks for MLLMs, incorporating a wider range of evidence types and clinical scenarios.
  • Researchers should prioritize the development of more robust and generalizable MLLMs, incorporating techniques such as transfer learning and multimodal fusion to improve performance and reduce performance gaps.
