When Does Multimodal Learning Help in Healthcare? A Benchmark on EHR and Chest X-Ray Fusion

arXiv:2602.23614v1 Announce Type: new Abstract: Machine learning holds promise for advancing clinical decision support, yet it remains unclear when multimodal learning truly helps in practice, particularly under modality missingness and fairness constraints. In this work, we conduct a systematic benchmark of multimodal fusion between Electronic Health Records (EHR) and chest X-rays (CXR) on standardized cohorts from MIMIC-IV and MIMIC-CXR, aiming to answer four fundamental questions: when multimodal fusion improves clinical prediction, how different fusion strategies compare, how robust existing methods are to missing modalities, and whether multimodal models achieve algorithmic fairness. Our study reveals several key insights. Multimodal fusion improves performance when modalities are complete, with gains concentrating in diseases that require complementary information from both EHR and CXR. While cross-modal learning mechanisms capture clinically meaningful dependencies beyond simple concatenation, the rich temporal structure of EHR introduces strong modality imbalance that architectural complexity alone cannot overcome. Under realistic missingness, multimodal benefits rapidly degrade unless models are explicitly designed to handle incomplete inputs. Moreover, multimodal fusion does not inherently improve fairness, with subgroup disparities mainly arising from unequal sensitivity across demographic groups. To support reproducible and extensible evaluation, we further release a flexible benchmarking toolkit that enables plug-and-play integration of new models and datasets. Together, this work provides actionable guidance on when multimodal learning helps, when it fails, and why, laying the foundation for developing clinically deployable multimodal systems that are both effective and reliable. The open-source toolkit can be found at https://github.com/jakeykj/CareBench.

Executive Summary

This study benchmarks multimodal fusion of Electronic Health Records (EHR) and chest X-rays (CXR) on standardized cohorts from MIMIC-IV and MIMIC-CXR to determine when multimodal learning actually helps in clinical prediction. The authors find that fusion improves performance when both modalities are complete, with gains concentrated in diseases that require complementary information from EHR and CXR. Under realistic missingness, however, these benefits degrade rapidly unless models are explicitly designed to handle incomplete inputs. Fusion also does not inherently improve fairness: subgroup disparities arise mainly from unequal sensitivity across demographic groups. Alongside an open-source benchmarking toolkit, the authors offer actionable guidance on when multimodal learning helps, when it fails, and why, laying the groundwork for clinically deployable multimodal systems and highlighting the importance of modality missingness and fairness constraints in clinical decision support.

Key Points

  • Multimodal fusion improves performance when modalities are complete, particularly for diseases requiring complementary information.
  • Multimodal benefits degrade under realistic missingness unless models are designed to handle incomplete inputs.
  • Multimodal fusion does not inherently improve fairness, with disparities arising from unequal sensitivity across demographic groups.
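The second point above, that benefits degrade under missingness unless models handle incomplete inputs, is often addressed by explicitly encoding modality presence at fusion time. The sketch below is purely illustrative and is not the paper's method; the function `fuse_with_missingness` and all embedding dimensions are hypothetical. It shows a late-fusion scheme that zero-fills an absent CXR embedding and appends a presence indicator so a downstream classifier can learn modality-specific behavior rather than silently treating missing data as zeros.

```python
import numpy as np

def fuse_with_missingness(ehr_emb, cxr_emb=None, dim_cxr=4):
    """Late fusion that tolerates a missing CXR modality.

    Concatenates the EHR embedding, the CXR embedding (zero-filled
    when absent), and a binary presence flag. The flag lets a
    downstream model distinguish "CXR absent" from "CXR all zeros".
    """
    if cxr_emb is None:
        cxr_emb = np.zeros(dim_cxr)  # placeholder for the absent modality
        present = 0.0
    else:
        present = 1.0
    return np.concatenate([ehr_emb, cxr_emb, [present]])

# Usage: same fused dimensionality whether or not CXR is available.
ehr = np.array([0.2, 0.5, 0.1])
full = fuse_with_missingness(ehr, np.array([0.3, 0.7, 0.0, 0.9]))
ehr_only = fuse_with_missingness(ehr)  # CXR missing at inference time
```

Keeping the fused vector the same length in both cases means a single classifier head can serve complete and incomplete inputs, which is one simple design in the spirit of the robustness the benchmark evaluates.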

Merits

Strength in Methodology

The study employs a systematic benchmarking methodology, using standardized cohorts from MIMIC-IV and MIMIC-CXR, which makes its comparison of fusion strategies reproducible and controlled.

Usefulness of Findings

The study provides actionable guidance on when multimodal learning helps, when it fails, and why, which lays the foundation for developing clinically deployable multimodal systems.

Demerits

Limitation in Generalizability

The study's findings may not be directly generalizable to other healthcare settings or diseases, as the study focuses on a specific dataset and modality combination.

Need for Further Research

The study highlights the importance of considering modality missingness and fairness constraints in multimodal learning approaches, but further research is needed to fully address these issues.

Expert Commentary

These findings carry significant implications for the design of clinical decision support systems and underscore the need to account for modality missingness and fairness constraints in multimodal learning. The methodology is robust and the conclusions are actionable, though the limits on generalizability and the open questions noted above temper how far the results can be extended. As artificial intelligence in healthcare continues to evolve, this study makes a valuable contribution to the discussion of multimodal learning, its benefits, and its limitations.

Recommendations

  • Future research should focus on developing multimodal learning approaches that can handle modality missingness and ensure fairness across demographic groups.
  • Policymakers should prioritize the development of clinical decision support systems that incorporate multimodal learning approaches, while also ensuring that these systems are fair and transparent.
