Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction
arXiv:2602.17689v1 (Announce Type: cross)

Abstract: Medical vision-language models show strong potential for joint reasoning over medical images and clinical text, but their performance often degrades under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles. Existing multi-modal pre-training methods largely overlook robustness, treating it as a downstream adaptation problem. In this work, we propose Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework that explicitly incorporates robustness objectives into masked vision-language learning. Robust-MMR integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to encourage domain-invariant representations. We evaluate Robust-MMR on multiple medical vision-language benchmarks, including medical visual question answering (VQA-RAD, SLAKE, VQA-2019), cross-domain image-text classification (MELINDA), and robust image-caption retrieval (ROCO). Robust-MMR achieves 78.9% cross-domain accuracy on VQA-RAD, outperforming the strongest baseline by 3.8 percentage points, and reaches 74.6% and 77.0% accuracy on SLAKE and VQA-2019, respectively. Under perturbed evaluation, Robust-MMR improves VQA-RAD accuracy from 69.1% to 75.6%. For image-text classification, cross-domain MELINDA accuracy increases from 70.3% to 75.2%, while retrieval experiments show a reduction in mean rank degradation from over 16 to 4.1 under perturbation. Qualitative results further demonstrate improved clinical reasoning for disease detection and structural abnormality assessment. These findings show that explicitly modeling robustness during pre-training leads to more reliable and transferable medical vision-language representations for real-world deployment.
Executive Summary
The article introduces Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework designed to enhance the robustness of medical vision-language models. By incorporating domain-invariant representations through asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints, Robust-MMR addresses the challenge of domain shift in medical imaging and text data. The study demonstrates significant improvements in cross-domain accuracy and robustness across multiple benchmarks, including medical visual question answering, image-text classification, and retrieval tasks. The findings underscore the importance of explicitly modeling robustness during pre-training to achieve more reliable and transferable medical vision-language representations.
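The summary names three robustness objectives but the source gives no equations. As a rough illustration only, the toy sketch below combines MSE-based stand-ins for the three terms under hypothetical weights `lambda_dc` and `lambda_mr`; every function name, term form, and weight here is an assumption, not the paper's actual formulation.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def robust_mmr_loss(recon, target, mask,
                    z_clean, z_perturbed,
                    z_joint, z_image_only, z_text_only,
                    lambda_dc=0.5, lambda_mr=0.5):
    """Toy composite pre-training objective with three terms:
    1) masked reconstruction, scored on masked positions only;
    2) domain consistency between clean and perturbed views;
    3) modality resilience: uni-modal embeddings stay near the joint one.
    """
    l_rec = mse(np.asarray(recon)[mask], np.asarray(target)[mask])
    l_dc = mse(z_clean, z_perturbed)
    l_mr = 0.5 * (mse(z_joint, z_image_only) + mse(z_joint, z_text_only))
    return l_rec + lambda_dc * l_dc + lambda_mr * l_mr

# With identical embeddings the consistency terms vanish and only the
# reconstruction error remains.
target = np.ones((8, 4))
recon = np.zeros((8, 4))
mask = np.array([True] * 4 + [False] * 4)
z = np.ones(16)
loss = robust_mmr_loss(recon, target, mask, z, z, z, z, z)  # -> 1.0
```

The design intuition the abstract suggests is that the two regularizers pull representations of the same image-text pair together across domains and modality dropouts, so the encoder cannot rely on domain- or modality-specific shortcuts to solve the reconstruction task.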
Key Points
- Introduction of the Robust-MMR framework for robust pre-training of medical vision-language models.
- Integration of domain-invariant representations through perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints.
- Consistent performance gains across medical vision-language benchmarks, e.g., 78.9% cross-domain accuracy on VQA-RAD, 3.8 percentage points above the strongest baseline.
- Enhanced robustness under perturbed evaluation, improving VQA-RAD accuracy from 69.1% to 75.6%.
- Implications for real-world deployment of medical vision-language models.
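The asymmetric perturbation-aware masking mentioned above could, in spirit, look like the toy sketch below: image patches are masked more aggressively than text tokens, with the image ratio scaled by a perturbation-strength signal. The ratios, sizes, and scaling rule are illustrative assumptions, not the paper's hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(42)

def asymmetric_mask(n_image_patches, n_text_tokens,
                    image_ratio=0.75, text_ratio=0.30,
                    perturbation_strength=0.0):
    """Sample independent boolean masks for each modality.

    Image patches get a higher base masking ratio than text tokens,
    and the image ratio grows with an (assumed) perturbation-strength
    signal, capped at 0.95.
    """
    img_ratio = min(image_ratio + 0.2 * perturbation_strength, 0.95)
    img_mask = rng.random(n_image_patches) < img_ratio
    txt_mask = rng.random(n_text_tokens) < text_ratio
    return img_mask, txt_mask

# 196 ViT-style patches and 32 report tokens are illustrative sizes.
img_mask, txt_mask = asymmetric_mask(196, 32)
```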
Merits
Innovative Framework
Robust-MMR introduces a novel approach to pre-training that explicitly addresses robustness, which is a critical issue in medical vision-language models.
Comprehensive Evaluation
The study evaluates Robust-MMR on multiple benchmarks, providing a thorough assessment of its performance and robustness.
Practical Impact
The improvements in cross-domain accuracy and robustness have direct implications for the real-world deployment of medical vision-language models.
Demerits
Limited Generalizability
The study focuses on specific medical vision-language tasks and may not fully generalize to other domains or applications.
Complexity
The Robust-MMR framework introduces additional complexity, which may require significant computational resources and expertise for implementation.
Data Dependency
The effectiveness of Robust-MMR is highly dependent on the quality and diversity of the training data, which may not always be available.
Expert Commentary
The article presents a significant advancement in the field of medical vision-language models by introducing the Robust-MMR framework. The explicit incorporation of robustness objectives during pre-training addresses a critical gap in existing methods, which often treat robustness as a downstream adaptation problem. The comprehensive evaluation across multiple benchmarks demonstrates the framework's effectiveness in improving cross-domain accuracy and robustness under perturbed conditions. The study's findings have important implications for the real-world deployment of medical vision-language models, highlighting the need for more reliable and transferable representations. However, the complexity and data dependency of the Robust-MMR framework pose challenges for widespread adoption. Future research should focus on simplifying the framework and exploring its applicability to other domains. Overall, the article makes a valuable contribution to the field and sets a new standard for robust pre-training in medical vision-language models.
Recommendations
- Further exploration of the Robust-MMR framework's applicability to other medical and non-medical domains.
- Development of simplified and more efficient versions of the Robust-MMR framework to facilitate broader adoption.