Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
arXiv:2603.09095v1 Announce Type: new

Abstract: Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
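To make the core recipe concrete, here is a minimal sketch of the self-distillation idea the abstract describes: collect the model's own text-mode reasoning traces, pair each trace with an image rendering of the same question, and use the pairs for supervised fine-tuning. This is an illustration under assumptions, not the authors' code; in particular, `model.generate_text` is an assumed interface, and the rendering helper is deliberately simplified.

```python
# Hedged sketch of the paper's self-distillation recipe: pair the model's own
# pure-text reasoning traces with image renderings of the same questions.
from dataclasses import dataclass
from PIL import Image, ImageDraw, ImageFont


@dataclass
class DistillExample:
    image: Image.Image  # the question rendered as pixels
    target: str         # the model's own text-mode reasoning trace


def render_text(text: str, width: int = 768) -> Image.Image:
    """Render a question string onto a white canvas (simplified: no word wrap)."""
    font = ImageFont.load_default()
    height = 32 + 24 * (text.count("\n") + 1)
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).multiline_text((16, 16), text, fill="black", font=font)
    return img


def build_distillation_set(model, questions: list[str]) -> list[DistillExample]:
    examples = []
    for q in questions:
        # 1) Query the model in pure text mode to obtain its reasoning trace.
        #    `generate_text` is an assumed interface, not a real library API.
        trace = model.generate_text(q)
        # 2) Pair the *image* rendering of the question with that text trace.
        examples.append(DistillExample(image=render_text(q), target=trace))
    return examples
```

The resulting (image, trace) pairs would then feed a standard vision-language fine-tuning loop; the paper reports this raising image-mode GSM8K accuracy from 30.71% to 92.72%.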
Executive Summary
This study systematically investigates the "modality gap" in multimodal large language models (MLLMs): models perform worse when processing text rendered as images than when the same content is provided as textual tokens. The authors evaluate seven MLLMs across seven benchmarks in five input modes and find that the gap is task- and data-dependent. They identify rendering choices such as font and resolution as strong confounds, and they propose a self-distillation method that trains a model on its own text-mode reasoning traces paired with image inputs. Together, the results give a comprehensive picture of the modality gap and a practical path toward improving visual text understanding in MLLMs.
Key Points
- ▸ The modality gap in MLLMs is task- and data-dependent.
- ▸ Rendering choices such as font and resolution are strong confounds (see the rendering sweep sketched after this list).
- ▸ A self-distillation method can improve visual text understanding in MLLMs.
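One way to probe the rendering confound the paper reports (font alone swinging accuracy by up to 47 percentage points) is to render the same prompt under several fonts and scales and score each condition separately. The sketch below shows the rendering sweep only; the font file names are placeholders, and the scale factor stands in for resolution.

```python
# Hedged sketch of a font/resolution sweep for probing rendering confounds.
from itertools import product
from PIL import Image, ImageDraw, ImageFont

FONT_PATHS = ["DejaVuSans.ttf", "DejaVuSerif.ttf"]  # placeholder font files
SCALES = [1.0, 1.5, 2.0]                            # proxy for resolution


def render(text: str, font_path: str, scale: float) -> Image.Image:
    """Render `text` under one (font, scale) condition."""
    font = ImageFont.truetype(font_path, size=int(18 * scale))
    img = Image.new("RGB", (int(640 * scale), int(160 * scale)), "white")
    ImageDraw.Draw(img).multiline_text((10, 10), text, fill="black", font=font)
    return img


def sweep(prompt: str) -> dict[tuple[str, float], Image.Image]:
    """One rendered image per (font, scale) condition for the same prompt."""
    return {(f, s): render(prompt, f, s) for f, s in product(FONT_PATHS, SCALES)}
```

Scoring each rendered condition against the text-mode baseline would expose how much of the modality gap is attributable to rendering rather than to the visual modality itself.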
Merits
Strength in Methodology
The study's comprehensive evaluation of seven MLLMs across seven benchmarks in five input modes provides a robust understanding of the modality gap.
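The evaluation design implies a full grid of (model, benchmark, input mode) cells, each scored independently. A minimal sketch of that grid follows; all model and benchmark names are placeholders, and `score_fn` is an assumed interface. The abstract names text mode, synthetic renderings, arXiv PDFs, and Wikipedia pages; the fifth mode label here is a guess.

```python
# Hedged sketch of the paper's evaluation grid: one accuracy per
# (model, benchmark, input-mode) cell. All names below are placeholders.
from itertools import product

MODELS = ["model_a", "model_b"]    # stand-ins for the seven MLLMs
BENCHMARKS = ["gsm8k", "wiki_qa"]  # stand-ins for the seven benchmarks
MODES = ["text", "synthetic_render", "arxiv_pdf", "wikipedia_page", "other"]


def evaluate_grid(score_fn) -> dict[tuple[str, str, str], float]:
    """score_fn(model, benchmark, mode) -> accuracy; an assumed interface."""
    return {cell: score_fn(*cell) for cell in product(MODELS, BENCHMARKS, MODES)}
```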
Practical Solution
The proposed self-distillation method offers a practical solution to improve visual text understanding in MLLMs.
Demerits
Limited Generalizability
The findings may not generalize beyond the seven models and seven benchmarks evaluated, which limits the study's broader implications.
Complexity of Proposed Method
The self-distillation method requires fine-tuning the model on its own reasoning traces, which may be computationally intensive and demand significant resources for large models.
Expert Commentary
This study is a significant contribution to the field of multimodal language models, providing a comprehensive understanding of the modality gap and a practical route to better visual text understanding. The grounded-theory error analysis, which shows that image mode selectively amplifies reading errors while leaving knowledge and reasoning errors largely unchanged, is particularly noteworthy. While the study's limitations are acknowledged, the proposed self-distillation method has the potential to substantially improve MLLM performance on visual text understanding tasks, and the findings may inform the design of multimodal models for specific tasks and applications.
Recommendations
- ✓ Future studies should investigate the generalizability of the study's findings to other MLLMs and tasks.
- ✓ Researchers should explore the application of the proposed self-distillation method to other multimodal language models and tasks.