Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
arXiv:2603.09095v1 Announce Type: new

Abstract: Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
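To make the core recipe concrete, here is a minimal sketch of the self-distillation idea the abstract describes: collect the model's own text-mode reasoning traces, pair each trace with an image rendering of the same question, and use the pairs for supervised fine-tuning. This is an illustration under assumptions, not the authors' code; in particular, `model.generate_text` is an assumed interface, and the rendering helper is deliberately simplified.

```python
# Hedged sketch of the paper's self-distillation recipe: pair the model's own
# pure-text reasoning traces with image renderings of the same questions.
from dataclasses import dataclass
from PIL import Image, ImageDraw, ImageFont


@dataclass
class DistillExample:
    image: Image.Image  # the question rendered as pixels
    target: str         # the model's own text-mode reasoning trace


def render_text(text: str, width: int = 768) -> Image.Image:
    """Render a question string onto a white canvas (simplified: no word wrap)."""
    font = ImageFont.load_default()
    height = 32 + 24 * (text.count("\n") + 1)
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).multiline_text((16, 16), text, fill="black", font=font)
    return img


def build_distillation_set(model, questions: list[str]) -> list[DistillExample]:
    examples = []
    for q in questions:
        # 1) Query the model in pure text mode to obtain its reasoning trace.
        #    `generate_text` is an assumed interface, not a real library API.
        trace = model.generate_text(q)
        # 2) Pair the *image* rendering of the question with that text trace.
        examples.append(DistillExample(image=render_text(q), target=trace))
    return examples
```

The resulting (image, trace) pairs would then feed a standard vision-language fine-tuning loop; the paper reports this raising image-mode GSM8K accuracy from 30.71% to 92.72%.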
Executive Summary
This study systematically investigates the "modality gap" in multimodal large language models (MLLMs): models perform worse when processing text rendered as images than when the same content is provided as textual tokens. The authors evaluate seven MLLMs across seven benchmarks in five input modes and find that the gap is task- and data-dependent. They identify rendering choices such as font and resolution as strong confounds, and they propose a self-distillation method that trains a model on its own text-mode reasoning traces paired with image inputs. Together, the results give a comprehensive picture of the modality gap and a practical path toward improving visual text understanding in MLLMs.
Key Points
- ▸ The modality gap in MLLMs is task- and data-dependent.
- ▸ Rendering choices such as font and resolution are strong confounds (see the rendering sweep sketched after this list).
- ▸ A self-distillation method can improve visual text understanding in MLLMs.
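One way to probe the rendering confound the paper reports (font alone swinging accuracy by up to 47 percentage points) is to render the same prompt under several fonts and scales and score each condition separately. The sketch below shows the rendering sweep only; the font file names are placeholders, and the scale factor stands in for resolution.

```python
# Hedged sketch of a font/resolution sweep for probing rendering confounds.
from itertools import product
from PIL import Image, ImageDraw, ImageFont

FONT_PATHS = ["DejaVuSans.ttf", "DejaVuSerif.ttf"]  # placeholder font files
SCALES = [1.0, 1.5, 2.0]                            # proxy for resolution


def render(text: str, font_path: str, scale: float) -> Image.Image:
    """Render `text` under one (font, scale) condition."""
    font = ImageFont.truetype(font_path, size=int(18 * scale))
    img = Image.new("RGB", (int(640 * scale), int(160 * scale)), "white")
    ImageDraw.Draw(img).multiline_text((10, 10), text, fill="black", font=font)
    return img


def sweep(prompt: str) -> dict[tuple[str, float], Image.Image]:
    """One rendered image per (font, scale) condition for the same prompt."""
    return {(f, s): render(prompt, f, s) for f, s in product(FONT_PATHS, SCALES)}
```

Scoring each rendered condition against the text-mode baseline would expose how much of the modality gap is attributable to rendering rather than to the visual modality itself.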
Merits
Strength in Methodology
The study's comprehensive evaluation of seven MLLMs across seven benchmarks in five input modes provides a robust understanding of the modality gap.
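The evaluation design implies a full grid of (model, benchmark, input mode) cells, each scored independently. A minimal sketch of that grid follows; all model and benchmark names are placeholders, and `score_fn` is an assumed interface. The abstract names text mode, synthetic renderings, arXiv PDFs, and Wikipedia pages; the fifth mode label here is a guess.

```python
# Hedged sketch of the paper's evaluation grid: one accuracy per
# (model, benchmark, input-mode) cell. All names below are placeholders.
from itertools import product

MODELS = ["model_a", "model_b"]    # stand-ins for the seven MLLMs
BENCHMARKS = ["gsm8k", "wiki_qa"]  # stand-ins for the seven benchmarks
MODES = ["text", "synthetic_render", "arxiv_pdf", "wikipedia_page", "other"]


def evaluate_grid(score_fn) -> dict[tuple[str, str, str], float]:
    """score_fn(model, benchmark, mode) -> accuracy; an assumed interface."""
    return {cell: score_fn(*cell) for cell in product(MODELS, BENCHMARKS, MODES)}
```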
Practical Solution
The proposed self-distillation method offers a practical solution to improve visual text understanding in MLLMs.
Demerits
Limited Generalizability
The findings may not generalize beyond the seven models and seven benchmarks evaluated, which limits the study's broader implications.
Complexity of Proposed Method
The self-distillation method requires fine-tuning the model on its own reasoning traces, which may be computationally intensive and demand significant resources for large models.
Expert Commentary
This study is a significant contribution to the field of multimodal language models, providing a comprehensive understanding of the modality gap and a practical route to better visual text understanding. The grounded-theory error analysis, which shows that image mode selectively amplifies reading errors while leaving knowledge and reasoning errors largely unchanged, is particularly noteworthy. While the study's limitations are acknowledged, the proposed self-distillation method has the potential to substantially improve MLLM performance on visual text understanding tasks, and the findings may inform the design of multimodal models for specific tasks and applications.
Recommendations
- ✓ Future studies should investigate the generalizability of the study's findings to other MLLMs and tasks.
- ✓ Researchers should explore the application of the proposed self-distillation method to other multimodal language models and tasks.