Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

Jonathan Steinberg, Oren Gal

arXiv:2602.22918v1 — Abstract: Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
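
The abstract's core measurement — comparing a model's hidden states for an original image against a text-inpainted version, layer by layer — can be sketched as follows. This is a minimal illustration, not the authors' code: a real experiment would hook a VLM's hidden states, whereas here synthetic activations keep the script self-contained, and the injected perturbation at layer 16 simply mimics a DeepStack-style mid-depth bottleneck.

```python
import numpy as np

# Hedged sketch of the activation-difference analysis: compare hidden
# states for an original image vs. its text-inpainted version at every
# layer, and find the layer of peak OCR sensitivity. All shapes and
# names are illustrative assumptions; activations are synthetic.

rng = np.random.default_rng(0)
n_layers, n_tokens, d_model = 32, 64, 256

acts_original = rng.normal(size=(n_layers, n_tokens, d_model))
acts_inpainted = acts_original.copy()
# Mimic a DeepStack-style mid-depth bottleneck by perturbing layer 16.
acts_inpainted[16] += rng.normal(scale=2.0, size=(n_tokens, d_model))

# Per-layer OCR sensitivity: mean L2 norm of the activation difference.
diff = acts_original - acts_inpainted
sensitivity = np.linalg.norm(diff, axis=-1).mean(axis=-1)  # (n_layers,)

peak_layer = int(np.argmax(sensitivity))
print(f"peak OCR sensitivity at layer {peak_layer} "
      f"(~{100 * peak_layer // n_layers}% depth)")
# → peak OCR sensitivity at layer 16 (~50% depth)
```

In the paper's setup, a mid-depth peak like this is the signature reported for Qwen3-VL, while Phi-4 and InternVL3.5 would peak in the first quarter of layers.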

Executive Summary

This study investigates how optical character recognition (OCR) information is routed through vision-language models (VLMs). Using causal interventions and activation-difference analysis, the researchers identify architecture-specific OCR bottlenecks in three VLM families. Where the bottleneck sits depends on the integration strategy: single-stage projection models (Phi-4, InternVL3.5) peak at early layers (6-25% depth), while DeepStack models (Qwen3-VL) peak at mid-depth (around 50%). The OCR signal is also low-dimensional, with 72.9% of its variance captured by the first principal component, and PCA directions learned on one dataset transfer to others, pointing to shared text-processing pathways. Notably, removing the OCR signal can improve counting performance in some models, suggesting that OCR interferes with other visual processing in sufficiently modular architectures. These findings highlight the importance of understanding OCR routing for vision-language integration.
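
The "OCR removal" intervention mentioned in the summary can be pictured as projecting hidden states onto the hyperplane orthogonal to a learned OCR direction (in the paper, PC1 of the activation differences). The sketch below uses a synthetic direction and synthetic states; the function and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of OCR ablation: remove each hidden state's component
# along a unit-norm "OCR direction". Direction and states are synthetic.

rng = np.random.default_rng(2)
d_model = 256

ocr_dir = rng.normal(size=d_model)
ocr_dir /= np.linalg.norm(ocr_dir)

def ablate_direction(hidden, direction):
    """Remove each row's component along a unit-norm `direction`."""
    coeff = hidden @ direction                # (n,) projections
    return hidden - np.outer(coeff, direction)

# States that carry a strong component along the OCR direction.
hidden = rng.normal(size=(10, d_model)) + 5.0 * ocr_dir
ablated = ablate_direction(hidden, ocr_dir)

# After ablation, the OCR component is (numerically) zero.
residual = float(np.abs(ablated @ ocr_dir).max())
print(f"max residual OCR component: {residual:.2e}")
```

With an intervention of this shape, one would re-run downstream benchmarks (e.g., counting tasks) to measure effects like the reported +6.9 percentage-point improvement.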

Key Points

  • OCR routing mechanism varies across VLM architectures
  • Architecture-specific OCR bottlenecks identified in three VLM families
  • OCR signal is low-dimensional, with 72.9% of variance captured by the first principal component
  • PCA directions learned on one dataset transfer to others, indicating shared text-processing pathways
  • In sufficiently modular architectures (notably Qwen3-VL-4B), OCR removal can improve counting performance by up to 6.9 percentage points
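
The low-dimensionality claim can be checked by stacking per-sample activation differences and measuring how much variance the first principal component explains. The sketch below is synthetic by construction — the data is built with one dominant shared direction plus isotropic noise — so it only illustrates the method; the 72.9% figure in the paper comes from real VLM activations.

```python
import numpy as np

# Hedged sketch: PCA (via SVD) on a stack of activation-difference
# vectors, reporting the variance ratio of PC1. Data is synthetic.

rng = np.random.default_rng(1)
n_samples, d_model = 500, 256

# One shared "OCR direction", plus small isotropic noise per sample.
ocr_direction = rng.normal(size=d_model)
ocr_direction /= np.linalg.norm(ocr_direction)
coeffs = rng.normal(scale=8.0, size=(n_samples, 1))
diffs = coeffs * ocr_direction + rng.normal(scale=0.3,
                                            size=(n_samples, d_model))

# PCA via SVD on mean-centered data.
centered = diffs - diffs.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)   # variance ratio per component
pc1 = vt[0]                       # first principal direction

print(f"PC1 explains {100 * explained[0]:.1f}% of variance")
```

The cross-dataset transfer result corresponds to fitting `pc1` on one dataset's difference vectors and finding it still captures the OCR signal on another's.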

Merits

Insight into OCR routing mechanism

The study provides valuable insights into the OCR routing mechanism in VLMs, shedding light on the complex interactions between vision and language processing.

Architecture-specific OCR bottlenecks

The identification of architecture-specific OCR bottlenecks highlights the need for tailored approaches to OCR routing in different VLM architectures.

Demerits

Limited scope

The study focuses on a limited set of VLM architectures, which may limit the generalizability of the findings.

Lack of real-world applications

The study's findings are primarily theoretical, and it remains to be seen how they will translate to real-world applications of VLMs.

Expert Commentary

These findings matter because they localize where OCR information enters the language stream in VLMs, rather than treating the models as black boxes. The identification of architecture-specific bottlenecks suggests that interventions on OCR behavior should be tailored to each architecture's vision-language integration strategy, which is particularly relevant for applications where OCR accuracy is critical. However, the limited set of architectures studied and the absence of real-world evaluation are notable limitations. Future research should test whether the results generalize to a wider range of VLM architectures and whether the identified OCR directions are useful in practical deployments.

Recommendations

  • Future research should aim to generalize the findings to a wider range of VLM architectures.
  • Researchers should explore real-world applications of the study's findings to better understand the practical implications of OCR routing in VLMs.
