Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

Jonathan Steinberg, Oren Gal

arXiv:2602.22918v1 — Abstract: Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
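
The abstract's core measurement — comparing a model's hidden states for an original image against a text-inpainted version, layer by layer — can be sketched as follows. This is a minimal illustration, not the authors' code: a real experiment would hook a VLM's hidden states, whereas here synthetic activations keep the script self-contained, and the injected perturbation at layer 16 simply mimics a DeepStack-style mid-depth bottleneck.

```python
import numpy as np

# Hedged sketch of the activation-difference analysis: compare hidden
# states for an original image vs. its text-inpainted version at every
# layer, and find the layer of peak OCR sensitivity. All shapes and
# names are illustrative assumptions; activations are synthetic.

rng = np.random.default_rng(0)
n_layers, n_tokens, d_model = 32, 64, 256

acts_original = rng.normal(size=(n_layers, n_tokens, d_model))
acts_inpainted = acts_original.copy()
# Mimic a DeepStack-style mid-depth bottleneck by perturbing layer 16.
acts_inpainted[16] += rng.normal(scale=2.0, size=(n_tokens, d_model))

# Per-layer OCR sensitivity: mean L2 norm of the activation difference.
diff = acts_original - acts_inpainted
sensitivity = np.linalg.norm(diff, axis=-1).mean(axis=-1)  # (n_layers,)

peak_layer = int(np.argmax(sensitivity))
print(f"peak OCR sensitivity at layer {peak_layer} "
      f"(~{100 * peak_layer // n_layers}% depth)")
# → peak OCR sensitivity at layer 16 (~50% depth)
```

In the paper's setup, a mid-depth peak like this is the signature reported for Qwen3-VL, while Phi-4 and InternVL3.5 would peak in the first quarter of layers.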

Executive Summary

This study investigates how optical character recognition (OCR) information is routed through vision-language models (VLMs). Using causal interventions and activation-difference analysis, the researchers identify architecture-specific OCR bottlenecks in three VLM families. Where the bottleneck sits depends on the integration strategy: single-stage projection models (Phi-4, InternVL3.5) peak at early layers (6-25% depth), while DeepStack models (Qwen3-VL) peak at mid-depth (around 50%). The OCR signal is also low-dimensional, with 72.9% of its variance captured by the first principal component, and PCA directions learned on one dataset transfer to others, pointing to shared text-processing pathways. Notably, removing the OCR signal can improve counting performance in some models, suggesting that OCR interferes with other visual processing in sufficiently modular architectures. These findings highlight the importance of understanding OCR routing for vision-language integration.
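
The "OCR removal" intervention mentioned in the summary can be pictured as projecting hidden states onto the hyperplane orthogonal to a learned OCR direction (in the paper, PC1 of the activation differences). The sketch below uses a synthetic direction and synthetic states; the function and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of OCR ablation: remove each hidden state's component
# along a unit-norm "OCR direction". Direction and states are synthetic.

rng = np.random.default_rng(2)
d_model = 256

ocr_dir = rng.normal(size=d_model)
ocr_dir /= np.linalg.norm(ocr_dir)

def ablate_direction(hidden, direction):
    """Remove each row's component along a unit-norm `direction`."""
    coeff = hidden @ direction                # (n,) projections
    return hidden - np.outer(coeff, direction)

# States that carry a strong component along the OCR direction.
hidden = rng.normal(size=(10, d_model)) + 5.0 * ocr_dir
ablated = ablate_direction(hidden, ocr_dir)

# After ablation, the OCR component is (numerically) zero.
residual = float(np.abs(ablated @ ocr_dir).max())
print(f"max residual OCR component: {residual:.2e}")
```

With an intervention of this shape, one would re-run downstream benchmarks (e.g., counting tasks) to measure effects like the reported +6.9 percentage-point improvement.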

Key Points

  • OCR routing mechanism varies across VLM architectures
  • Architecture-specific OCR bottlenecks identified in three VLM families
  • OCR signal is low-dimensional, with 72.9% of variance captured by the first principal component
  • PCA directions learned on one dataset transfer to others, indicating shared text-processing pathways
  • In sufficiently modular architectures (notably Qwen3-VL-4B), OCR removal can improve counting performance by up to 6.9 percentage points
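
The low-dimensionality claim can be checked by stacking per-sample activation differences and measuring how much variance the first principal component explains. The sketch below is synthetic by construction — the data is built with one dominant shared direction plus isotropic noise — so it only illustrates the method; the 72.9% figure in the paper comes from real VLM activations.

```python
import numpy as np

# Hedged sketch: PCA (via SVD) on a stack of activation-difference
# vectors, reporting the variance ratio of PC1. Data is synthetic.

rng = np.random.default_rng(1)
n_samples, d_model = 500, 256

# One shared "OCR direction", plus small isotropic noise per sample.
ocr_direction = rng.normal(size=d_model)
ocr_direction /= np.linalg.norm(ocr_direction)
coeffs = rng.normal(scale=8.0, size=(n_samples, 1))
diffs = coeffs * ocr_direction + rng.normal(scale=0.3,
                                            size=(n_samples, d_model))

# PCA via SVD on mean-centered data.
centered = diffs - diffs.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)   # variance ratio per component
pc1 = vt[0]                       # first principal direction

print(f"PC1 explains {100 * explained[0]:.1f}% of variance")
```

The cross-dataset transfer result corresponds to fitting `pc1` on one dataset's difference vectors and finding it still captures the OCR signal on another's.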

Merits

Insight into OCR routing mechanism

The study provides valuable insights into the OCR routing mechanism in VLMs, shedding light on the complex interactions between vision and language processing.

Architecture-specific OCR bottlenecks

The identification of architecture-specific OCR bottlenecks highlights the need for tailored approaches to OCR routing in different VLM architectures.

Demerits

Limited scope

The study focuses on a limited set of VLM architectures, which may limit the generalizability of the findings.

Lack of real-world applications

The study's findings are primarily theoretical, and it remains to be seen how they will translate to real-world applications of VLMs.

Expert Commentary

These findings matter because they localize where OCR information enters the language stream in VLMs, rather than treating the models as black boxes. The identification of architecture-specific bottlenecks suggests that interventions on OCR behavior should be tailored to each architecture's vision-language integration strategy, which is particularly relevant for applications where OCR accuracy is critical. However, the limited set of architectures studied and the absence of real-world evaluation are notable limitations. Future research should test whether the results generalize to a wider range of VLM architectures and whether the identified OCR directions are useful in practical deployments.

Recommendations

  • Future research should aim to generalize the findings to a wider range of VLM architectures.
  • Researchers should explore real-world applications of the study's findings to better understand the practical implications of OCR routing in VLMs.
