
OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets


arXiv:2603.02789v1. Abstract: Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline, while simpler, can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine and explore failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve comparable performance to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schema, exemplars, and instructions can further enhance MLLM performance. We hope this work can offer practical guidance and valuable insight for advancing document information extraction.

Executive Summary

This paper presents a comprehensive benchmarking study on document information extraction in the era of multimodal large language models (MLLMs). The authors investigate whether MLLM-only pipelines can replace traditional OCR+MLLM setups without compromising performance. Using large-scale real-world datasets, they evaluate out-of-the-box MLLMs and introduce an automated hierarchical error analysis framework leveraging LLMs to diagnose error patterns. The findings indicate that image-only input with powerful MLLMs can achieve comparable performance to OCR-enhanced methods, suggesting OCR may be unnecessary in certain contexts. Additionally, the study highlights the effectiveness of tailored schema, exemplars, and instructions in boosting MLLM performance. These results offer practical insights for optimizing document information extraction workflows.

Key Points

  • MLLM-only pipelines can match OCR+MLLM performance in document information extraction
  • Image-only input achieves comparable results to OCR-enhanced setups
  • Custom schema, exemplars, and instructions enhance MLLM effectiveness

Merits

Innovative Benchmarking

The large-scale evaluation using real-world datasets provides empirical credibility to the findings.

Error Analysis Framework

The automated hierarchical error analysis using LLMs is a novel and scalable approach to diagnosing performance issues.

Demerits

Generalizability Concerns

Results may not extend uniformly to all document types or industry-specific formats beyond the tested business documents.

Assumption Dependency

Findings hinge on the assumption that MLLMs are sufficiently advanced to handle image-only inputs without OCR, which may vary in real-world deployments.

Expert Commentary

This work represents a pivotal shift in the discourse around document information extraction. The empirical validation that MLLMs can replace OCR in certain contexts challenges conventional wisdom and opens new avenues for efficiency and cost reduction. The introduction of an automated error analysis framework via LLMs is particularly noteworthy—it demonstrates a sophisticated application of AI to diagnose AI itself. However, the study’s limitations must be acknowledged: while the results are compelling for business documents, scalability to legal, medical, or archival documents remains unaddressed. Moreover, the dependency on the current state of MLLM capabilities raises questions about temporal relevance—as MLLM performance evolves, so too may the need for ancillary OCR support. This paper should be read as a catalyst for reevaluation rather than a definitive endpoint. It invites further research into hybrid architectures, contextual adaptability, and longitudinal performance tracking across diverse document ecosystems.

Recommendations

  1. Evaluate the feasibility of MLLM-only pipelines for internal document-extraction workflows using advanced MLLMs.
  2. Invest in schema design and instruction refinement to maximize MLLM performance without OCR.
  3. Conduct comparative studies on OCR necessity in document domains beyond business contexts to determine how broadly the findings apply.
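Recommendation 2 can be made concrete with a prompt template that combines the three levers the paper reports as helpful: an explicit schema, worked exemplars, and precise instructions. The schema fields and the exemplar below are invented for illustration, not taken from the paper's benchmark.

```python
import json

# Hypothetical target schema for an invoice-extraction task.
SCHEMA = {"invoice_number": "string", "total": "number", "date": "YYYY-MM-DD"}

# One invented few-shot exemplar; real exemplars would come from labeled data.
EXEMPLARS = [
    {
        "document": "Invoice INV-001, total $42.50, issued 2024-03-01",
        "output": {"invoice_number": "INV-001", "total": 42.50, "date": "2024-03-01"},
    }
]

def build_prompt(schema: dict, exemplars: list[dict]) -> str:
    """Assemble an extraction prompt from instructions, schema, and exemplars."""
    parts = [
        "You are an information-extraction assistant.",
        "Return ONLY valid JSON conforming to this schema:",
        json.dumps(schema, indent=2),
    ]
    for ex in exemplars:  # few-shot exemplars anchor the output format
        parts.append(f"Example input: {ex['document']}")
        parts.append(f"Example output: {json.dumps(ex['output'])}")
    parts.append("Now extract the fields from the attached document image.")
    return "\n\n".join(parts)

prompt = build_prompt(SCHEMA, EXEMPLARS)
```

The point of the sketch is that these levers are cheap to iterate on: tightening the schema or swapping an exemplar requires no retraining, which is why prompt design is the first place to invest before adding OCR back into the pipeline.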
