IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering

arXiv:2602.17687v1 Announce Type: cross Abstract: AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieved 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%, 78%, and 93%, respectively. The two modalities exhibit complementary failures, enabling multimodal hybrid search to outperform either alone, achieving 49% Recall@1, 81% Recall@5, and 95% Recall@20. We further evaluate efficiency-performance tradeoffs with MUVERA and assess multiple multi-vector image embedding models. Among closed-source models, Cohere Embed v4 page image embeddings outperform Voyage 3 Large text embeddings and all tested open-source models, achieving 58% Recall@1, 87% Recall@5, and 97% Recall@20. For question answering, text-based RAG systems achieved higher ground-truth alignment than image-based systems (0.82 vs. 0.71), and both benefit substantially from increased retrieval depth, with multi-document retrieval outperforming oracle single-document retrieval. We analyze the complementary limitations of unimodal text and image representations and identify question types that require one modality over the other. The IRPAPERS dataset and all experimental code are publicly available.
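The abstract reports retrieval quality as Recall@1, Recall@5, and Recall@20 over needle-in-the-haystack questions, each of which targets a single gold page. A minimal sketch of how such metrics are typically computed is below; the function names and toy data are illustrative, not taken from the paper's code.

```python
def recall_at_k(ranked_page_ids, gold_page_id, k):
    """1 if the gold page appears in the top k retrieved pages, else 0."""
    return int(gold_page_id in ranked_page_ids[:k])

def mean_recall_at_k(results, k):
    """Average Recall@k over a set of single-gold-page questions.

    `results` maps each question id to (ranked_page_ids, gold_page_id).
    """
    hits = [recall_at_k(ranked, gold, k) for ranked, gold in results.values()]
    return sum(hits) / len(hits)

# Hypothetical toy example: two questions with gold pages "p3" and "p9".
toy = {
    "q1": (["p3", "p1", "p2"], "p3"),  # hit at rank 1
    "q2": (["p4", "p9", "p5"], "p9"),  # hit at rank 2
}
print(mean_recall_at_k(toy, 1))  # 0.5
print(mean_recall_at_k(toy, 5))  # 1.0
```

With a single gold page per question, mean Recall@k reduces to the fraction of questions answered within the top k, which is how the paper's percentages (e.g., 46% Recall@1) read.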

Executive Summary

The article 'IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering' introduces a benchmark of 3,230 pages from 166 scientific papers, each available as both a page image and an OCR transcription. The study compares image-based and text-based retrieval and question-answering systems: the two modalities retrieve pages with comparable accuracy but fail on different questions, so a hybrid approach that combines both outperforms either alone, while in question answering text-based RAG aligns better with ground truth than image-based systems (0.82 vs. 0.71). The dataset and experimental code are made publicly available, facilitating further research in multimodal document processing.

Key Points

  • Introduction of the IRPAPERS benchmark dataset for evaluating multimodal document processing.
  • Comparison of image-based and text-based retrieval and question-answering systems.
  • Hybrid multimodal systems outperform unimodal systems in retrieval tasks.
  • Closed-source models like Cohere Embed v4 show superior performance in image-based retrieval.
  • Public availability of the dataset and experimental code for further research.
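The abstract does not specify how the text and image rankings are combined into the multimodal hybrid search. One standard rank-fusion technique is reciprocal rank fusion (RRF), sketched below as an assumption; the page ids and rankings are invented for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of page ids with Reciprocal Rank Fusion.

    Each page scores 1 / (k + rank) per list it appears in; k dampens
    the influence of top ranks (60 is the commonly used default).
    """
    scores = {}
    for ranked in rankings:
        for rank, page_id in enumerate(ranked, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a text retriever and an image retriever.
text_ranked = ["p2", "p7", "p1"]
image_ranked = ["p2", "p9", "p7"]
fused = reciprocal_rank_fusion([text_ranked, image_ranked])
print(fused)  # ['p2', 'p7', 'p9', 'p1']
```

Because RRF rewards pages that rank well in either list, it is a natural fit for the paper's observation that the two modalities exhibit complementary failures.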

Merits

Comprehensive Benchmark

The IRPAPERS dataset provides a robust and diverse benchmark for evaluating multimodal document processing, covering a wide range of scientific papers.

Performance Comparison

The study offers a detailed comparison between image-based and text-based systems, highlighting the strengths and weaknesses of each modality.

Public Availability

The dataset and experimental code are made publicly available, fostering transparency and encouraging further research in the field.

Demerits

Limited Dataset Scope

The dataset is limited to scientific papers, which may not fully represent the diversity of document types in real-world applications.

Performance Gaps

While hybrid retrieval shows promise, notable gaps remain elsewhere: image-based question answering trails text-based RAG in ground-truth alignment (0.71 vs. 0.82), and the tested open-source image embedding models fall short of the best closed-source model, indicating areas for further improvement.

Dependency on Closed-Source Models

The study highlights the superior performance of closed-source models, which may limit accessibility and reproducibility for researchers without access to these models.

Expert Commentary

The introduction of the IRPAPERS benchmark dataset represents a significant step forward in the field of multimodal document processing. By providing a diverse set of scientific papers in both image and text formats, the dataset enables rigorous evaluation of retrieval and question-answering systems. The study's findings highlight the complementary nature of image-based and text-based systems, demonstrating that a hybrid approach can outperform unimodal systems. This is particularly noteworthy given current trends in AI research, which increasingly emphasize the integration of multiple modalities to enhance system performance.

However, the study also reveals important limitations, such as the gap between image-based and text-based question answering and the reliance on closed-source models for the strongest retrieval results. These limitations underscore the need for continued research and development in this area. The public availability of the dataset and experimental code is a commendable aspect of the study, as it promotes transparency and encourages further research. Overall, the study provides valuable insights into the capabilities and limitations of current multimodal document processing systems and sets the stage for future advancements in this critical field.

Recommendations

  • Expand the IRPAPERS dataset to include a more diverse range of document types beyond scientific papers to better represent real-world applications.
  • Investigate the development of open-source multimodal models that can achieve performance levels comparable to closed-source models, thereby increasing accessibility and reproducibility.
