Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
arXiv:2602.14162v1 Announce Type: new Abstract: Existing multimodal document question answering methods universally adopt a supply-side ingestion strategy: running a Vision-Language Model (VLM) on every page during indexing to generate comprehensive descriptions, then answering questions through text retrieval. However, this "pre-ingestion" approach is costly (a 113-page engineering drawing package requires approximately 80,000 VLM tokens), end-to-end unreliable (VLM outputs may fail to be correctly retrieved due to format mismatches in the retrieval infrastructure), and irrecoverable once it fails. This paper proposes the Deferred Visual Ingestion (DVI) framework, adopting a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, deferring visual understanding to the moment users pose specific questions. DVI's core principle is "Index for locating, not understanding"--achieving page localization through structured metadata indexes and BM25 full-text search, then sending original images along with specific questions to a VLM for targeted analysis. Experiments on two real industrial engineering drawings (113 pages + 7 pages) demonstrate that DVI achieves comparable overall accuracy at zero ingestion VLM cost (46.7% vs. 48.9%), an effectiveness rate of 50% on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression). DVI also supports interactive refinement and progressive caching, transforming the "QA accuracy" problem into a "page localization" problem--once the correct drawing page is found, obtaining the answer becomes a matter of interaction rounds.
Executive Summary
The article 'Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering' introduces the Deferred Visual Ingestion (DVI) framework, a novel approach to multimodal document question answering. Unlike traditional methods that pre-process every page with a Vision-Language Model (VLM) during indexing, DVI defers visual understanding until a specific question is posed. This demand-side strategy focuses on lightweight metadata extraction during indexing, significantly reducing computational costs and improving reliability. The study demonstrates that DVI achieves comparable accuracy to pre-ingestion methods while eliminating ingestion costs and enhancing effectiveness for visually necessary queries.
Key Points
- ▸ DVI adopts a demand-side ingestion strategy, deferring visual understanding until specific questions are asked.
- ▸ The framework incurs zero VLM cost at ingestion while reporting comparable overall accuracy (46.7% vs. 48.9%).
- ▸ DVI demonstrates a 50% effectiveness rate on visually necessary queries, compared to 0% for pre-ingestion methods.
- ▸ The approach supports interactive refinement and progressive caching, transforming the QA problem into a page localization problem.
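The localization step behind these points, "Index for locating, not understanding", can be illustrated with a small BM25 ranker over per-page metadata. This is a minimal sketch, not the paper's implementation: the metadata strings, the `bm25_rank` function, and the parameter defaults `k1=1.5` and `b=0.75` are all assumptions chosen for illustration. In a DVI-style pipeline, only the top-ranked page images would then be sent to a VLM together with the question.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(pages, query, k1=1.5, b=0.75):
    """Rank page ids by the BM25 score of their metadata against the query.

    pages: dict mapping page id -> metadata string (hypothetical schema).
    Returns a list of (page_id, score) sorted by descending score.
    """
    docs = {pid: tokenize(text) for pid, text in pages.items()}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: number of pages that contain each term.
    df = Counter(term for toks in docs.values() for term in set(toks))

    scores = {}
    for pid, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[pid] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical lightweight metadata, e.g. title-block text extracted without a VLM.
pages = {
    1: "cover sheet drawing index revision history",
    2: "pump station piping and instrumentation diagram P-101",
    3: "electrical single line diagram motor control center",
}

ranked = bm25_rank(pages, "piping diagram for pump P-101")
top_page = ranked[0][0]  # page whose image would be passed to the VLM
```

The sketch assumes every page has non-empty metadata; a production index would also need to handle empty pages and the structured metadata fields that the abstract mentions alongside BM25.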
Merits
Cost Efficiency
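DVI removes the VLM pass over every page at indexing time. The abstract puts the cost of that pass at approximately 80,000 VLM tokens for a 113-page engineering drawing package, whereas DVI's indexing phase performs only lightweight metadata extraction.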
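The figures quoted in the abstract allow a rough back-of-the-envelope check. The sketch below uses only those numbers; applying the 98% search space compression to the 113-page package alone is an assumption, since the abstract does not say how the figure was computed across the two documents.

```python
# Figures quoted in the abstract.
ingestion_tokens = 80_000  # approximate VLM tokens to pre-ingest the package
total_pages = 113          # pages in the larger drawing package
compression = 0.98         # reported search space compression

# Average pre-ingestion cost per page.
tokens_per_page = ingestion_tokens / total_pages

# Assumption: the compression figure applies to this package on its own.
candidate_pages = total_pages * (1 - compression)

print(f"pre-ingestion cost per page: ~{tokens_per_page:.0f} VLM tokens")
print(f"pages left after localization: ~{candidate_pages:.1f} of {total_pages}")
```

Under these assumptions, pre-ingestion spends on the order of 700 VLM tokens per page up front, while localization narrows each query to two or three candidate pages.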
Improved Reliability
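Pre-ingestion can fail end to end when VLM-generated descriptions are not correctly retrieved because of format mismatches in the retrieval infrastructure, and such failures are irrecoverable. Because DVI sends the original page images to the VLM at query time, it does not depend on retrieving pre-generated descriptions and so avoids this failure mode.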
Enhanced Accuracy for Visual Queries
DVI achieves a 50% effectiveness rate on visually necessary queries, addressing a critical limitation of pre-ingestion methods.
Demerits
Dependency on Metadata Quality
The effectiveness of DVI relies heavily on the quality and comprehensiveness of the metadata extracted during indexing. Poor metadata could lead to inaccurate page localization.
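One way to quantify this dependency is to measure how often the correct page appears among the top-k pages returned by the index. The following is a minimal sketch with made-up query ids and rankings; `localization_hit_rate` is a hypothetical helper, not something described in the paper.

```python
def localization_hit_rate(ranked_pages, gold_pages, k=3):
    """Fraction of queries whose gold page appears in the top-k ranked pages.

    ranked_pages: dict mapping query id -> list of page ids, best first.
    gold_pages: dict mapping query id -> the page id that holds the answer.
    """
    hits = sum(
        1
        for qid, gold in gold_pages.items()
        if gold in ranked_pages.get(qid, [])[:k]
    )
    return hits / len(gold_pages)


# Made-up example data: three queries with their ranked candidate pages.
ranked_pages = {"q1": [2, 5, 9], "q2": [7, 1, 4], "q3": [3, 8, 6]}
gold_pages = {"q1": 2, "q2": 4, "q3": 10}

rate_at_3 = localization_hit_rate(ranked_pages, gold_pages, k=3)
rate_at_1 = localization_hit_rate(ranked_pages, gold_pages, k=1)
```

Tracking this rate while varying which metadata fields are indexed would show how sensitive localization is to metadata quality.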
Interactive Refinement Requirements
The need for interactive refinement may increase the time and effort required to obtain accurate answers, potentially limiting its applicability in time-sensitive scenarios.
Scalability Concerns
While DVI reduces ingestion costs, the on-demand processing of visual content may pose scalability challenges in environments with high query volumes.
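The abstract mentions progressive caching, which could offset part of this query-time cost. It does not describe the mechanism, so the sketch below is only one plausible reading: answers are memoized per page and normalized question, and the VLM is called only on a cache miss. The `CachedPageQA` class and the `fake_vlm` stand-in are hypothetical.

```python
from typing import Callable, Dict, Tuple


class CachedPageQA:
    """Memoize answers per (page id, normalized question)."""

    def __init__(self, ask_vlm: Callable[[int, str], str]):
        self._ask_vlm = ask_vlm
        self._cache: Dict[Tuple[int, str], str] = {}
        self.vlm_calls = 0  # counts real VLM invocations (cache misses)

    @staticmethod
    def _normalize(question: str) -> str:
        # Collapse case and whitespace so trivially different phrasings match.
        return " ".join(question.lower().split())

    def answer(self, page_id: int, question: str) -> str:
        key = (page_id, self._normalize(question))
        if key not in self._cache:
            self.vlm_calls += 1
            self._cache[key] = self._ask_vlm(page_id, question)
        return self._cache[key]


def fake_vlm(page_id: int, question: str) -> str:
    # Stand-in for a real VLM call on the page image.
    return f"answer for page {page_id}"


qa = CachedPageQA(fake_vlm)
first = qa.answer(2, "What is the pipe diameter?")
second = qa.answer(2, "what is the  pipe diameter?")  # served from the cache
```

With this design, frequently asked pages gradually accumulate cached answers, so repeat queries cost no VLM tokens, while pages that nobody asks about are never processed at all.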
Expert Commentary
The Deferred Visual Ingestion (DVI) framework is a notable contribution to multimodal document question answering. By shifting from a supply-side to a demand-side ingestion strategy, DVI addresses two limitations of pre-ingestion methods: the VLM cost of describing every page up front and the risk that generated descriptions are never correctly retrieved. Its ability to report comparable overall accuracy (46.7% vs. 48.9%) with zero ingestion VLM cost is particularly noteworthy. However, the success of DVI hinges on the quality of metadata extraction, which must be comprehensive and accurate to ensure effective page localization. The requirement for interactive refinement may also pose challenges in time-sensitive applications. The evidence so far comes from two industrial engineering drawing packages (113 pages and 7 pages), so how well the approach scales to larger collections, higher query volumes, and other document types remains to be shown. Despite these limitations, recasting the QA problem as a page localization problem offers a promising avenue for future research and practical implementation, particularly in industries that handle large volumes of visual documents.
Recommendations
- ✓ Further research should focus on improving metadata extraction techniques to enhance the accuracy and reliability of page localization in DVI.
- ✓ Practical implementations of DVI should include robust interactive refinement mechanisms to support iterative querying and result refinement, ensuring a seamless user experience.