
Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering


Amine Kobeissi, Philippe Langlais

arXiv:2602.17981v1 · Announce Type: new

Abstract: Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high-stakes settings. We study a frequent failure mode in which the correct document is retrieved but the page or chunk that contains the answer is missed, leading the generator to extrapolate from incomplete context. Despite its practical significance, this within-document retrieval failure mode has received limited systematic attention in the Financial Question Answering (QA) literature. We evaluate retrieval at multiple levels of granularity (document, page, and chunk) and introduce an oracle-based analysis to provide empirical upper bounds on retrieval and generative performance. On a 150-question subset of FinanceBench, we reproduce and compare diverse retrieval strategies including dense, sparse, hybrid, and hierarchical methods with reranking and query reformulation. Across methods, gains in document discovery tend to translate into stronger page recall, yet oracle performance still suggests headroom for page- and chunk-level retrieval. To target this gap, we introduce a domain-fine-tuned page scorer that treats pages as an intermediate retrieval unit between documents and chunks. Unlike prior passage-based hierarchical retrieval, we fine-tune a bi-encoder specifically for page-level relevance on financial filings, exploiting the semantic coherence of pages. Overall, our results demonstrate a significant improvement in page recall and chunk retrieval.
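The abstract mentions hybrid retrieval that combines dense and sparse signals. The paper does not specify its fusion method, but a common way to combine such rankings is reciprocal rank fusion (RRF); the sketch below (with made-up chunk IDs) illustrates the general idea, not the authors' exact pipeline:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one hybrid ranking.

    Each ranking is a list of IDs ordered best-first; `k` dampens the
    influence of top ranks (60 is the value from the original RRF paper).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c3", "c1", "c7"]   # e.g. bi-encoder nearest neighbours
sparse = ["c1", "c9", "c3"]  # e.g. BM25 hits
fused = reciprocal_rank_fusion([dense, sparse])
print(fused[0])  # "c1": it appears near the top of both lists
```

Items ranked well by both retrievers rise above items favoured by only one, which is why hybrid methods often improve document discovery.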

Executive Summary

This study examines a critical limitation of Retrieval-Augmented Generation (RAG) for Financial Question Answering (QA): the within-document retrieval failure mode, in which the correct document is retrieved but the relevant page or chunk is missed. The authors evaluate retrieval strategies at three levels of granularity (document, page, and chunk), use an oracle-based analysis to establish empirical upper bounds on retrieval and generative performance, and introduce a domain-fine-tuned page scorer that treats pages as an intermediate retrieval unit. Their results show significant improvements in page recall and chunk retrieval. The study contributes to the Financial QA literature by drawing systematic attention to a failure mode that matters in high-stakes settings.

Key Points

  • The study identifies a frequent failure mode in RAG for Financial QA: within-document retrieval failure
  • The researchers evaluate retrieval strategies at multiple levels of granularity (document, page, and chunk)
  • A domain-fine-tuned page scorer is introduced to improve page recall and chunk retrieval
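The page-as-intermediate-unit idea in the points above can be sketched as a two-stage retrieval loop: score pages first, then rank chunks only within the best pages. The toy bag-of-words "embedding" below stands in for the paper's fine-tuned bi-encoder, and the corpus and queries are invented for illustration:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real system would use a bi-encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hierarchical_retrieve(query, corpus, top_pages=2, top_chunks=2):
    """Score pages first, then rank chunks only within the best pages.

    `corpus` maps page_id -> list of chunk strings; a page is scored on
    the concatenation of its chunks, exploiting page-level coherence.
    """
    q = embed(query)
    page_scores = {pid: cosine(q, embed(" ".join(chunks)))
                   for pid, chunks in corpus.items()}
    best_pages = sorted(page_scores, key=page_scores.get, reverse=True)[:top_pages]
    candidates = [(pid, chunk) for pid in best_pages for chunk in corpus[pid]]
    candidates.sort(key=lambda pc: cosine(q, embed(pc[1])), reverse=True)
    return candidates[:top_chunks]

corpus = {
    "p1": ["total revenue was 4.2 billion", "segment revenue grew"],
    "p2": ["board of directors met", "no revenue discussion"],
}
top = hierarchical_retrieve("what was total revenue", corpus, top_pages=1)
print(top[0])  # the revenue chunk from page "p1"
```

Restricting the chunk search to the top pages is what lets a strong page scorer lift chunk retrieval, the mechanism the paper's results support.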

Merits

Strength in empirical analysis

The study provides a comprehensive empirical analysis of retrieval strategies, evaluating them at multiple levels of granularity and introducing an oracle-based analysis that establishes upper bounds on retrieval and generative performance.
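The oracle-based analysis rests on a simple metric: hit-based recall@k, the fraction of questions for which at least one gold page appears in the top-k retrieved results. A perfect within-document retriever hits whenever the gold page exists in the corpus, which yields the upper bound. A minimal sketch, with hypothetical per-question results:

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of questions where at least one gold page appears in the
    top-k retrieved pages (hit-based recall, common in RAG evaluation)."""
    hits = sum(1 for r, g in zip(retrieved, gold) if set(r[:k]) & set(g))
    return hits / len(gold)

# Hypothetical per-question results: retrieved page IDs and gold page IDs.
retrieved = [["p4", "p1"], ["p9", "p2"], ["p5", "p6"]]
gold = [{"p1"}, {"p3"}, {"p5"}]

print(recall_at_k(retrieved, gold, k=2))  # 2 of 3 questions hit
# Oracle variant: if every gold page is present in the corpus, a perfect
# retriever scores 1.0, and the gap to the observed recall is the headroom.
```

Comparing observed recall against this oracle ceiling is what lets the authors quantify how much room remains at the page and chunk levels.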

Significant contributions to Financial QA literature

The study gives systematic attention to a failure mode that the Financial QA literature has largely overlooked, underscoring how much contextual precision matters in high-stakes settings and providing a concrete baseline for future work.

Demerits

Limited generalizability to other domains

The study focuses on financial filings, and its page scorer and findings may not transfer to other domains without re-training, which limits broader applicability.

Dependence on high-quality training data

The performance of the domain-fine-tuned page scorer may depend on the quality of the training data, which could be a limitation in practice.

Expert Commentary

This study makes a meaningful contribution to the Financial QA literature by isolating the within-document retrieval failure mode and addressing it with a domain-fine-tuned page scorer. The empirical analysis is thorough, spanning multiple retrieval granularities, and the reported gains in page recall and chunk retrieval are substantial. That said, the findings are specific to financial filings and may not transfer to other domains, and the page scorer's performance likely depends on the quality of its fine-tuning data. On balance, its implications for retrieval in high-stakes settings make it a valuable addition to the field.

Recommendations

  • Future research should focus on developing retrieval strategies that are more robust and generalizable to other domains.
  • Researchers should investigate the impact of training data quality on the performance of domain-fine-tuned page scorers.
