Document-Level Numerical Reasoning across Single and Multiple Tables in Financial Reports
arXiv:2604.03664v1 Announce Type: new Abstract: Despite the strong language understanding abilities of large language models (LLMs), they still struggle with reliable question answering (QA) over long, structured documents, particularly for numerical reasoning. Financial annual reports exemplify this difficulty: financial statement analysis often hinges on accurate arithmetic, and analysts derive key indicators by integrating evidence scattered across multiple tables and narrative text. However, existing benchmarks focus largely on single-table settings, leaving cross-table document-level numerical reasoning underexplored. To address this gap, we introduce FinLongDocQA, a dataset for both single-table and cross-table financial numerical reasoning in long-context reports. Evaluating both closed-source and open-source LLMs on FinLongDocQA reveals two bottlenecks: (1) annual reports often exceed 129k tokens, exacerbating the context rot problem for locating relevant tables; and (2) even when relevant evidence is located, LLMs remain prone to errors in multi-step numerical reasoning. We propose FinLongDocAgent, a Multi-Agent Multi-Round Retrieval-Augmented Generation (RAG) approach that iteratively retrieves evidence, performs intermediate calculations, and verifies results across rounds. Experiments highlight the importance of iterative retrieval and verification for reliable numerical QA in long financial documents.
Executive Summary
This article presents FinLongDocQA, a dataset designed to evaluate the performance of large language models (LLMs) on financial numerical reasoning tasks involving long, structured documents, including single and multiple tables. The authors identify two key challenges: the 'context rot' problem caused by long document lengths and the tendency of LLMs to make errors in multi-step numerical reasoning. To address these challenges, they propose FinLongDocAgent, a multi-agent, multi-round retrieval-augmented generation approach that iteratively retrieves evidence, performs intermediate calculations, and verifies results. The authors' experiments demonstrate the effectiveness of FinLongDocAgent in improving the accuracy of numerical question answering in long financial documents. The work has important implications for the development of more robust and accurate language models for financial analysis.
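The retrieve-calculate-verify loop described above can be sketched in miniature. The following is a hypothetical illustration, not the authors' actual implementation: the toy keyword retriever, ratio calculator, and recomputation-based verifier (and all function names) are assumptions chosen only to make the multi-round control flow concrete.

```python
# Hypothetical sketch of a multi-round retrieve -> calculate -> verify loop
# in the spirit of FinLongDocAgent. All components here are toy stand-ins.
from dataclasses import dataclass


@dataclass
class Round:
    evidence: list   # table rows gathered so far
    result: float    # intermediate calculation (may be None)
    verified: bool   # did the verifier accept the result?


def retrieve(question, document, prior_evidence):
    # Toy retriever: return rows whose label shares a word with the question.
    terms = set(question.lower().split())
    hits = [row for row in document if terms & set(row["label"].lower().split())]
    return [h for h in hits if h not in prior_evidence]


def calculate(evidence):
    # Toy calculator: ratio of the first two retrieved values.
    if len(evidence) < 2:
        return None
    return evidence[0]["value"] / evidence[1]["value"]


def verify(result, evidence):
    # Toy verifier: independently recompute and compare.
    if result is None or len(evidence) < 2:
        return False
    return abs(result - evidence[0]["value"] / evidence[1]["value"]) < 1e-9


def multi_round_qa(question, document, max_rounds=3):
    # Iterate: gather more evidence each round until the verifier accepts.
    evidence, history = [], []
    for _ in range(max_rounds):
        evidence += retrieve(question, document, evidence)
        result = calculate(evidence)
        ok = verify(result, evidence)
        history.append(Round(list(evidence), result, ok))
        if ok:
            return result, history
    return None, history


# Tiny "document": flattened rows drawn from two financial statements.
doc = [
    {"label": "net income", "value": 120.0},
    {"label": "total revenue", "value": 600.0},
    {"label": "operating cash flow", "value": 90.0},
]
margin, rounds = multi_round_qa("net income over total revenue margin", doc)
# margin -> 0.2 (net margin), accepted in the first round
```

A real system would replace the keyword retriever with dense retrieval over chunked report sections, the calculator with an LLM-generated program, and the verifier with an independent checking agent; the loop structure, however, is the point being illustrated.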
Key Points
- ▸ FinLongDocQA is a new dataset for evaluating financial numerical reasoning in long, structured documents
- ▸ The dataset addresses a gap in existing benchmarks, which primarily focus on single-table settings
- ▸ FinLongDocAgent is a novel approach that iteratively retrieves evidence, performs intermediate calculations, and verifies results
Merits
Strength
The article addresses a significant gap in existing benchmarks and proposes a novel approach to improving the accuracy of numerical question answering in long financial documents.
Originality
The authors contribute a new dataset and approach that can be used to evaluate the performance of LLMs on financial numerical reasoning tasks.
Demerits
Limitation
The evaluation focuses on LLM performance without a human-analyst baseline, so it is unclear how the reported accuracy compares with that of professional analysts.
Scope
The dataset and approach are limited to financial numerical reasoning tasks, which may not be generalizable to other domains.
Expert Commentary
The article makes a significant contribution to natural language processing, particularly question answering over long documents. The authors' approach to mitigating the 'context rot' problem and improving numerical QA accuracy through iterative retrieval and verification is innovative and well-motivated. However, the evaluation is confined to LLMs on financial reports, which may limit how far the findings generalize to other document types and domains. Overall, the article is well-written and well-structured, and its contributions are significant.
Recommendations
- ✓ Future research should focus on evaluating the performance of FinLongDocAgent on more diverse datasets and tasks, such as non-financial documents and question answering tasks that require more complex reasoning.
- ✓ The authors should investigate the use of FinLongDocAgent in real-world financial analysis applications to evaluate its practical impact and identify areas for further improvement.
Sources
Original: arXiv - cs.CL