FinRule-Bench: A Benchmark for Joint Reasoning over Financial Tables and Principles
arXiv:2603.11339v1

Abstract: Large language models (LLMs) are increasingly applied to financial analysis, yet their ability to audit structured financial statements under explicit accounting principles remains poorly explored. Existing benchmarks primarily evaluate question answering, numerical reasoning, or anomaly detection on synthetically corrupted data, making it unclear whether models can reliably verify or localize rule compliance on correct financial statements. We introduce FinRule-Bench, a benchmark for evaluating diagnostic completeness in rule-based financial reasoning over real-world financial tables. FinRule-Bench pairs ground-truth financial statements with explicit, human-curated accounting principles and spans four canonical statement types: Balance Sheets, Cash Flow Statements, Income Statements, and Statements of Equity. The benchmark defines three auditing tasks that require progressively stronger reasoning capabilities: (i) rule verification, which tests compliance with a single principle; (ii) rule identification, which requires selecting the violated principle from a provided rule set; and (iii) joint rule diagnosis, which requires detecting and localizing multiple simultaneous violations at the record level. We evaluate LLMs under zero-shot and few-shot prompting, and introduce a causal-counterfactual reasoning protocol that enforces consistency between decisions, explanations, and counterfactual judgments. Across tasks and statement types, we find that while models perform well on isolated rule verification, performance degrades sharply for rule discrimination and multi-violation diagnosis. FinRule-Bench provides a principled and reproducible testbed for studying rule-governed reasoning, diagnostic coverage, and failure modes of LLMs in high-stakes financial analysis.
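The causal-counterfactual protocol described in the abstract can be understood as a consistency check: if a model flags a violation and blames a specific record, then repairing that record should flip the verdict. The sketch below illustrates the idea on a single hypothetical balance-sheet rule; the rule, field names, and checker are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical rule: Assets = Liabilities + Equity (with rounding tolerance).
def violates_accounting_identity(stmt: dict) -> bool:
    """Return True if the statement breaks the balance-sheet identity."""
    return abs(stmt["assets"] - (stmt["liabilities"] + stmt["equity"])) > 0.5

def counterfactually_consistent(stmt: dict, blamed_field: str) -> bool:
    """Check whether blaming `blamed_field` explains the violation.

    Repair the blamed field from the other two fields and re-run the rule;
    a consistent diagnosis makes the violation disappear.
    """
    if not violates_accounting_identity(stmt):
        return True  # nothing to explain
    repaired = dict(stmt)
    if blamed_field == "assets":
        repaired["assets"] = stmt["liabilities"] + stmt["equity"]
    elif blamed_field == "liabilities":
        repaired["liabilities"] = stmt["assets"] - stmt["equity"]
    elif blamed_field == "equity":
        repaired["equity"] = stmt["assets"] - stmt["liabilities"]
    return not violates_accounting_identity(repaired)

stmt = {"assets": 120.0, "liabilities": 70.0, "equity": 40.0}  # off by 10
print(violates_accounting_identity(stmt))           # True
print(counterfactually_consistent(stmt, "equity"))  # True: fixing equity resolves it
```

In the benchmark's protocol, this kind of check is applied to the model's own decision, explanation, and counterfactual judgment rather than computed mechanically as above.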
Executive Summary
This article introduces FinRule-Bench, a benchmark for evaluating the diagnostic capabilities of large language models (LLMs) in financial analysis. FinRule-Bench assesses LLMs' ability to audit structured financial statements under explicit accounting principles, a capability that remains poorly explored. The benchmark pairs real-world financial tables with human-curated accounting principles and defines three auditing tasks of increasing difficulty: rule verification, rule identification, and joint rule diagnosis. The results show that while models perform well on isolated rule verification, performance degrades sharply on rule discrimination and multi-violation diagnosis. FinRule-Bench thus provides a principled, reproducible testbed for studying rule-governed reasoning, diagnostic coverage, and failure modes of LLMs in high-stakes financial analysis, and underscores the need for more robust and accurate financial analysis tools.
Key Points
- ▸ FinRule-Bench is a new benchmark for evaluating LLMs' diagnostic capabilities in financial analysis.
- ▸ The benchmark assesses LLMs' ability to audit structured financial statements under explicit accounting principles.
- ▸ FinRule-Bench defines three auditing tasks of progressively increasing difficulty: rule verification, rule identification, and joint rule diagnosis.
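The three auditing tasks can be made concrete with a toy example. Below, a small hypothetical rule set is applied to one simplified balance-sheet record; the rules, field names, and data are illustrative and are not drawn from the benchmark itself.

```python
# Toy rule set over a simplified balance-sheet record (illustrative only).
RULES = {
    "R1_identity": lambda s: abs(s["assets"] - (s["liabilities"] + s["equity"])) < 0.5,
    "R2_nonneg_assets": lambda s: s["assets"] >= 0,
    "R3_nonneg_equity": lambda s: s["equity"] >= 0,
}

stmt = {"assets": 100.0, "liabilities": 120.0, "equity": -20.0}

# Task (i): rule verification -- does one named rule hold for this record?
print(RULES["R1_identity"](stmt))   # True: 100 == 120 + (-20)

# Task (ii): rule identification -- which rule(s) in the set are violated?
violated = [name for name, rule in RULES.items() if not rule(stmt)]
print(violated)                     # ['R3_nonneg_equity']

# Task (iii): joint rule diagnosis would additionally localize the offending
# record-level entries (here, the 'equity' field) when several rules are
# violated at once.
```

The benchmark poses these tasks to LLMs in natural language over full statements; the mechanical checks above only convey what each task asks for.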
Merits
Principled, reproducible evaluation
Provides a principled and reproducible testbed for studying rule-governed reasoning, diagnostic coverage, and failure modes of LLMs in high-stakes financial analysis.
Improves understanding of LLM limitations
The study exposes concrete limitations of LLMs in financial analysis: although they handle single-rule verification well, performance degrades sharply on rule discrimination and multi-violation diagnosis.
Enhances financial analysis tools
The development of FinRule-Bench contributes to the improvement of financial analysis tools by identifying areas for improvement and providing a framework for evaluating LLMs' diagnostic capabilities.
Demerits
Limited dataset
The benchmark's scope is constrained by the limited availability of comprehensive, diverse real-world financial tables.
Bias in human-curated principles
Human-curated accounting principles may introduce curator bias into the evaluation of LLMs' diagnostic capabilities.
Need for further research
The findings are diagnostic rather than corrective; further research is needed to turn the identified failure modes into more robust and accurate financial analysis tools.
Expert Commentary
The finding that LLMs falter on rule discrimination and multi-violation diagnosis is significant for any deployment of these models in auditing workflows. FinRule-Bench is a crucial step toward addressing this gap, offering a principled, reproducible testbed for studying rule-governed reasoning, diagnostic coverage, and failure modes in high-stakes financial analysis. Two caveats temper the results: the benchmark's coverage depends on the availability of comprehensive, diverse real-world financial tables, and its human-curated accounting principles may introduce curator bias into the evaluation. Even so, the implications for financial regulation and for building more robust, accurate analysis tools warrant further attention.
Recommendations
- ✓ Develop and implement FinRule-Bench and similar benchmarks to evaluate LLMs' diagnostic capabilities in financial analysis.
- ✓ Prioritize the development of more robust and accurate financial analysis tools so that automated audits of financial statements remain trustworthy.
- ✓ Conduct further research on the failure modes identified here, and use the findings to inform regulations governing the use of LLMs in financial analysis.