FinRule-Bench: A Benchmark for Joint Reasoning over Financial Tables and Principles
arXiv:2603.11339v1

Abstract: Large language models (LLMs) are increasingly applied to financial analysis, yet their ability to audit structured financial statements under explicit accounting principles remains poorly explored. Existing benchmarks primarily evaluate question answering, numerical reasoning, or anomaly detection on synthetically corrupted data, making it unclear whether models can reliably verify or localize rule compliance on correct financial statements. We introduce FinRule-Bench, a benchmark for evaluating diagnostic completeness in rule-based financial reasoning over real-world financial tables. FinRule-Bench pairs ground-truth financial statements with explicit, human-curated accounting principles and spans four canonical statement types: Balance Sheets, Cash Flow Statements, Income Statements, and Statements of Equity. The benchmark defines three auditing tasks that require progressively stronger reasoning capabilities: (i) rule verification, which tests compliance with a single principle; (ii) rule identification, which requires selecting the violated principle from a provided rule set; and (iii) joint rule diagnosis, which requires detecting and localizing multiple simultaneous violations at the record level. We evaluate LLMs under zero-shot and few-shot prompting, and introduce a causal-counterfactual reasoning protocol that enforces consistency between decisions, explanations, and counterfactual judgments. Across tasks and statement types, we find that while models perform well on isolated rule verification, performance degrades sharply for rule discrimination and multi-violation diagnosis. FinRule-Bench provides a principled and reproducible testbed for studying rule-governed reasoning, diagnostic coverage, and failure modes of LLMs in high-stakes financial analysis.
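The causal-counterfactual protocol described in the abstract can be understood as a consistency check: if a model flags a violation and blames a specific record, then repairing that record should flip the verdict. The sketch below illustrates the idea on a single hypothetical balance-sheet rule; the rule, field names, and checker are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical rule: Assets = Liabilities + Equity (with rounding tolerance).
def violates_accounting_identity(stmt: dict) -> bool:
    """Return True if the statement breaks the balance-sheet identity."""
    return abs(stmt["assets"] - (stmt["liabilities"] + stmt["equity"])) > 0.5

def counterfactually_consistent(stmt: dict, blamed_field: str) -> bool:
    """Check whether blaming `blamed_field` explains the violation.

    Repair the blamed field from the other two fields and re-run the rule;
    a consistent diagnosis makes the violation disappear.
    """
    if not violates_accounting_identity(stmt):
        return True  # nothing to explain
    repaired = dict(stmt)
    if blamed_field == "assets":
        repaired["assets"] = stmt["liabilities"] + stmt["equity"]
    elif blamed_field == "liabilities":
        repaired["liabilities"] = stmt["assets"] - stmt["equity"]
    elif blamed_field == "equity":
        repaired["equity"] = stmt["assets"] - stmt["liabilities"]
    return not violates_accounting_identity(repaired)

stmt = {"assets": 120.0, "liabilities": 70.0, "equity": 40.0}  # off by 10
print(violates_accounting_identity(stmt))           # True
print(counterfactually_consistent(stmt, "equity"))  # True: fixing equity resolves it
```

In the benchmark's protocol, this kind of check is applied to the model's own decision, explanation, and counterfactual judgment rather than computed mechanically as above.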
Executive Summary
This article introduces FinRule-Bench, a benchmark for evaluating the diagnostic capabilities of large language models (LLMs) in financial analysis. FinRule-Bench assesses LLMs' ability to audit structured financial statements under explicit accounting principles, a capability that remains poorly explored. The benchmark pairs real-world financial tables with human-curated accounting principles and defines three auditing tasks of increasing difficulty: rule verification, rule identification, and joint rule diagnosis. The results show that while models perform well on isolated rule verification, performance degrades sharply on rule discrimination and multi-violation diagnosis. FinRule-Bench thus provides a principled, reproducible testbed for studying rule-governed reasoning, diagnostic coverage, and failure modes of LLMs in high-stakes financial analysis, and underscores the need for more robust and accurate financial analysis tools.
Key Points
- ▸ FinRule-Bench is a new benchmark for evaluating LLMs' diagnostic capabilities in financial analysis.
- ▸ The benchmark assesses LLMs' ability to audit structured financial statements under explicit accounting principles.
- ▸ FinRule-Bench defines three auditing tasks of progressively increasing difficulty: rule verification, rule identification, and joint rule diagnosis.
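The three auditing tasks can be made concrete with a toy example. Below, a small hypothetical rule set is applied to one simplified balance-sheet record; the rules, field names, and data are illustrative and are not drawn from the benchmark itself.

```python
# Toy rule set over a simplified balance-sheet record (illustrative only).
RULES = {
    "R1_identity": lambda s: abs(s["assets"] - (s["liabilities"] + s["equity"])) < 0.5,
    "R2_nonneg_assets": lambda s: s["assets"] >= 0,
    "R3_nonneg_equity": lambda s: s["equity"] >= 0,
}

stmt = {"assets": 100.0, "liabilities": 120.0, "equity": -20.0}

# Task (i): rule verification -- does one named rule hold for this record?
print(RULES["R1_identity"](stmt))   # True: 100 == 120 + (-20)

# Task (ii): rule identification -- which rule(s) in the set are violated?
violated = [name for name, rule in RULES.items() if not rule(stmt)]
print(violated)                     # ['R3_nonneg_equity']

# Task (iii): joint rule diagnosis would additionally localize the offending
# record-level entries (here, the 'equity' field) when several rules are
# violated at once.
```

The benchmark poses these tasks to LLMs in natural language over full statements; the mechanical checks above only convey what each task asks for.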
Merits
Principled, reproducible evaluation
Provides a principled and reproducible testbed for studying rule-governed reasoning, diagnostic coverage, and failure modes of LLMs in high-stakes financial analysis.
Improves understanding of LLM limitations
The study exposes concrete limitations of LLMs in financial analysis: although they handle single-rule verification well, performance degrades sharply on rule discrimination and multi-violation diagnosis.
Enhances financial analysis tools
The development of FinRule-Bench contributes to the improvement of financial analysis tools by identifying areas for improvement and providing a framework for evaluating LLMs' diagnostic capabilities.
Demerits
Limited dataset
The benchmark's scope is constrained by the limited availability of comprehensive, diverse real-world financial tables.
Bias in human-curated principles
Human-curated accounting principles may introduce curator bias into the evaluation of LLMs' diagnostic capabilities.
Need for further research
The findings are diagnostic rather than corrective; further research is needed to turn the identified failure modes into more robust and accurate financial analysis tools.
Expert Commentary
The finding that LLMs falter on rule discrimination and multi-violation diagnosis is significant for any deployment of these models in auditing workflows. FinRule-Bench is a crucial step toward addressing this gap, offering a principled, reproducible testbed for studying rule-governed reasoning, diagnostic coverage, and failure modes in high-stakes financial analysis. Two caveats temper the results: the benchmark's coverage depends on the availability of comprehensive, diverse real-world financial tables, and its human-curated accounting principles may introduce curator bias into the evaluation. Even so, the implications for financial regulation and for building more robust, accurate analysis tools warrant further attention.
Recommendations
- ✓ Develop and implement FinRule-Bench and similar benchmarks to evaluate LLMs' diagnostic capabilities in financial analysis.
- ✓ Prioritize the development of more robust and accurate financial analysis tools so that automated audits of financial statements remain trustworthy.
- ✓ Conduct further research on the failure modes identified here, and use the findings to inform regulations governing the use of LLMs in financial analysis.