BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios
arXiv:2602.17072v1 Abstract: Chatbots based on large language models (LLMs) are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy on core banking computations, including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors: misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. Such errors have rarely been captured by existing benchmarks; mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized into three levels of difficulty (basic, intermediate, and advanced), corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset's effectiveness in enhancing domain-specific reasoning. With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6 percentage points (basic), 75.1 (intermediate), and 62.9 (advanced) over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs' numerical reasoning in real-world banking scenarios.
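To make the abstract's task descriptions concrete, here is a minimal Python sketch (not from the paper) of the kind of multi-step arithmetic involved: the maturity payout of a lump-sum deposit under monthly compounding, and of a monthly savings plan, whose payout is exactly the sort of geometric series the abstract says models mishandle. The products, rates, terms, and amounts are illustrative assumptions, not items from the dataset.

```python
# Illustrative banking arithmetic of the type BankMathBench targets.
# The formulas are standard compound-interest math; the inputs are
# made-up examples, not items from the benchmark.

def deposit_payout(principal: float, annual_rate: float, months: int) -> float:
    """Maturity payout of a lump-sum deposit with monthly compounding."""
    i = annual_rate / 12
    return principal * (1 + i) ** months

def savings_payout(monthly: float, annual_rate: float, months: int) -> float:
    """Maturity payout of a monthly savings plan (installment paid at the
    start of each month). Installment k compounds for n - k + 1 periods,
    so the total is the geometric series m * sum_{j=1}^{n} (1 + i)^j."""
    i = annual_rate / 12
    return monthly * (1 + i) * ((1 + i) ** months - 1) / i

print(f"Deposit payout: {deposit_payout(10_000_000, 0.035, 12):,.0f}")
print(f"Savings payout: {savings_payout(1_000_000, 0.035, 12):,.0f}")
```

Comparing such payouts across products with different rates and compounding rules is precisely the multi-step reasoning that the benchmark's intermediate level probes.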
Executive Summary
The paper introduces BankMathBench, a benchmark dataset designed to evaluate and improve the numerical reasoning of large language models (LLMs) in banking scenarios. It addresses a gap in existing benchmarks, which overlook the multi-step numerical reasoning required for everyday banking tasks such as payout estimation and product comparison. BankMathBench is structured into three difficulty levels covering single-product reasoning, multi-product comparison, and multi-condition scenarios. Training open-source LLMs on the dataset markedly improves their formula generation and numerical reasoning accuracy, with gains of 57.6 to 75.1 percentage points over zero-shot baselines when combined with tool-augmented fine-tuning. This work underscores the value of domain-specific datasets for the practical application of LLMs in the financial sector.
Key Points
- ▸ BankMathBench is a domain-specific dataset for evaluating LLMs in banking scenarios.
- ▸ The dataset is organized into three levels of difficulty: basic, intermediate, and advanced.
- ▸ Training on BankMathBench significantly improves LLMs' numerical reasoning accuracy.
- ▸ Tool-augmented fine-tuning leads to substantial accuracy gains over zero-shot baselines (a sketch of this pattern follows the list).
- ▸ The dataset addresses a critical gap in existing benchmarks for financial and mathematical reasoning.
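The abstract does not describe the tool interface used in tool-augmented fine-tuning, but a common pattern is to have the model emit a formula and let a calculator tool evaluate it, offloading arithmetic instead of generating digits token by token. The sketch below is a minimal, assumed version of such a calculator built on a whitelisted expression evaluator; it is not the paper's implementation.

```python
# A minimal sketch of a calculator tool for tool-augmented reasoning.
# The model answers a payout question with a formula string; the harness
# evaluates it safely instead of trusting the model's own arithmetic.
import ast
import operator

# Whitelisted operators, so we never eval() arbitrary model output.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calculator(expression: str) -> float:
    """Safely evaluate an arithmetic expression emitted by the model."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("disallowed syntax in model output")
    return ev(ast.parse(expression, mode="eval"))

# The deposit example from earlier, now computed by the tool (~10,355,669):
print(calculator("10000000 * (1 + 0.035 / 12) ** 12"))
```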
Merits
Domain-Specific Focus
BankMathBench is specifically designed for banking scenarios, addressing a significant gap in existing benchmarks that often focus on fundamental math problems or financial documents.
Structured Difficulty Levels
The dataset's organization into three levels of difficulty allows for a comprehensive evaluation of LLMs' capabilities, from basic to advanced banking tasks.
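As an illustration of what a multi-condition ("advanced") item might involve, the sketch below computes the payout when a deposit is closed before maturity. Banks commonly pay a reduced early-withdrawal rate in that case; the 50% rate haircut and simple-interest treatment here are assumptions for illustration, not rules taken from the dataset.

```python
# A hedged sketch of an advanced, multi-condition scenario: early
# termination of a deposit at a penalized rate. The penalty factor and
# simple-interest accrual are illustrative assumptions.

def early_termination_payout(principal: float,
                             contract_rate: float,
                             months_elapsed: int,
                             penalty_factor: float = 0.5) -> float:
    """Payout if the deposit is closed early: simple interest at a
    reduced rate, accrued only for the months actually held."""
    early_rate = contract_rate * penalty_factor
    return principal * (1 + early_rate * months_elapsed / 12)

# Closing a 3.5% one-year deposit after 7 months at half the rate:
print(f"{early_termination_payout(10_000_000, 0.035, 7):,.0f}")
```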
Significant Accuracy Improvements
The study demonstrates substantial improvements in LLMs' numerical reasoning accuracy when trained on BankMathBench, highlighting its effectiveness in enhancing domain-specific reasoning.
Demerits
Limited Scope
While BankMathBench addresses a critical gap, its focus on banking scenarios may limit its applicability to other financial domains or more general numerical reasoning tasks.
Dependence on Open-Source Models
The study primarily uses open-source LLMs, which may not fully represent the capabilities of proprietary models that could potentially perform better with similar training.
Potential Bias in Dataset
The scenarios may encode the product structures, rate conventions, and conditions of the markets they were drawn from, which could limit the generalizability of the findings to other banking systems.
Expert Commentary
The introduction of BankMathBench represents a significant advancement in the evaluation and improvement of LLMs for banking scenarios. The dataset's structured approach, encompassing basic to advanced tasks, provides a comprehensive framework for assessing numerical reasoning capabilities. The substantial accuracy improvements observed in the study highlight the potential of domain-specific training to enhance model performance. However, the dataset's limited scope and reliance on open-source models warrant further investigation. Additionally, the ethical and regulatory implications of deploying such models in the financial sector cannot be overlooked. As AI continues to permeate the financial industry, it is crucial to develop robust, unbiased datasets and establish clear regulatory guidelines to ensure the responsible and effective use of these technologies.
Recommendations
- ✓ Expand the scope of BankMathBench to include a broader range of financial domains beyond banking to enhance its applicability and generalizability.
- ✓ Conduct further research to evaluate the performance of proprietary LLMs on BankMathBench to provide a more comprehensive assessment of model capabilities.
- ✓ Develop ethical guidelines and regulatory frameworks for the use of AI in financial services to ensure transparency, accountability, and consumer protection.