
ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models


Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan

arXiv:2602.18776v1 Abstract: We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individual test cases and tracks extraction methods to measure structured output generation. Evaluation reveals substantial performance variation, with accuracy ranging from 14.29% to 99.05% across models and strategies. Few-shot Chain-of-Thought prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06% vs 28.76%). A striking finding emerges: models achieving elite accuracy (98-99%) often produce predominantly unstructured output, with most responses lacking Arabic CoT markers. Only 6 models consistently generate structured output across all test cases, while the majority require fallback extraction methods despite high numerical accuracy. Comprehensive evaluation of 281 model-strategy combinations demonstrates that numerical accuracy and instruction-following represent distinct capabilities, establishing baselines for Arabic number comprehension and providing actionable guidance for model selection in production Arabic NLP systems.

Executive Summary

The article introduces ArabicNumBench, a benchmark for evaluating large language models (LLMs) on Arabic number reading tasks, covering both Eastern Arabic-Indic and Western Arabic numerals. The study evaluates 71 models from 10 providers using four prompting strategies across 210 tasks, yielding 59,010 individual test cases (210 tasks × 281 model-strategy combinations). The findings reveal substantial performance variation, with accuracy ranging from 14.29% to 99.05%. Few-shot Chain-of-Thought (CoT) prompting achieves 2.8x the accuracy of zero-shot prompting (80.06% vs 28.76%). Notably, models with high numerical accuracy often produce unstructured output, exposing a disconnect between numerical accuracy and structured output generation. The study establishes baselines for Arabic number comprehension and offers guidance for model selection in production Arabic NLP systems.
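
To make the two numeral systems concrete, here is a minimal Python sketch (not from the paper) that normalizes Eastern Arabic-Indic digits (Unicode U+0660-U+0669) to their Western ASCII counterparts, the kind of preprocessing any evaluation over both scripts has to perform:

```python
# Minimal sketch (not from the paper): normalize Eastern Arabic-Indic
# digits (U+0660-U+0669) to Western ASCII digits before comparison.
EASTERN = "٠١٢٣٤٥٦٧٨٩"   # Arabic-Indic digits zero through nine
WESTERN = "0123456789"
EAST_TO_WEST = str.maketrans(EASTERN, WESTERN)

def normalize_digits(text: str) -> str:
    """Map every Eastern Arabic-Indic digit in `text` to its Western form."""
    return text.translate(EAST_TO_WEST)

assert normalize_digits("٢٠٢٤") == "2024"
assert normalize_digits("السعر ٩٩ ريال") == "السعر 99 ريال"  # "price 99 riyals"
```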

Key Points

  • ArabicNumBench evaluates 71 LLMs on Arabic number reading tasks.
  • Few-shot CoT prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06% vs 28.76%); a prompt-assembly sketch follows this list.
  • High numerical accuracy does not necessarily correlate with structured output.
  • Only 6 models consistently generate structured output across all test cases.
  • Numerical accuracy and instruction-following are distinct capabilities.
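
The abstract does not reproduce the benchmark's prompt templates, so the following Python sketch is only illustrative of how a few-shot CoT prompt for a number-reading task might be assembled; the exemplars, instructions, and answer format are assumptions, not ArabicNumBench's actual prompts.

```python
# Illustrative sketch of few-shot CoT prompt assembly; the exemplars and
# instructions are assumptions, not ArabicNumBench's actual templates.
FEW_SHOT_EXEMPLARS = [
    # (input text, worked reasoning, final answer)
    ("التاريخ: ١٤٤٥/٠٩/١٥",
     "The Eastern Arabic-Indic digits ١٤٤٥/٠٩/١٥ map to 1445/09/15.",
     "1445/09/15"),
    ("السعر ٢٥٠ ريال",
     "٢٥٠ is ٢=2, ٥=5, ٠=0, so the price is 250.",
     "250"),
]

def build_few_shot_cot_prompt(task_text: str) -> str:
    """Assemble a few-shot CoT prompt: worked exemplars, then the new task."""
    parts = ["Read the number in each text. Think step by step, "
             "then give the final answer after 'Answer:'.\n"]
    for text, reasoning, answer in FEW_SHOT_EXEMPLARS:
        parts.append(f"Text: {text}\nReasoning: {reasoning}\nAnswer: {answer}\n")
    parts.append(f"Text: {task_text}\nReasoning:")
    return "\n".join(parts)

print(build_few_shot_cot_prompt("الكمية: ٧٥ كيلوغرام"))  # "quantity: 75 kg"
```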

Merits

Comprehensive Evaluation

The study evaluates 71 models from 10 providers under four prompting strategies, providing a thorough assessment of current LLM capabilities in Arabic number reading.

Detailed Analysis

The evaluation tracks how answers are extracted from model output, offering insight into models' ability to follow formatting instructions and produce machine-parseable structured responses.
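
The abstract does not specify the extraction pipeline, so the sketch below is a plausible reconstruction, not the paper's code: first attempt to parse a structured answer (here assumed to be a JSON object with an "answer" field), and fall back to scanning the raw response for the last digit run in either numeral system.

```python
import json
import re

# Matches runs of Western (0-9) or Eastern Arabic-Indic (٠-٩) digits,
# allowing internal separators such as '/', '.', ',' for dates and prices.
DIGIT_RUN = re.compile(r"[0-9\u0660-\u0669][0-9\u0660-\u0669/.,]*")

def extract_answer(response: str) -> tuple[str | None, str]:
    """Return (answer, method): structured JSON first, regex fallback second.

    The JSON-with-"answer"-field format is an assumption for illustration;
    the paper only reports that most models need fallback extraction.
    """
    try:
        parsed = json.loads(response)
        if isinstance(parsed, dict) and "answer" in parsed:
            return str(parsed["answer"]), "structured"
    except json.JSONDecodeError:
        pass
    matches = DIGIT_RUN.findall(response)
    if matches:
        return matches[-1].rstrip("/.,"), "fallback_regex"
    return None, "no_extraction"

print(extract_answer('{"answer": "1445/09/15"}'))   # -> structured
print(extract_answer("The price is ٢٥٠ riyals."))   # -> fallback_regex
```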

Actionable Insights

The findings provide actionable guidance for model selection and improvement in production Arabic NLP systems.

Demerits

Limited Scope

The benchmark focuses solely on number reading tasks, which may not fully capture the broader capabilities of LLMs in Arabic NLP.

Model Variability

The substantial performance variation across models and strategies means reported capabilities are highly sensitive to prompting choices, underscoring the need for standardized evaluation protocols and further benchmarking efforts.

Structured Output Discrepancy

The disconnect between numerical accuracy and structured output generation suggests a need for further research into instruction-following mechanisms in LLMs.
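
One way to make the accuracy/instruction-following distinction operational is to score each response twice and report the two rates separately. This is a hedged sketch built on the normalize_digits and extract_answer helpers sketched above; it illustrates the paper's claim rather than reproducing its scoring code.

```python
# Hedged sketch: accuracy and format compliance as separate metrics,
# reusing the extract_answer() and normalize_digits() helpers above.
def score(responses: list[str], gold: list[str]) -> dict[str, float]:
    correct = compliant = 0
    for response, expected in zip(responses, gold):
        answer, method = extract_answer(response)
        if answer is not None and normalize_digits(answer) == normalize_digits(expected):
            correct += 1
        if method == "structured":
            compliant += 1
    n = len(gold)
    return {"accuracy": correct / n, "compliance": compliant / n}

# A model can top one metric while failing the other:
print(score(['The total is ٢٥٠.', '{"answer": "wrong"}'], ["250", "1445"]))
# -> {'accuracy': 0.5, 'compliance': 0.5}
```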

Expert Commentary

The introduction of ArabicNumBench represents a significant step forward in the evaluation of large language models for Arabic number reading tasks. The evaluation of 71 models under diverse prompting strategies provides a robust assessment of current capabilities and highlights critical areas for improvement. The striking finding that high numerical accuracy does not necessarily correlate with structured output generation underscores the complexity of instruction-following in LLMs. This discrepancy suggests that future research should focus on developing models that not only achieve high accuracy but also consistently produce structured, machine-parseable responses. The study's actionable insights are particularly valuable for practitioners, offering guidance on model selection and prompting strategies. However, the benchmark's restriction to number reading tasks indicates the need for further research into other aspects of Arabic NLP. Overall, this study sets a strong foundation for future work in multilingual NLP and highlights the importance of benchmarking in advancing the field.

Recommendations

  • Develop standardized benchmarks for evaluating LLMs on a broader range of Arabic NLP tasks.
  • Investigate the mechanisms underlying structured output generation to improve instruction-following capabilities in LLMs.
  • Encourage further research into multilingual NLP to address underrepresented languages and tasks.
