MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs

arXiv:2603.07539v1. Abstract: Islamic inheritance law ('ilm al-mawarith) is challenging for large language models because solving inheritance cases requires complex, structured multi-step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce MAWARITH, a large-scale annotated dataset of 12,500 Arabic inheritance cases to train and evaluate the full reasoning chain: (i) identifying eligible heirs, (ii) applying blocking (hajb) and allocation rules, and (iii) computing exact inheritance shares. Unlike prior datasets that restrict inheritance case solving to multiple-choice questions, MAWARITH supports the full reasoning chain and provides step-by-step solutions, including intermediate legal decisions and justifications based on classical juristic sources and established inheritance rules, as well as exact share calculations. To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline. We evaluate five LLMs in a zero-shot setting. Gemini-2.5-flash achieves about 90% MIR-E on both validation and test, while Fanar-C, Fanar-Sadiq, LLaMA 3, and Qwen 3 remain below 50%. Our error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification, errors in share allocation, and missing or incorrect application of key inheritance rules such as 'awl and radd. The MAWARITH dataset is publicly available at https://github.com/bouchekif/inheritance_evaluation.
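To make the 'awl and radd rules named in the abstract concrete, here is a minimal Python sketch of the arithmetic on two textbook cases. It is not taken from the paper or its repository: the function name `normalize_shares` and the heir labels are hypothetical, and the sketch assumes the fixed shares have already been assigned correctly. It also simplifies radd to the case with no spouse and no residuary heir.

```python
from fractions import Fraction


def normalize_shares(fixed_shares):
    """Scale fixed shares proportionally so they sum to exactly 1.

    Simplifying assumption (not from the paper): this covers 'awl, where
    the fixed shares sum to more than 1, and radd with no spouse and no
    residuary heir, where they sum to less than 1.
    """
    total = sum(fixed_shares.values())
    return {heir: share / total for heir, share in fixed_shares.items()}


# 'awl: husband 1/2 + two full sisters 2/3 = 7/6, so the base 6 is raised to 7.
awl_case = normalize_shares(
    {"husband": Fraction(1, 2), "two_full_sisters": Fraction(2, 3)}
)

# radd: mother 1/6 + daughter 1/2 = 2/3, so the surplus is returned pro rata.
radd_case = normalize_shares(
    {"mother": Fraction(1, 6), "daughter": Fraction(1, 2)}
)

print(awl_case)   # husband 3/7, two full sisters 4/7
print(radd_case)  # mother 1/4, daughter 3/4
```

Exact fractions are used instead of floats because the task requires exact shares. A full solver would also need the heir-identification and blocking steps, which this sketch deliberately leaves out.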

Executive Summary

This article introduces MAWARITH, a large-scale annotated dataset of 12,500 Arabic inheritance cases for training and evaluating Large Language Models (LLMs) on the full reasoning chain of Islamic inheritance law: identifying eligible heirs, applying blocking (hajb) and allocation rules, and computing exact shares. Each case comes with a step-by-step solution, including intermediate legal decisions and their justifications. The authors also propose MIR-E, a weighted multi-stage metric that scores the reasoning stages rather than only the final answer. In a zero-shot evaluation of five LLMs, Gemini-2.5-flash reaches about 90% MIR-E on both validation and test, while Fanar-C, Fanar-Sadiq, LLaMA 3, and Qwen 3 stay below 50%. The error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification and share allocation, and missing or incorrect application of rules such as 'awl and radd. The dataset is publicly available.

Key Points

  • MAWARITH is a large-scale annotated dataset of 12,500 Arabic inheritance cases.
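  • The dataset supports the full reasoning chain: heir identification, blocking (hajb) and allocation rules, and exact share computation.
  • MIR-E is a weighted multi-stage metric that scores key reasoning stages and captures error propagation, rather than only final-answer accuracy.
  • In a zero-shot setting, Gemini-2.5-flash achieves about 90% MIR-E on validation and test; Fanar-C, Fanar-Sadiq, LLaMA 3, and Qwen 3 remain below 50%.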
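The abstract does not give MIR-E's stage definitions, weights, or propagation rule, so the following Python sketch only illustrates what a weighted multi-stage score with error propagation could look like. The stage names, the weights, and the capping rule are all assumptions made for illustration. This is not the authors' metric.

```python
def weighted_stage_score(stage_scores, weights):
    """Weighted average of per-stage scores with a simple propagation rule.

    Hypothetical rule (not from the paper): each stage's credit is capped
    by the effective score of the stage before it, so an early mistake
    limits how much credit later stages can earn.
    """
    total_weight = sum(weights.values())
    score = 0.0
    cap = 1.0
    for stage, weight in weights.items():  # dicts keep insertion order
        effective = min(stage_scores[stage], cap)
        score += weight * effective
        cap = effective
    return score / total_weight


# Hypothetical stage names and weights, NOT taken from the paper.
weights = {"heirs": 0.3, "blocking": 0.3, "shares": 0.4}

# A model that identifies only half the heirs correctly but gets the
# later stages right is still held to 0.5 on every stage under this rule.
scores = {"heirs": 0.5, "blocking": 1.0, "shares": 1.0}
result = weighted_stage_score(scores, weights)
print(result)
```

A plain weighted average would give this example 0.85. The capping rule is one simple way to make an early error reduce downstream credit, which is the behavior the abstract attributes to MIR-E.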

Merits

Comprehensive dataset

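MAWARITH pairs each of its 12,500 cases with a step-by-step solution covering heir identification, blocking (hajb) and allocation rules, and exact share calculations, with justifications grounded in classical juristic sources. Unlike earlier multiple-choice datasets, this makes it usable for both training and stage-by-stage evaluation.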

Demerits

Limited scope

The dataset covers only Arabic-language cases under Islamic inheritance law, so results on it may not transfer to other languages or legal systems. The reported evaluation is also limited to five LLMs in a zero-shot setting.

Expert Commentary

The article makes a useful contribution to LLM research on legal tasks, specifically Islamic inheritance law. The MAWARITH dataset gives researchers and developers worked solutions rather than answer keys alone, and MIR-E evaluates the intermediate reasoning stages instead of only the final shares. The gap between Gemini-2.5-flash (about 90% MIR-E) and the other four models (below 50%) shows that most of the evaluated LLMs are not yet reliable on this task. The dataset's restricted scope points to the need for further work on more diverse legal datasets. More broadly, the results underline that LLMs must become more accurate and reliable before they are used in legal applications that affect the administration of justice and access to legal services.

Recommendations

  • Develop more diverse and comprehensive legal datasets to improve the generalizability of LLMs.
  • Investigate the use of transfer learning and domain adaptation techniques to improve LLMs' performance in handling complex legal tasks.