IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions

arXiv:2602.21226v1 (cross-listed)

Abstract: As millions of Muslims turn to LLMs like GPT, Claude, and DeepSeek for religious guidance, a critical question arises: Can these AI systems reliably reason about Islamic law? We introduce IslamicLegalBench, the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence, with 718 instances covering 13 tasks of varying complexity. Evaluation of nine state-of-the-art models reveals major limitations: the best model achieves only 68% correctness with 21% hallucination, while several models fall below 35% correctness and exceed 55% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%. Moderate-complexity tasks requiring exact knowledge show the highest errors, whereas high-complexity tasks display apparent competence through semantic reasoning. False premise detection indicates risky sycophancy, with 6 of 9 models accepting misleading assumptions at rates above 40%. These results h…

Ezieddin Elmahjub, Junaid Qadir, Abdullah Mushtaq, Rafay Naeem, Ibrahim Ghaznavi, Waleed Iqbal · March 2, 2026

#cs.CL #cs.AI

Executive Summary

The article 'IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions' introduces a benchmark for assessing how well large language models (LLMs) understand and reason about Islamic law. The study evaluates nine state-of-the-art LLMs across seven schools of Islamic jurisprudence, using 718 instances covering 13 tasks of varying complexity. The findings reveal significant limitations: the best model reaches only 68% correctness with a 21% hallucination rate, while several models fall below 35% correctness and exceed 55% hallucination. The study highlights the risks of relying on LLMs for religious guidance and emphasizes the need for a systematic framework to evaluate Islamic legal reasoning in AI.

Key Points

▸ Introduction of IslamicLegalBench as the first benchmark for evaluating LLMs' understanding of Islamic law.
▸ Evaluation of nine state-of-the-art LLMs across seven schools of Islamic jurisprudence.
▸ Significant limitations in models' performance, with the best model achieving 68% correctness and 21% hallucination.
▸ Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%.
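▸ Moderate-complexity tasks requiring exact knowledge show the highest error rates, and false premise detection reveals risky sycophancy: 6 of 9 models accept misleading assumptions at rates above 40%.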
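The correctness and hallucination percentages quoted above are per-model rates over graded responses. As a minimal sketch of how such rates could be tallied, assuming each response has already been graded with one of three hypothetical labels ("correct", "hallucinated", "other"); the paper's actual grading categories and scoring protocol are not described in this summary:

```python
from collections import Counter


def rates(labels):
    """Return (correctness, hallucination) as fractions of all graded responses.

    The label names are assumptions made for this illustration only.
    """
    if not labels:
        raise ValueError("no graded responses")
    counts = Counter(labels)
    total = len(labels)
    return counts["correct"] / total, counts["hallucinated"] / total


# Hypothetical toy grades for a single model; not data from the paper.
toy_labels = ["correct", "correct", "hallucinated", "other", "correct"]
correctness, hallucination = rates(toy_labels)
print(f"correctness={correctness:.0%} hallucination={hallucination:.0%}")
```

Because the two rates need not sum to 100% (68% and 21% for the best model, for instance), the sketch keeps a residual "other" label rather than treating every non-correct answer as a hallucination.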

Merits

Comprehensive Evaluation

The study provides a thorough and systematic evaluation of LLMs' capabilities in understanding Islamic law, covering a wide range of tasks and complexity levels.

Novel Benchmark

IslamicLegalBench is the first of its kind, offering a standardized framework for assessing Islamic legal reasoning in AI.

Critical Insights

The findings highlight significant limitations and risks associated with relying on LLMs for religious guidance, which is crucial for both developers and users.

Demerits

Limited Model Selection

The study evaluates only nine state-of-the-art models, which may not represent the full spectrum of LLMs available.

Hallucination Rates

Several models exceed 55% hallucination and even the best model reaches 21%, indicating a need for further research on mitigation.

Few-shot Prompting Limitations

The minimal gains from few-shot prompting suggest that current prompting techniques may not be sufficient to improve performance significantly.
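The "2 of 9 models by >1%" figure is a count of models whose few-shot score beats their zero-shot score by more than a threshold. A minimal sketch of that comparison, assuming the threshold is measured in percentage points of correctness (the summary does not say whether the gain is absolute or relative) and using hypothetical model names and scores:

```python
def improved_models(zero_shot, few_shot, min_gain_points=1.0):
    """List models whose few-shot score exceeds zero-shot by more than min_gain_points.

    Scores are percentages; treating the gain as percentage points is an assumption.
    """
    return [
        name
        for name in zero_shot
        if few_shot[name] - zero_shot[name] > min_gain_points
    ]


# Hypothetical scores for illustration only; not results from the paper.
zero_shot = {"model_a": 50.0, "model_b": 40.0, "model_c": 30.0}
few_shot = {"model_a": 52.5, "model_b": 40.5, "model_c": 29.0}
print(improved_models(zero_shot, few_shot))
```

With these toy numbers only model_a clears the threshold: model_b gains half a point and model_c gets worse.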

Expert Commentary

The article presents a rigorous and well-reasoned evaluation of LLMs' capabilities in understanding Islamic law. IslamicLegalBench is a significant contribution to the field, providing a systematic framework for assessing Islamic legal reasoning in AI. The findings reveal critical limitations in the models' performance and highlight the risks of relying on LLMs for religious guidance. The high hallucination rates and the minimal gains from few-shot prompting underscore the need for further research and development to improve the accuracy and reliability of these models. The comprehensive evaluation makes the study a valuable resource for both academic and practical applications. However, the limited selection of models and the focus on few-shot prompting suggest areas for future research and improvement. Overall, the article offers a balanced and objective analysis that is professionally written and suitable for publication in a premium legal or academic journal.

Recommendations

✓ Developers of LLMs should invest in improving the models' foundational knowledge of Islamic law to reduce hallucination rates and enhance accuracy.
✓ Because few-shot prompting improved only 2 of 9 models by >1%, researchers should explore more advanced prompting techniques and other methods to improve LLM performance in understanding and reasoning about Islamic law.

Sources

arXiv - cs.AI
