IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions
arXiv:2602.21226v1 (cross-listed). Abstract: As millions of Muslims turn to LLMs like GPT, Claude, and DeepSeek for religious guidance, a critical question arises: Can these AI systems reliably reason about Islamic law? We introduce IslamicLegalBench, the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence, with 718 instances covering 13 tasks of varying complexity. Evaluation of nine state-of-the-art models reveals major limitations: the best model achieves only 68% correctness with 21% hallucination, while several models fall below 35% correctness and exceed 55% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%. Moderate-complexity tasks requiring exact knowledge show the highest errors, whereas high-complexity tasks display apparent competence through semantic reasoning. False premise detection indicates risky sycophancy, with 6 of 9 models accepting misleading assumptions at rates above 40%. These results highlight that prompt-based methods cannot compensate for missing foundational knowledge. IslamicLegalBench offers the first systematic framework to evaluate Islamic legal reasoning in AI, revealing critical gaps in tools increasingly relied on for spiritual guidance.
Executive Summary
The article 'IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions' introduces a benchmark for assessing how well large language models (LLMs) understand and reason about Islamic law. The study evaluates nine state-of-the-art LLMs across seven schools of Islamic jurisprudence, using 718 instances covering 13 tasks of varying complexity. Performance is weak: the best model reaches only 68% correctness with a 21% hallucination rate, while several models fall below 35% correctness and exceed 55% hallucination. The study highlights the risks of relying on LLMs for religious guidance and concludes that prompt-based methods cannot compensate for missing foundational knowledge.
Key Points
- ▸ Introduction of IslamicLegalBench as the first benchmark for evaluating LLMs' understanding of Islamic law.
- ▸ Evaluation of nine state-of-the-art LLMs across seven schools of Islamic jurisprudence.
- ▸ Significant limitations in models' performance: the best model achieves only 68% correctness with 21% hallucination, and several models fall below 35% correctness while exceeding 55% hallucination.
- ▸ Few-shot prompting provides minimal gains, improving only 2 of 9 models by >1%.
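- ▸ Moderate-complexity tasks requiring exact knowledge show the highest errors, and false premise detection reveals risky sycophancy: 6 of 9 models accept misleading assumptions at rates above 40%.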
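The headline figures above are rates over judged answers. A minimal Python sketch of how such rates could be aggregated follows. The label names (`correct`, `hallucinated`, `abstained`), the function names, and the toy data are assumptions made for illustration; the paper's actual scoring rubric may differ.

```python
from collections import Counter

# Hypothetical per-answer labels; the benchmark's real rubric may differ.
LABELS = ("correct", "hallucinated", "abstained")


def summarize(judgments):
    """Return the share of each label among one model's judged answers."""
    if not judgments:
        raise ValueError("no judgments to summarize")
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(judgments)
    return {label: counts[label] / total for label in LABELS}


def acceptance_rate(accepted_flags):
    """Share of false-premise prompts where the model accepted the premise."""
    if not accepted_flags:
        raise ValueError("no false-premise results")
    return sum(accepted_flags) / len(accepted_flags)


# Toy data invented for illustration only; not results from the paper.
toy_judgments = ["correct"] * 6 + ["hallucinated"] * 3 + ["abstained"]
rates = summarize(toy_judgments)
premise_rate = acceptance_rate([True, False, True, True])
```

Under this assumed scheme, correctness and hallucination need not sum to 100%, because an answer can also be an abstention or a refusal.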
Merits
Comprehensive Evaluation
The study provides a thorough and systematic evaluation of LLMs' capabilities in understanding Islamic law, covering 13 tasks of varying complexity across seven schools of jurisprudence.
Novel Benchmark
IslamicLegalBench is the first of its kind, offering a standardized framework for assessing Islamic legal reasoning in AI.
Critical Insights
The findings highlight significant limitations and risks associated with relying on LLMs for religious guidance, which is crucial for both developers and users.
Demerits
Limited Model Selection
The study evaluates only nine state-of-the-art models, which may not represent the full spectrum of LLMs available.
Hallucination Rates
Several models exceed 55% hallucination, which indicates a need for further research on mitigation.
Few-shot Prompting Limitations
The minimal gains from few-shot prompting (only 2 of 9 models improve by more than 1%) suggest that current prompting techniques may not be sufficient to improve performance significantly.
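The "2 of 9 models improve by more than 1%" comparison can be sketched as a simple threshold check. The snippet below assumes the gain is measured in absolute percentage points (the source does not say whether ">1%" is absolute or relative), and the model names and scores are placeholders, not figures from the paper.

```python
def models_improved(zero_shot, few_shot, threshold=1.0):
    """Names of models whose score rose by more than `threshold` points.

    Both arguments map a model name to a correctness score in percent.
    Models missing from `few_shot` are skipped.
    """
    return sorted(
        name
        for name in zero_shot
        if name in few_shot and few_shot[name] - zero_shot[name] > threshold
    )


# Placeholder scores in percent; hypothetical, not taken from the paper.
zero = {"model_a": 40.0, "model_b": 55.0, "model_c": 62.0}
few = {"model_a": 43.5, "model_b": 55.4, "model_c": 61.0}
improved = models_improved(zero, few)
```

With these placeholder numbers only `model_a` clears the threshold; `model_b` gains less than one point and `model_c` gets worse.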
Expert Commentary
The article presents a rigorous evaluation of LLMs' capabilities in understanding Islamic law. IslamicLegalBench is a significant contribution to the field, providing a systematic framework for assessing Islamic legal reasoning in AI. The findings reveal critical limitations in model performance and highlight the risks of relying on LLMs for religious guidance. The high hallucination rates and the minimal gains from few-shot prompting underscore the need for further research and development to improve the accuracy and reliability of these models. The comprehensive evaluation and critical insights make the study a valuable resource for both academic and practical applications. However, the limited selection of models and the focus on few-shot prompting techniques suggest areas for future research and improvement.
Recommendations
- ✓ Developers of LLMs should invest in improving the models' foundational knowledge of Islamic law to reduce hallucination rates and enhance accuracy.
- ✓ Researchers should explore advanced prompting techniques and other methods to improve the performance of LLMs in understanding and reasoning about Islamic law.