IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

arXiv:2603.23750v1 Announce Type: new Abstract: Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track comprises multiple question types to examine LLMs' capabilities in handling different aspects of Islamic knowledge. The benchmark underpins the public IslamicMMLU leaderboard for evaluating LLMs; we initially evaluate 26 LLMs, whose averaged accuracy across the three tracks ranges from 39.8% to 93.8% (the latter achieved by Gemini 3 Flash). The Quran track shows the widest span (32.4% to 99.3%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task revealing variable school-of-thought preferences across models. Arabic-specific models show mixed results, but all underperform frontier models. The evaluation code and leaderboard are made publicly available.
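The madhab bias detection task described in the abstract can be illustrated with a minimal sketch. The paper does not specify its exact method; the code below assumes a setup where each answer option on a Fiqh question corresponds to a different school's ruling, and bias is surfaced by tallying how often a model's chosen options align with each school. All function and variable names here are hypothetical.

```python
from collections import Counter

def madhab_preference(answers, option_madhabs):
    """Tally which school of jurisprudence a model's chosen options
    align with across a set of Fiqh questions.

    answers: list of chosen option indices, one per question
    option_madhabs: per-question lists mapping each option index
        to a madhab label
    Returns the fraction of answers aligned with each school.
    """
    counts = Counter(option_madhabs[q][a] for q, a in enumerate(answers))
    total = sum(counts.values())
    return {school: n / total for school, n in counts.items()}

# Toy example: two questions, each offering one ruling per school
labels = [["Hanafi", "Shafi'i", "Maliki", "Hanbali"],
          ["Maliki", "Hanafi", "Hanbali", "Shafi'i"]]
prefs = madhab_preference([0, 3], labels)
```

A model with no school-of-thought preference would produce a roughly uniform distribution; skew toward one school is the kind of variable preference the abstract reports.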

Executive Summary

This article introduces IslamicMMLU, a comprehensive benchmark for evaluating the performance of large language models (LLMs) on Islamic knowledge. The benchmark consists of 10,013 multiple-choice questions across three tracks: Quran, Hadith, and Fiqh, and underpins a public leaderboard. The authors evaluate 26 LLMs on it, observing a wide range of performance, from 39.8% to 93.8% accuracy. The study highlights the need for evaluations specific to Islamic knowledge and the potential biases of LLMs in handling different aspects of Islamic jurisprudence. The authors make the evaluation code and leaderboard publicly available, contributing to the development of accurate and reliable LLMs for Islamic knowledge applications.
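The leaderboard metric, accuracy averaged across the three tracks, can be sketched as follows. Whether the benchmark uses a macro average (each track weighted equally) or a micro average (each question weighted equally) is not stated in the abstract; this sketch assumes macro averaging, and all names are illustrative.

```python
def track_accuracy(predictions, gold):
    """Fraction of multiple-choice questions answered correctly."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def benchmark_score(per_track_results):
    """Per-track accuracy plus a macro average over the tracks,
    so each track counts equally regardless of question count."""
    accs = {track: track_accuracy(preds, gold)
            for track, (preds, gold) in per_track_results.items()}
    accs["average"] = sum(accs.values()) / len(per_track_results)
    return accs

# Toy example with three tiny tracks (predictions, gold answers)
results = {
    "Quran":  (["A", "B", "C"], ["A", "B", "D"]),
    "Hadith": (["A", "A"], ["A", "A"]),
    "Fiqh":   (["B", "C", "C", "D"], ["B", "C", "A", "D"]),
}
scores = benchmark_score(results)
```

Under macro averaging, the 2,013-question Quran track influences the final score as much as the two 4,000-question tracks, which matters when comparing models whose strengths differ by track.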

Key Points

  • IslamicMMLU is a comprehensive benchmark for evaluating LLMs on Islamic knowledge.
  • The benchmark consists of 10,013 multiple-choice questions across three tracks: Quran, Hadith, and Fiqh.
  • The study highlights the need for evaluations specific to Islamic knowledge and the potential biases of LLMs.

Merits

Comprehensive Coverage

IslamicMMLU covers a wide range of Islamic knowledge, including Quran, Hadith, and Fiqh, making it a valuable resource for evaluating LLMs.

Public Availability

The evaluation code and leaderboard are made publicly available, enabling researchers and developers to contribute to the development of accurate and reliable LLMs for Islamic knowledge applications.

Demerits

Limited Generalizability

The study focuses on a specific domain (Islamic knowledge) and may not be generalizable to other domains or applications.

Lack of Contextual Understanding

The study relies on multiple-choice questions, which may not adequately assess the contextual understanding and nuance of Islamic knowledge.

Expert Commentary

The introduction of IslamicMMLU marks a significant step towards evaluating LLMs on Islamic knowledge. However, the study's limitations, such as limited generalizability and the lack of contextual assessment inherent to multiple-choice formats, should be addressed in future research. Furthermore, the identification of biases in LLMs' handling of Islamic jurisprudence highlights the need for ongoing monitoring and evaluation of these tools. As the field of LLMs continues to evolve, it is essential to prioritize the development of accurate and reliable models that can handle diverse cultural and religious knowledge domains.

Recommendations

  • Future research should prioritize the development of domain-specific LLMs that can accurately and reliably process cultural and religious knowledge.
  • Developers and policymakers should continue to monitor and evaluate LLMs for biases and inaccuracies, ensuring that these tools do not perpetuate harm or misinformation.

Sources

Original: arXiv - cs.CL