ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
arXiv:2602.17054v1
Abstract: While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification. We introduce ALPS (Arabic Linguistic & Pragmatic Suite), a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics, capabilities that complement specialized large-scale benchmarks. While broad-coverage benchmarks prioritize scale and multi-task coverage, ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts. Evaluating 23 diverse models (commercial, open-source, and Arabic-native) against a single-pass human performance (avg. 84.6% accuracy) and an expert-adjudicated oracle (99.2%), we reveal a critical dissociation: models achieve high fluency but fail on fundamental morpho-syntactic dependencies, with elevated error rates on morpho-syntactic dependencies (36.5% across diacritics-reliant tasks) compared to compositional semantics. While top commercial models (Gemini-3-flash at 94.2%) surpass the average single human, a substantial gap persists between commercial giants and Arabic-native models, with the best Arabic-specific model (Jais-2-70B at 83.6%) approaching but not matching human performance.
Executive Summary
ALPS is a native, expert-curated diagnostic challenge set for Arabic linguistic and pragmatic reasoning, focused on deep semantics and pragmatics. It comprises 531 questions across 15 tasks and 47 subtasks. The authors evaluate 23 models (commercial, open-source, and Arabic-native) against a single-pass human baseline (84.6% average accuracy) and an expert-adjudicated oracle (99.2%). The results show a dissociation between fluency and competence on morpho-syntactic dependencies, with a 36.5% error rate across diacritics-reliant tasks. The top commercial model (Gemini-3-flash, 94.2%) exceeds the average single human but remains below the oracle, while the best Arabic-specific model (Jais-2-70B, 83.6%) approaches but does not match the human baseline.
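The comparisons in the abstract reduce to simple differences in percentage points. The sketch below only restates the four accuracy figures quoted there; the dictionary keys and the `gap` helper are illustrative names chosen here, not anything defined by the paper.

```python
# Accuracy figures (percent) exactly as quoted in the abstract.
# The key names are labels invented for this sketch.
REPORTED = {
    "human_single_pass_avg": 84.6,
    "expert_oracle": 99.2,
    "Gemini-3-flash": 94.2,
    "Jais-2-70B": 83.6,
}


def gap(a: str, b: str) -> float:
    """Return score(a) - score(b) in percentage points, rounded to one decimal."""
    return round(REPORTED[a] - REPORTED[b], 1)


if __name__ == "__main__":
    pairs = [
        ("Gemini-3-flash", "human_single_pass_avg"),
        ("Jais-2-70B", "human_single_pass_avg"),
        ("Gemini-3-flash", "Jais-2-70B"),
        ("expert_oracle", "Gemini-3-flash"),
    ]
    for a, b in pairs:
        print(f"{a} vs {b}: {gap(a, b):+.1f} points")
```

Read this way, the top commercial model sits 9.6 points above the average single human and 5.0 points below the oracle, while the best Arabic-specific model is 1.0 point below the human average and 10.6 points below the top commercial model.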
Key Points
- ▸ Introduction of ALPS, a diagnostic challenge set for Arabic linguistic and pragmatic reasoning
- ▸ Evaluation of 23 diverse models against human performance and an expert-adjudicated oracle
- ▸ A critical dissociation between models' fluency and their handling of morpho-syntactic dependencies (36.5% error rate across diacritics-reliant tasks)
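The dissociation in the last point rests on comparing error rates between groups of tasks. A minimal sketch of that kind of breakdown follows. It assumes each model answer has already been scored as correct or incorrect and tagged with a task group; the record layout, the group labels, and the toy data are all hypothetical and do not reflect the actual ALPS data format or results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_rate_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each task group to its error rate in [0, 1].

    `records` is an iterable of (group, is_correct) pairs.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        if not is_correct:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Toy, made-up records purely to exercise the function.
toy_records = [
    ("diacritics_reliant", True),
    ("diacritics_reliant", False),
    ("compositional_semantics", True),
    ("compositional_semantics", True),
]

if __name__ == "__main__":
    for group, rate in error_rate_by_group(toy_records).items():
        print(f"{group}: {rate:.1%} error rate")
```

Grouping by task type rather than reporting a single overall accuracy is what lets a benchmark of this kind expose weaknesses that an aggregate score would hide.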
Merits
Cultural Authenticity
According to the authors, the dataset was developed with deep expertise in Arabic linguistics, which they present as ensuring cultural authenticity and avoiding the translation artifacts found in synthetic or translated benchmarks.
Comprehensive Evaluation
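Evaluating 23 models spanning commercial, open-source, and Arabic-native systems gives a broad view of the current state of Arabic language models, anchored by both a single-pass human baseline and an expert-adjudicated oracle.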
Demerits
Limited Scope
ALPS deliberately targets deep semantics and pragmatics with 531 questions, so it complements rather than replaces broad-coverage benchmarks. Other important aspects of Arabic NLP fall outside its scope.
Gap between Commercial and Arabic-Native Models
The best Arabic-specific model (Jais-2-70B, 83.6%) trails the top commercial model (Gemini-3-flash, 94.2%) by 10.6 percentage points, which suggests that Arabic-native models need further development and investment.
Expert Commentary
ALPS is a significant step forward for Arabic NLP benchmarking. By evaluating 23 models against both a single-pass human baseline and an expert-adjudicated oracle, it gives a detailed picture of current capabilities and underlines the value of native, linguistically verified data. The central finding, that models are fluent yet fail on fundamental morpho-syntactic dependencies, shows that fluency is not a reliable proxy for linguistic competence. Further research is needed to close the gap between commercial and Arabic-native models and to make Arabic models more robust on diacritics-reliant tasks.
Recommendations
- ✓ Invest further in Arabic-native models to narrow the gap with commercial models.
- ✓ Build deeper linguistic verification and cultural authenticity into Arabic NLP benchmarks and models.
- ✓ Encourage policymakers to support more accurate and robust Arabic NLP models, particularly for morpho-syntactic dependencies.