SalamaBench: Toward Standardized Safety Evaluation for Arabic Language Models

arXiv:2603.04410v1 Announce Type: new Abstract: Safety alignment in Language Models (LMs) is fundamental for trustworthy AI. However, while different stakeholders are trying to leverage Arabic Language Models (ALMs), systematic safety evaluation of ALMs remains largely underexplored, limiting their mainstream uptake. Existing safety benchmarks and safeguard models are predominantly English-centric, limiting their applicability to Arabic Natural Language Processing (NLP) systems and obscuring fine-grained, category-level safety vulnerabilities. This paper introduces SalamaBench, a unified benchmark for evaluating the safety of ALMs, comprising 8,170 prompts across 12 different categories aligned with the MLCommons Safety Hazard Taxonomy. Constructed by harmonizing heterogeneous datasets through a rigorous pipeline involving AI filtering and multi-stage human verification, SalamaBench enables standardized, category-aware safety evaluation. Using this benchmark, we evaluate five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, under multiple safeguard configurations, including individual guard models, majority-vote aggregation, and validation against human-annotated gold labels. Our results reveal substantial variation in safety alignment: while Fanar 2 achieves the lowest aggregate attack success rates, its robustness is uneven across specific harm domains. In contrast, Jais 2 consistently exhibits elevated vulnerability, indicating weaker intrinsic safety alignment. We further demonstrate that native ALMs perform substantially worse than dedicated safeguard models when acting as safety judges. Overall, our findings highlight the necessity of category-aware evaluation and specialized safeguard mechanisms for robust harm mitigation in ALMs.

Executive Summary

This article introduces SalamaBench, a unified benchmark for evaluating the safety of Arabic Language Models (ALMs). The authors constructed the benchmark by harmonizing heterogeneous datasets through a rigorous pipeline involving AI filtering and multi-stage human verification. Using SalamaBench, they evaluated five state-of-the-art ALMs, revealing substantial variation in safety alignment and highlighting the necessity of category-aware evaluation and specialized safeguard mechanisms for robust harm mitigation in ALMs. The study's findings have significant implications for the development and deployment of trustworthy AI in Arabic NLP systems.
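One of the safeguard configurations the abstract describes is majority-vote aggregation across several guard models. As an illustration only, here is a minimal sketch of how such an aggregation step might look; the function name, the label strings, and the conservative tie-breaking rule are all hypothetical, not the authors' implementation:

```python
from collections import Counter

def majority_vote(verdicts):
    """Aggregate per-guard safety verdicts ('safe'/'unsafe') by majority vote.

    verdicts: one label per guard model (hypothetical label scheme).
    Ties are resolved conservatively as 'unsafe'.
    """
    counts = Counter(verdicts)
    return "unsafe" if counts["unsafe"] >= counts["safe"] else "safe"

# Three hypothetical guard models judging one model response:
print(majority_vote(["safe", "unsafe", "unsafe"]))  # prints "unsafe"
print(majority_vote(["safe", "safe", "unsafe"]))    # prints "safe"
```

Resolving ties toward "unsafe" is one plausible design choice for a safety pipeline, since a false refusal is usually cheaper than a missed harm; the paper does not specify its tie-breaking rule.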

Key Points

  • SalamaBench is a unified benchmark for evaluating the safety of Arabic Language Models (ALMs).
  • The benchmark is constructed by harmonizing heterogeneous datasets through a rigorous pipeline involving AI filtering and multi-stage human verification.
  • The study evaluates five state-of-the-art ALMs, revealing substantial variation in safety alignment.
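The headline metric in the abstract is the attack success rate (ASR), reported per harm category. A small sketch of how a category-wise ASR might be tallied (the data layout and function name are hypothetical illustrations, not the paper's code):

```python
from collections import defaultdict

def attack_success_rates(results):
    """Compute per-category attack success rate (ASR).

    results: iterable of (category, attack_succeeded) pairs, where
    attack_succeeded is True when the model produced an unsafe response
    to that category's adversarial prompt.
    """
    hits = defaultdict(int)    # unsafe responses per category
    totals = defaultdict(int)  # prompts per category
    for category, succeeded in results:
        totals[category] += 1
        hits[category] += int(succeeded)
    return {c: hits[c] / totals[c] for c in totals}

# Toy results over two hypothetical hazard categories:
results = [
    ("hate", True), ("hate", False),
    ("self-harm", False), ("self-harm", False),
]
print(attack_success_rates(results))  # {'hate': 0.5, 'self-harm': 0.0}
```

A lower ASR indicates stronger safety alignment; breaking it down by category is what lets the authors observe that Fanar 2's low aggregate ASR masks uneven robustness across specific harm domains.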

Merits

Standardized, category-aware evaluation

The study's use of a unified benchmark enables standardized, category-aware safety evaluation, which is a significant improvement over existing English-centric safety benchmarks.

Methodological rigor

The authors employed a rigorous pipeline involving AI filtering and multi-stage human verification to construct SalamaBench, supporting the accuracy and reliability of the benchmark.

Relevance to real-world applications

The study's findings have significant implications for the development and deployment of trustworthy AI in Arabic NLP systems, which is crucial for applications such as language translation, sentiment analysis, and text generation.

Demerits

Limited model coverage

The study only evaluated five state-of-the-art ALMs, which may not be representative of the broader landscape of ALMs available in the market.

Scalability

Constructing SalamaBench required a labor-intensive pipeline of AI filtering and multi-stage human verification, which may not scale readily to larger datasets or additional harm categories.

Expert Commentary

The study's contribution to the field of Arabic NLP is significant, as it addresses a critical gap in the literature on safety evaluation of ALMs. However, its limitations, notably the small set of evaluated models and the labor-intensive construction pipeline, should be acknowledged and addressed in future work. Additionally, the findings on safety alignment in ALMs have broader implications for the development of trustworthy AI in general, and highlight the need for further research in this area.

Recommendations

  • Future studies should aim to evaluate a larger and more diverse set of ALMs to better understand the safety alignment of these models.
  • Developers should consider using SalamaBench or similar benchmarks to evaluate the safety of their ALMs before deploying them in real-world applications.