
Unmasking the Factual-Conceptual Gap in Persian Language Models

Alireza Sakhaeirad, Ali Ma'manpoosh, Arshia Hemmat

arXiv:2602.17623v1 Abstract: While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs: arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model's ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.
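The abstract describes paired scenario verification but not how it is scored. The sketch below shows one plausible way such pairs could surface acquiescence bias; it is an illustration, not the authors' implementation. `PairedItem`, `model_accepts`, the yes/no prompt, and the `model.generate` interface are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class PairedItem:
    """One DivanBench-style pair: a norm-appropriate scenario and a clear violation."""
    appropriate: str  # behavior consistent with the custom; should be accepted
    violation: str    # clear breach of the same custom; should be rejected

def model_accepts(model, scenario: str) -> bool:
    """Ask a yes/no appropriateness question; treat a leading 'yes' as acceptance.
    Assumes a hypothetical model object exposing generate(prompt) -> str."""
    answer = model.generate(
        f"Is the following behavior culturally appropriate? Answer yes or no.\n{scenario}"
    )
    return answer.strip().lower().startswith("yes")

def acquiescence_rate(model, pairs: list[PairedItem]) -> float:
    """Fraction of clear violations the model wrongly accepts; a rate near 1.0
    means the model says 'yes' regardless of content (acquiescence bias)."""
    return sum(model_accepts(model, p.violation) for p in pairs) / len(pairs)

def paired_accuracy(model, pairs: list[PairedItem]) -> float:
    """Strict pairwise score: credit only when the model both accepts the
    appropriate scenario and rejects its matched violation."""
    correct = sum(
        model_accepts(model, p.appropriate) and not model_accepts(model, p.violation)
        for p in pairs
    )
    return correct / len(pairs)
```

Scoring the two halves jointly is what makes the bias visible: a model that always answers "yes" scores perfectly on appropriate scenarios, zero on violations, and zero on the strict pairwise metric.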

Executive Summary

This study examines the limitations of Persian language models in understanding cultural norms and social conventions. Introducing a diagnostic benchmark, DivanBench, the researchers evaluate seven Persian LLMs and reveal significant shortcomings, including acquiescence bias and a performance gap between factual knowledge and its application in scenarios. The findings suggest that current models learn to mimic cultural patterns without internalizing the underlying schemas, underscoring the need for more nuanced approaches to cultural competence. The work has clear implications for building culturally sensitive AI systems that account for the complexities of human social behavior.

Key Points

  • The study introduces DivanBench, a diagnostic benchmark for Persian language models to evaluate cultural competence.
  • The researchers reveal three critical failures of current Persian LLMs: acquiescence bias, amplification of that bias through continuous Persian pretraining, and a performance gap between retrieving factual knowledge and applying it in scenarios (a sketch of the gap computation follows this list).
  • The findings demonstrate that cultural competence requires more than scaling monolingual data, pointing to the need for more nuanced training and evaluation approaches.
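The 21% factual-conceptual gap in the second bullet is, at heart, a difference of per-task accuracies. A minimal sketch follows; the 85/64 correctness split is invented solely to reproduce a 21-point difference and is not a number reported by the paper.

```python
def factual_conceptual_gap(retrieval_correct: list[bool],
                           scenario_correct: list[bool]) -> float:
    """Retrieval accuracy minus situational-reasoning accuracy,
    in percentage points: how much knowing exceeds applying."""
    retrieval_acc = sum(retrieval_correct) / len(retrieval_correct)
    scenario_acc = sum(scenario_correct) / len(scenario_correct)
    return 100.0 * (retrieval_acc - scenario_acc)

# Illustrative only: 85/100 factual items correct vs. 64/100 scenario items.
gap = factual_conceptual_gap([True] * 85 + [False] * 15,
                             [True] * 64 + [False] * 36)
print(f"factual-conceptual gap: {gap:.0f} points")  # -> 21 points
```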

Merits

Innovative Benchmark Development

The study introduces a novel diagnostic benchmark, DivanBench, which provides a unique assessment of cultural competence in Persian language models.

Insightful Analysis of LLM Limitations

The researchers' evaluation of seven Persian LLMs reveals significant shortcomings, providing valuable insights into the limitations of current language models.

Demerits

Limited Generalizability

The study focuses on Persian language models, so its findings may not generalize to other languages or cultural contexts.

Methodological Limitations

The researchers rely on a single benchmark and only seven models, which may not be representative of the broader population of language models.

Expert Commentary

The study's findings carry significant implications for the development of culturally sensitive AI systems. The researchers' emphasis on more nuanced approaches to cultural competence highlights the importance of modeling the complexities of human social behavior in NLP. DivanBench provides a valuable tool for evaluating cultural competence in language models, and the study's insights into current models' limitations will inform the development of more culturally aware systems. That said, the focus on Persian may limit generalizability, and the methodological limitations noted above should be kept in mind when interpreting the results.

Recommendations

  • Researchers should prioritize the development of more nuanced approaches to cultural competence in language models, incorporating social norm understanding and cultural awareness into AI systems.
  • Developers should invest in the creation of culturally sensitive AI systems, considering the complexities of human social behavior and the need for more nuanced cultural understanding.
