TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models

arXiv:2602.22827v1 Announce Type: new Abstract: This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by +10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.

Executive Summary

The paper introduces TARAZ, a Persian short-answer question benchmark designed to evaluate the cultural competence of large language models (LLMs) in Persian. The authors argue that existing benchmarks, which rely on multiple-choice formats and English-centric metrics, fail to capture Persian's morphological complexity and semantic nuance. TARAZ instead combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Evaluating 15 state-of-the-art open- and closed-source models, the authors report a 10% improvement in scoring consistency over exact-match baselines. The framework is publicly released to standardize the evaluation of cultural understanding in Persian and to support cross-cultural LLM research.

Key Points

  • Introduction of TARAZ, a Persian-specific short-answer question benchmark for evaluating cultural competence in LLMs.
  • Critique of existing benchmarks for their reliance on multiple-choice formats and English-centric metrics.
  • Hybrid evaluation approach combining morphological normalization with syntactic and semantic similarity modules.
  • Evaluation of 15 state-of-the-art models, showing a 10% improvement in scoring consistency.
  • Public release of the TARAZ framework to standardize and support cross-cultural LLM research.

Merits

Innovative Approach

The hybrid evaluation method is a significant advancement over existing benchmarks, as it captures semantic nuances and morphological complexities specific to Persian.
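To make the idea concrete, the following is a minimal sketch of a soft-match scorer in the spirit of the paper's hybrid evaluation. The normalization rules, the `soft_match` function, and the similarity threshold are illustrative stand-ins (the character mappings are common Persian orthographic normalizations, and `SequenceMatcher` substitutes for the paper's syntactic/semantic similarity module), not the authors' actual pipeline.

```python
from difflib import SequenceMatcher

# Common Persian orthographic normalizations: unify Arabic-vs-Persian
# codepoints and collapse the zero-width non-joiner. Real morphological
# normalization (as in the paper) would go further than this.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh  -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf  -> Persian keheh
    "\u200C": " ",       # zero-width non-joiner -> space
})

def normalize(text: str) -> str:
    """Apply character-level normalization and collapse whitespace."""
    return " ".join(text.translate(CHAR_MAP).split())

def soft_match(prediction: str, gold: str, threshold: float = 0.8) -> bool:
    """Exact match after normalization, else a surface-similarity fallback
    standing in for the hybrid syntactic/semantic similarity module."""
    p, g = normalize(prediction), normalize(gold)
    if p == g:
        return True
    return SequenceMatcher(None, p, g).ratio() >= threshold

# An answer written with Arabic codepoints matches the Persian gold form
# even though the raw strings differ byte-for-byte.
print(soft_match("علي", "علی"))  # True
```

A scorer of this shape credits answers that exact-match metrics would reject for purely orthographic reasons, which is the kind of surface-level failure the paper's +10% consistency gain targets.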

Comprehensive Evaluation

The study evaluates a wide range of state-of-the-art models, providing a robust comparison and demonstrating the effectiveness of the TARAZ framework.

Public Release

The public release of the TARAZ framework ensures reproducibility and encourages further research in cross-cultural LLM evaluation.

Demerits

Limited Scope

The focus on Persian, while valuable, limits the immediate applicability of the findings to other languages and cultures.

Model Selection

The selection of 15 models, while broad, may not fully represent the diversity of available LLMs, potentially limiting the generalizability of the results.

Complexity of Implementation

The hybrid evaluation method, while effective, may be complex to implement and may require significant computational resources.

Expert Commentary

The TARAZ framework represents a meaningful step forward in evaluating the cultural competence of large language models. Its hybrid approach, combining morphological normalization with syntactic and semantic similarity, addresses a critical gap in existing benchmarks, which often fail to capture the nuances of Persian. The evaluation of 15 state-of-the-art models demonstrates the framework's effectiveness, showing a 10% improvement in scoring consistency over exact-match baselines. The public release of the framework is particularly noteworthy, as it gives researchers and developers a standardized tool to build upon. However, the focus on Persian limits the immediate applicability of the findings to other languages, and the complexity of the hybrid evaluation method may pose implementation challenges. Despite these limitations, the study's contributions are substantial and pave the way for further research in cross-cultural LLM evaluation. The implications for practice and policy are notable, as culturally competent AI systems are essential for ensuring fairness and inclusivity in AI technologies.

Recommendations

  • Expand the TARAZ framework to include other languages and cultures to broaden its applicability and impact.
  • Conduct further research to simplify the implementation of the hybrid evaluation method, making it more accessible to a wider range of researchers and developers.
