TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models
arXiv:2602.22827v1 Announce Type: new Abstract: This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by +10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.
Executive Summary
The article introduces TARAZ, a novel Persian short-answer question benchmark designed to evaluate the cultural competence of large language models (LLMs) in Persian. The authors argue that existing benchmarks, which rely on multiple-choice formats and English-centric metrics, are inadequate for capturing the morphological complexity and semantic nuances of Persian. TARAZ employs a hybrid approach combining rule-based morphological normalization with syntactic and semantic similarity modules, enabling robust soft-match scoring. The study evaluates 15 state-of-the-art models and demonstrates a 10% improvement in scoring consistency compared to exact-match baselines. The framework is publicly released to standardize the evaluation of cultural understanding in Persian and to support cross-cultural LLM research.
Key Points
- Introduction of TARAZ, a Persian-specific short-answer question benchmark for evaluating cultural competence in LLMs.
- Critique of existing benchmarks for their reliance on multiple-choice formats and English-centric metrics.
- Hybrid evaluation approach combining morphological normalization with syntactic and semantic similarity modules.
- Evaluation of 15 state-of-the-art models, showing a 10% improvement in scoring consistency.
- Public release of the TARAZ framework to standardize and support cross-cultural LLM research.
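To make the hybrid scoring idea above concrete, the following is a minimal, hypothetical sketch of soft-match scoring. The paper's actual normalization rules and semantic module are not specified here: the character mappings illustrate common Persian normalization (unifying Arabic and Persian variants), token-set overlap stands in for the syntactic component, and `difflib` is used as a self-contained stand-in for a real semantic similarity model.

```python
import difflib
import unicodedata

# Common Persian normalization: map Arabic character variants to their
# Persian counterparts (an illustrative subset, not the paper's rule set).
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc",   # ي -> ی
                                   "\u0643": "\u06a9",   # ك -> ک
                                   "\u0629": "\u0647"})  # ة -> ه

def normalize(text: str) -> str:
    """Rule-based normalization: unify Unicode forms, character variants,
    and whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(ARABIC_TO_PERSIAN)
    return " ".join(text.split())

def token_jaccard(a: str, b: str) -> float:
    """Syntactic similarity: Jaccard overlap of normalized token sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def soft_match(pred: str, gold: str, w_syn: float = 0.5) -> float:
    """Blend syntactic overlap with a character-level similarity proxy.
    difflib.SequenceMatcher stands in for the semantic module; a real
    system would use embedding similarity instead."""
    p, g = normalize(pred), normalize(gold)
    syn = token_jaccard(p, g)
    sem = difflib.SequenceMatcher(None, p, g).ratio()
    return w_syn * syn + (1 - w_syn) * sem

# Exact-match scoring would reject the Arabic-script spelling below,
# while soft matching scores it as identical after normalization.
score = soft_match("\u0643\u062a\u0627\u0628 \u062e\u0648\u0628",  # كتاب خوب
                   "\u06a9\u062a\u0627\u0628 \u062e\u0648\u0628")  # کتاب خوب
```

The weighting parameter `w_syn` is purely illustrative; the point is that the final score rewards answers that agree with the reference after morphological normalization even when their surface strings differ.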
Merits
Innovative Approach
The hybrid evaluation method is a significant advancement over existing benchmarks, as it captures semantic nuances and morphological complexities specific to Persian.
Comprehensive Evaluation
The study evaluates a wide range of state-of-the-art models, providing a robust comparison and demonstrating the effectiveness of the TARAZ framework.
Public Release
The public release of the TARAZ framework ensures reproducibility and encourages further research in cross-cultural LLM evaluation.
Demerits
Limited Scope
The focus on Persian, while valuable, limits the immediate applicability of the findings to other languages and cultures.
Model Selection
The selection of 15 models, while comprehensive, may not fully represent the diversity of LLMs available, potentially limiting the generalizability of the results.
Complexity of Implementation
The hybrid evaluation method, while effective, may be complex to implement and may require significant computational resources.
Expert Commentary
The introduction of the TARAZ framework represents a significant step forward in the evaluation of cultural competence in large language models. The hybrid approach, combining morphological normalization with syntactic and semantic similarity, addresses a critical gap in existing benchmarks, which often fail to capture the nuances of Persian. The study's rigorous evaluation of 15 state-of-the-art models demonstrates the framework's effectiveness, showing a 10% improvement in scoring consistency. The public release of the TARAZ framework is particularly noteworthy, as it provides a standardized tool for researchers and developers to build upon. However, the focus on Persian limits the immediate applicability of the findings to other languages, and the complexity of the hybrid evaluation method may pose challenges for implementation. Despite these limitations, the study's contributions are substantial and pave the way for further research in cross-cultural LLM evaluation. The implications for practical applications and policy are significant, as the development of culturally competent AI systems is essential for ensuring fairness and inclusivity in AI technologies.
Recommendations
- Expand the TARAZ framework to include other languages and cultures to broaden its applicability and impact.
- Conduct further research to simplify the implementation of the hybrid evaluation method, making it more accessible to a wider range of researchers and developers.