ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
arXiv:2603.04992v1 Announce Type: new

Abstract: The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open-source benchmark comprising 1,954 malicious prompts written in Thai. The dataset covers both general harmful prompts and attacks that are explicitly grounded in Thai cultural, social, and contextual nuances. Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators. Our results show that closed-source models generally demonstrate stronger safety performance than open-source counterparts, raising important concerns regarding the robustness of openly available models. Moreover, we observe a consistently higher Attack Success Rate (ASR) for Thai-specific, culturally contextualized attacks compared to general Thai-language attacks, highlighting a critical vulnerability in current safety alignment methods. To improve reproducibility and cost efficiency, we further fine-tune a DeBERTa-based harmful response classifier, which we name ThaiSafetyClassifier. The model achieves a weighted F1 score of 84.4%, matching GPT-4.1 judgments. We publicly release the fine-tuning weights and training scripts to support reproducibility. Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation.

- ThaiSafetyBench HuggingFace Dataset: https://huggingface.co/datasets/typhoon-ai/ThaiSafetyBench
- ThaiSafetyBench Github: https://github.com/trapoom555/ThaiSafetyBench
- ThaiSafetyClassifier HuggingFace Model: https://huggingface.co/typhoon-ai/ThaiSafetyClassifier
- ThaiSafetyBench Leaderboard: https://huggingface.co/spaces/typhoon-ai/ThaiSafetyBench-Leaderboard
Executive Summary
This article introduces ThaiSafetyBench, an open-source benchmark of 1,954 malicious Thai-language prompts for assessing language model safety in Thai cultural contexts. The authors evaluate 24 language models and find that closed-source models generally demonstrate stronger safety performance than open-source counterparts. They also identify a critical vulnerability in current safety alignment: Thai-specific, culturally contextualized attacks achieve a consistently higher Attack Success Rate than general Thai-language attacks. To improve reproducibility and cost efficiency, the study further proposes ThaiSafetyClassifier, a fine-tuned DeBERTa-based harmful-response classifier that achieves a weighted F1 score of 84.4% against GPT-4.1 judgments. The authors release the benchmark, classifier weights, training scripts, and a public leaderboard to support reproducibility and community participation.
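The 84.4% figure is a *weighted* F1 score, i.e. per-class F1 averaged with each class's support as its weight. As a minimal sketch of what that metric means (the toy labels below are invented for illustration and are not the paper's data):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 scores averaged with class support as weights."""
    classes = sorted(set(y_true))
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (support[c] / total) * f1  # weight by class frequency
    return score

# Toy example: "true" labels are the GPT-4.1 judgments the classifier is scored against.
y_true = ["harmful", "harmful", "safe", "safe", "safe"]
y_pred = ["harmful", "safe", "safe", "safe", "harmful"]
print(f"weighted F1: {weighted_f1(y_true, y_pred):.3f}")  # 0.600
```

This matches `sklearn.metrics.f1_score(..., average="weighted")`; the hand-rolled version is shown only to make the weighting explicit.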
Key Points
- ▸ ThaiSafetyBench is an open-source benchmark for assessing language model safety in Thai cultural contexts
- ▸ Closed-source models demonstrate stronger safety performance than open-source counterparts
- ▸ Thai-specific, culturally contextualized attacks have a higher Attack Success Rate than general Thai-language attacks
- ▸ ThaiSafetyClassifier achieves a weighted F1 score of 84.4% in detecting harmful responses
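The ASR comparison above can be made concrete. Assuming ASR is defined in the standard way, as the fraction of prompts for which the judge labels the model's response harmful, a minimal sketch (the verdict lists are toy data, not results from the paper):

```python
def attack_success_rate(verdicts):
    """ASR = fraction of prompts whose response the judge deemed harmful."""
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if v == "harmful") / len(verdicts)

# Hypothetical per-prompt judge verdicts for the two attack categories.
general_thai = ["safe", "safe", "harmful", "safe"]
thai_cultural = ["harmful", "safe", "harmful", "harmful"]

print(f"general Thai ASR:   {attack_success_rate(general_thai):.2f}")   # 0.25
print(f"Thai-cultural ASR:  {attack_success_rate(thai_cultural):.2f}")  # 0.75
```

A lower ASR is better; the paper's finding is that the culturally contextualized subset yields consistently higher ASR than the general subset across the evaluated models.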
Merits
Strength in addressing linguistic and cultural gaps
The study fills a significant gap in the field by focusing on a non-English language and culturally grounded risks. Its methodology, combining general harmful prompts with attacks grounded in local cultural and social nuance, and pairing an LLM judge with a lightweight distilled classifier, could be adapted to other languages and cultural contexts, broadening the scope of language model safety evaluation.
Improved reproducibility and cost efficiency
The authors release the benchmark, classifier, and leaderboard, making it easier for researchers to reproduce and build upon their work. This facilitates collaboration and accelerates progress in language model safety research.
Demerits
Limited generalizability to other languages
While ThaiSafetyBench is a significant contribution, its effectiveness in other languages and cultural contexts is unknown. Further research is needed to adapt and validate the benchmark and classifier for diverse linguistic and cultural settings.
Dependence on specific model architectures
The proposed ThaiSafetyClassifier relies on a single DeBERTa-based architecture, and its 84.4% agreement with GPT-4.1 may not transfer to other classifier backbones or judge models. Further investigation is required to determine how sensitive the evaluation pipeline is to these architectural choices.
Expert Commentary
The study makes a notable contribution to language model safety research by addressing the linguistic and cultural gaps left by English-centric benchmarks. Its central finding carries weight for developers, researchers, and policymakers alike: safety alignment that holds up against general prompts can fail against culturally contextualized attacks, so evaluation must account for cultural and linguistic nuance rather than relying on translated English benchmarks. ThaiSafetyBench and ThaiSafetyClassifier can serve as a foundation for future work in this area, though their limitations, in particular untested transfer to other languages and classifier architectures, should be weighed carefully before the approach is generalized.
Recommendations
- ✓ Develop and adapt ThaiSafetyBench and ThaiSafetyClassifier for other languages and cultural contexts
- ✓ Investigate the applicability of ThaiSafetyBench and ThaiSafetyClassifier to a broader range of model architectures