Academic

Language Shapes Mental Health Evaluations in Large Language Models

arXiv:2603.06910v1 Announce Type: new Abstract: This study investigates whether large language models (LLMs) exhibit cross-linguistic differences in mental health evaluations. Focusing on Chinese and English, we examine two widely used models, GPT-4o and Qwen3, to assess whether prompt language systematically shifts mental health-related evaluations and downstream decision outcomes. First, we assess models' evaluative orientation toward mental health stigma using multiple validated measurement scales capturing social stigma, self-stigma, and professional stigma. Across all measures, both models produce higher stigma-related responses when prompted in Chinese than in English. Second, we examine whether these differences also manifest in two common downstream decision tasks in mental health. In a binary mental health stigma detection task, sensitivity to stigmatizing content varies across language prompts, with lower sensitivity observed under Chinese prompts. In a depression severity c

J
Jiayi Xu, Xiyang Hu
· · 1 min read · 22 views

arXiv:2603.06910v1 Announce Type: new Abstract: This study investigates whether large language models (LLMs) exhibit cross-linguistic differences in mental health evaluations. Focusing on Chinese and English, we examine two widely used models, GPT-4o and Qwen3, to assess whether prompt language systematically shifts mental health-related evaluations and downstream decision outcomes. First, we assess models' evaluative orientation toward mental health stigma using multiple validated measurement scales capturing social stigma, self-stigma, and professional stigma. Across all measures, both models produce higher stigma-related responses when prompted in Chinese than in English. Second, we examine whether these differences also manifest in two common downstream decision tasks in mental health. In a binary mental health stigma detection task, sensitivity to stigmatizing content varies across language prompts, with lower sensitivity observed under Chinese prompts. In a depression severity classification task, predicted severity also differs by prompt language, with Chinese prompts associated with more underestimation errors, indicating a systematic downward shift in predicted severity relative to English prompts. Together, these findings suggest that language context can systematically shape evaluative patterns in LLM outputs and shift decision thresholds in downstream tasks.

Executive Summary

This study examines the impact of language on mental health evaluations in large language models, specifically GPT-4o and Qwen3. The results show that language context systematically shapes evaluative patterns, with Chinese prompts producing higher stigma-related responses and lower sensitivity to stigmatizing content compared to English prompts. The study highlights the importance of considering language context in mental health evaluations and downstream decision tasks.

Key Points

  • Cross-linguistic differences in mental health evaluations exist in large language models
  • Chinese prompts produce higher stigma-related responses and lower sensitivity to stigmatizing content
  • Language context affects downstream decision tasks, including mental health stigma detection and depression severity classification

Merits

Novel Contribution

The study provides new insights into the impact of language on mental health evaluations in large language models

Methodological Rigor

The study uses multiple validated measurement scales and examines two widely used models

Demerits

Limited Generalizability

The study only examines two languages and two models, limiting the generalizability of the findings

Lack of Theoretical Framework

The study could benefit from a more comprehensive theoretical framework to explain the observed differences

Expert Commentary

The study's findings have significant implications for the development and deployment of AI systems in mental health evaluations. The existence of language bias and cross-linguistic differences highlights the need for culturally competent AI systems that can account for language context. Furthermore, the study's results underscore the importance of transparency and accountability in AI decision-making processes, particularly in high-stakes applications such as mental health evaluations. As AI systems become increasingly ubiquitous in healthcare, it is essential to prioritize the development of fair, transparent, and culturally competent systems that can provide accurate and effective support for diverse populations.

Recommendations

  • Conducting further research on the impact of language on mental health evaluations in AI systems
  • Developing and implementing guidelines for culturally competent AI system development and deployment
  • Encouraging interdisciplinary collaboration between AI researchers, mental health professionals, and cultural experts to develop more effective and inclusive AI systems

Sources