MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models
arXiv:2602.12871v1
Abstract: We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks largely rely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as a golden-standard logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling low-noise and interpretable evaluation. Our experiments show that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate confidence in diagnostic decision-making when distinguishing between clinically overlapping disorders. These findings reveal evaluation gaps not captured by existing benchmarks.
Executive Summary
The article introduces MentalBench, a benchmark designed to evaluate the psychiatric diagnostic capabilities of large language models (LLMs). Unlike existing benchmarks that rely largely on social media data, MentalBench is built around MentalKG, a knowledge graph constructed and validated by psychiatrists that encodes DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. From this backbone, the benchmark generates 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity. The study finds that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate their confidence when distinguishing between clinically overlapping disorders, exposing gaps in current evaluation methods.
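To make the idea of a rule-based logical backbone concrete, the following is a minimal, hypothetical sketch of encoding diagnostic criteria and a differential (exclusion) rule as a small graph-like structure. The class names, criteria, thresholds, and disorder entries are illustrative assumptions, not MentalKG's actual schema or contents.

```python
# Hypothetical sketch of a DSM-5-style diagnostic rule backbone.
# Disorder names, criteria, and thresholds are illustrative only,
# not the actual contents of MentalKG.
from dataclasses import dataclass, field


@dataclass
class DiagnosticRule:
    disorder: str
    criteria: list[str]                                 # symptom criteria (nodes)
    min_criteria: int                                   # how many criteria must be met
    excludes: list[str] = field(default_factory=list)   # differential/exclusion edges


RULES = [
    DiagnosticRule(
        disorder="Major Depressive Disorder",
        criteria=["depressed_mood", "anhedonia", "sleep_change",
                  "fatigue", "worthlessness"],
        min_criteria=3,
        excludes=["manic_episode_history"],   # differential rule vs. bipolar disorders
    ),
    DiagnosticRule(
        disorder="Bipolar I Disorder",
        criteria=["manic_episode_history", "elevated_mood", "decreased_sleep_need"],
        min_criteria=2,
    ),
]


def candidate_diagnoses(observed: set[str]) -> list[str]:
    """Return disorders whose criteria threshold is met and whose
    exclusion criteria are absent from the observed findings."""
    hits = []
    for rule in RULES:
        met = sum(c in observed for c in rule.criteria)
        excluded = any(e in observed for e in rule.excludes)
        if met >= rule.min_criteria and not excluded:
            hits.append(rule.disorder)
    return hits


if __name__ == "__main__":
    case = {"depressed_mood", "anhedonia", "fatigue"}
    print(candidate_diagnoses(case))  # -> ['Major Depressive Disorder']
```

A backbone of this general form is what makes the derived cases "low-noise": the ground-truth label of each synthetic case follows deterministically from the rules rather than from annotator judgment.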
Key Points
- ▸ Introduction of MentalBench for evaluating psychiatric diagnostic capabilities of LLMs.
- ▸ Use of MentalKG, a psychiatrist-built knowledge graph, as a gold-standard logical backbone.
- ▸ Generation of 24,750 synthetic clinical cases to assess diagnostic performance.
- ▸ LLMs perform well on structured queries but struggle with confidence calibration in complex cases (see the calibration sketch after this list).
- ▸ Identification of evaluation gaps not captured by existing benchmarks.
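The paper's summary does not spell out how miscalibration is measured, so the following is a minimal sketch of one standard metric, expected calibration error (ECE), under the assumption that the model reports a confidence score with each diagnosis. The binning scheme and the example data are illustrative placeholders, not MentalBench results.

```python
# Minimal expected-calibration-error (ECE) sketch. The confidences and
# correctness flags below are made-up placeholders, not benchmark results.
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Bin predictions by stated confidence, compare mean confidence to
    empirical accuracy in each bin, and return the weighted gap."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        acc = correct[in_bin].mean()        # empirical accuracy in the bin
        conf = confidences[in_bin].mean()   # mean stated confidence in the bin
        ece += in_bin.mean() * abs(acc - conf)
    return float(ece)


if __name__ == "__main__":
    # A model that is confidently wrong on overlapping disorders shows a
    # large gap between stated confidence and actual accuracy.
    conf = [0.95, 0.9, 0.85, 0.9, 0.6, 0.55]
    hit = [1, 0, 0, 1, 1, 0]
    print(round(expected_calibration_error(conf, hit), 3))
```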
Merits
Comprehensive Benchmark
MentalBench provides a rigorous, systematic way to evaluate LLMs' psychiatric diagnostic capabilities, addressing a gap left by prior mental health benchmarks that rely largely on social media data.
Expert-Validated Knowledge Graph
The use of MentalKG, built and validated by psychiatrists, ensures that the benchmark is grounded in clinically relevant and accurate diagnostic criteria.
Interpretable Evaluation
The synthetic clinical cases generated by MentalBench enable low-noise and interpretable evaluation, providing clear insights into LLMs' diagnostic decision-making processes.
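To illustrate how rule-derived cases could vary in information completeness while keeping a noise-free label, here is a hypothetical generation sketch. The disorder, criteria, threshold, and sampling scheme are assumptions for illustration, not the paper's actual procedure.

```python
# Hypothetical sketch of generating synthetic cases from rule-based criteria,
# varying how many defining criteria are revealed ("information completeness").
# The disorder, criteria, and threshold are illustrative placeholders.
import random
from dataclasses import dataclass


@dataclass
class DiagnosticRule:
    disorder: str
    criteria: list[str]
    min_criteria: int


MDD = DiagnosticRule(
    disorder="Major Depressive Disorder",
    criteria=["depressed_mood", "anhedonia", "sleep_change", "fatigue", "worthlessness"],
    min_criteria=3,
)


def generate_case(rule: DiagnosticRule, completeness: float, rng: random.Random) -> dict:
    """Reveal a fraction of the disorder's criteria; the ground-truth label
    follows deterministically from the rule, keeping evaluation low-noise."""
    k = max(1, round(completeness * len(rule.criteria)))
    observed = sorted(rng.sample(rule.criteria, k))
    label = rule.disorder if k >= rule.min_criteria else "insufficient information"
    return {"observed": observed, "completeness": completeness, "label": label}


if __name__ == "__main__":
    rng = random.Random(0)
    for completeness in (0.4, 0.7, 1.0):
        case = generate_case(MDD, completeness, rng)
        print(case["completeness"], case["observed"], "->", case["label"])
```

Sweeping the completeness parameter in this way is one plausible reading of how cases that "systematically vary in information completeness and diagnostic complexity" could be produced and later audited.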
Demerits
Limited Scope of Disorders
The benchmark currently covers only 23 psychiatric disorders, which may not fully represent the breadth of psychiatric conditions encountered in clinical practice.
Synthetic Cases May Not Reflect Real-World Complexity
While synthetic cases offer controlled evaluation, they may not fully capture the complexity and nuance of real-world clinical scenarios.
Potential Bias in Knowledge Graph
The knowledge graph, despite being expert-validated, may still contain biases or limitations inherent in the DSM-5 criteria or the experts' interpretations.
Expert Commentary
The introduction of MentalBench represents a significant advance in evaluating LLMs' psychiatric diagnostic capabilities. By pairing a psychiatrist-built knowledge graph with systematically generated synthetic clinical cases, the benchmark provides a robust and interpretable framework for assessing diagnostic decision-making. The finding that state-of-the-art LLMs handle structured DSM-5 knowledge queries well yet miscalibrate their confidence when differentiating clinically overlapping disorders points to where further work is needed before such models can be trusted in diagnostic workflows. This research addresses a real gap in the current literature and underscores the importance of interdisciplinary collaboration in AI development. Its practical implications are substantial, as it offers a path toward more rigorous vetting of AI tools for psychiatric diagnostics. At the same time, the study raises important ethical and policy considerations, emphasizing the need for regulatory frameworks to ensure the safe and effective use of AI in healthcare settings.
Recommendations
- ✓ Expand the scope of MentalBench to include a broader range of psychiatric disorders to better reflect real-world clinical scenarios.
- ✓ Incorporate real-world clinical data alongside synthetic cases to enhance the benchmark's validity and applicability.