MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models
arXiv:2602.12871v1
Abstract: We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks largely rely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as a golden-standard logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling low-noise and interpretable evaluation. Our experiments show that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate confidence in diagnostic decision-making when distinguishing between clinically overlapping disorders. These findings reveal evaluation gaps not captured by existing benchmarks.
Executive Summary
The article introduces MentalBench, a benchmark designed to evaluate the psychiatric diagnostic capabilities of large language models (LLMs). Unlike existing benchmarks that rely largely on social media data, MentalBench is built around MentalKG, a knowledge graph constructed and validated by psychiatrists that encodes DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. From this backbone, the benchmark generates 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity. The study finds that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate their confidence when distinguishing between clinically overlapping disorders, exposing gaps in current evaluation methods.
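To make the idea of a rule-based logical backbone concrete, the following is a minimal, hypothetical sketch of encoding diagnostic criteria and a differential (exclusion) rule as a small graph-like structure. The class names, criteria, thresholds, and disorder entries are illustrative assumptions, not MentalKG's actual schema or contents.

```python
# Hypothetical sketch of a DSM-5-style diagnostic rule backbone.
# Disorder names, criteria, and thresholds are illustrative only,
# not the actual contents of MentalKG.
from dataclasses import dataclass, field


@dataclass
class DiagnosticRule:
    disorder: str
    criteria: list[str]                                 # symptom criteria (nodes)
    min_criteria: int                                   # how many criteria must be met
    excludes: list[str] = field(default_factory=list)   # differential/exclusion edges


RULES = [
    DiagnosticRule(
        disorder="Major Depressive Disorder",
        criteria=["depressed_mood", "anhedonia", "sleep_change",
                  "fatigue", "worthlessness"],
        min_criteria=3,
        excludes=["manic_episode_history"],   # differential rule vs. bipolar disorders
    ),
    DiagnosticRule(
        disorder="Bipolar I Disorder",
        criteria=["manic_episode_history", "elevated_mood", "decreased_sleep_need"],
        min_criteria=2,
    ),
]


def candidate_diagnoses(observed: set[str]) -> list[str]:
    """Return disorders whose criteria threshold is met and whose
    exclusion criteria are absent from the observed findings."""
    hits = []
    for rule in RULES:
        met = sum(c in observed for c in rule.criteria)
        excluded = any(e in observed for e in rule.excludes)
        if met >= rule.min_criteria and not excluded:
            hits.append(rule.disorder)
    return hits


if __name__ == "__main__":
    case = {"depressed_mood", "anhedonia", "fatigue"}
    print(candidate_diagnoses(case))  # -> ['Major Depressive Disorder']
```

A backbone of this general form is what makes the derived cases "low-noise": the ground-truth label of each synthetic case follows deterministically from the rules rather than from annotator judgment.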
Key Points
- ▸ Introduction of MentalBench for evaluating psychiatric diagnostic capabilities of LLMs.
- ▸ Use of MentalKG, a psychiatrist-built knowledge graph, as a gold-standard logical backbone.
- ▸ Generation of 24,750 synthetic clinical cases to assess diagnostic performance.
- ▸ LLMs perform well on structured queries but struggle with confidence calibration in complex cases (see the calibration sketch after this list).
- ▸ Identification of evaluation gaps not captured by existing benchmarks.
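The paper's summary does not spell out how miscalibration is measured, so the following is a minimal sketch of one standard metric, expected calibration error (ECE), under the assumption that the model reports a confidence score with each diagnosis. The binning scheme and the example data are illustrative placeholders, not MentalBench results.

```python
# Minimal expected-calibration-error (ECE) sketch. The confidences and
# correctness flags below are made-up placeholders, not benchmark results.
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Bin predictions by stated confidence, compare mean confidence to
    empirical accuracy in each bin, and return the weighted gap."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        acc = correct[in_bin].mean()        # empirical accuracy in the bin
        conf = confidences[in_bin].mean()   # mean stated confidence in the bin
        ece += in_bin.mean() * abs(acc - conf)
    return float(ece)


if __name__ == "__main__":
    # A model that is confidently wrong on overlapping disorders shows a
    # large gap between stated confidence and actual accuracy.
    conf = [0.95, 0.9, 0.85, 0.9, 0.6, 0.55]
    hit = [1, 0, 0, 1, 1, 0]
    print(round(expected_calibration_error(conf, hit), 3))
```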
Merits
Comprehensive Benchmark
MentalBench provides a rigorous, systematic way to evaluate LLMs' psychiatric diagnostic capabilities, addressing a gap left by prior mental health benchmarks that rely largely on social media data.
Expert-Validated Knowledge Graph
The use of MentalKG, built and validated by psychiatrists, ensures that the benchmark is grounded in clinically relevant and accurate diagnostic criteria.
Interpretable Evaluation
The synthetic clinical cases generated by MentalBench enable low-noise and interpretable evaluation, providing clear insights into LLMs' diagnostic decision-making processes.
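To illustrate how rule-derived cases could vary in information completeness while keeping a noise-free label, here is a hypothetical generation sketch. The disorder, criteria, threshold, and sampling scheme are assumptions for illustration, not the paper's actual procedure.

```python
# Hypothetical sketch of generating synthetic cases from rule-based criteria,
# varying how many defining criteria are revealed ("information completeness").
# The disorder, criteria, and threshold are illustrative placeholders.
import random
from dataclasses import dataclass


@dataclass
class DiagnosticRule:
    disorder: str
    criteria: list[str]
    min_criteria: int


MDD = DiagnosticRule(
    disorder="Major Depressive Disorder",
    criteria=["depressed_mood", "anhedonia", "sleep_change", "fatigue", "worthlessness"],
    min_criteria=3,
)


def generate_case(rule: DiagnosticRule, completeness: float, rng: random.Random) -> dict:
    """Reveal a fraction of the disorder's criteria; the ground-truth label
    follows deterministically from the rule, keeping evaluation low-noise."""
    k = max(1, round(completeness * len(rule.criteria)))
    observed = sorted(rng.sample(rule.criteria, k))
    label = rule.disorder if k >= rule.min_criteria else "insufficient information"
    return {"observed": observed, "completeness": completeness, "label": label}


if __name__ == "__main__":
    rng = random.Random(0)
    for completeness in (0.4, 0.7, 1.0):
        case = generate_case(MDD, completeness, rng)
        print(case["completeness"], case["observed"], "->", case["label"])
```

Sweeping the completeness parameter in this way is one plausible reading of how cases that "systematically vary in information completeness and diagnostic complexity" could be produced and later audited.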
Demerits
Limited Scope of Disorders
The benchmark currently covers only 23 psychiatric disorders, which may not fully represent the breadth of psychiatric conditions encountered in clinical practice.
Synthetic Cases May Not Reflect Real-World Complexity
While synthetic cases offer controlled evaluation, they may not fully capture the complexity and nuance of real-world clinical scenarios.
Potential Bias in Knowledge Graph
The knowledge graph, despite being expert-validated, may still contain biases or limitations inherent in the DSM-5 criteria or the experts' interpretations.
Expert Commentary
The introduction of MentalBench represents a significant advance in evaluating LLMs' psychiatric diagnostic capabilities. By pairing a psychiatrist-built knowledge graph with systematically generated synthetic clinical cases, the benchmark provides a robust and interpretable framework for assessing diagnostic decision-making. The finding that state-of-the-art LLMs handle structured DSM-5 knowledge queries well yet miscalibrate their confidence when differentiating clinically overlapping disorders points to where further work is needed before such models can be trusted in diagnostic workflows. This research addresses a real gap in the current literature and underscores the importance of interdisciplinary collaboration in AI development. Its practical implications are substantial, as it offers a path toward more rigorous vetting of AI tools for psychiatric diagnostics. At the same time, the study raises important ethical and policy considerations, emphasizing the need for regulatory frameworks to ensure the safe and effective use of AI in healthcare settings.
Recommendations
- ✓ Expand the scope of MentalBench to include a broader range of psychiatric disorders to better reflect real-world clinical scenarios.
- ✓ Incorporate real-world clinical data alongside synthetic cases to enhance the benchmark's validity and applicability.