TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

arXiv:2603.18008v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods--fluency metrics, preference tests, and generic dialogue benchmarks--fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that evaluates and improves therapy chatbots along two clinical pillars: fidelity and safety. Fidelity is measured using the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions. Safety is assessed using a multi-label annotation scheme, covering therapy-specific risks (e.g., failing to address harm or abuse). To mitigate bias and unreliability in LLM-based judges, we further release THERAPYJUDGEBENCH, a validation set of 116 dialogues with 1,270 expert ratings for auditing and calibration against licensed clinicians. THERAPYGYM also serves as a training harness: CTRS and safety-based rewards drive RL with configurable patient simulations spanning diverse symptom profiles. Models trained in THERAPYGYM improve on expert ratings, with average CTRS rising from 0.10 to 0.60 (and 0.16 to 0.59 under LLM judges). Our work enables scalable development of therapy chatbots that are faithful to evidence-based practice and safer in high-stakes use.

Executive Summary

This article presents TherapyGym, a framework for evaluating and improving the clinical fidelity and safety of therapy chatbots. The framework measures fidelity with the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions, and assesses safety with a multi-label annotation scheme covering therapy-specific risks. To mitigate bias and unreliability in LLM-based judges, the authors release TherapyJudgeBench, a validation set of 116 dialogues with 1,270 expert ratings for auditing judges against licensed clinicians. TherapyGym also serves as a training harness, using CTRS- and safety-based rewards to drive reinforcement learning with configurable patient simulations spanning diverse symptom profiles. Models trained in TherapyGym improve on expert ratings, with average CTRS rising from 0.10 to 0.60. These contributions address a critical gap in the development of therapy chatbots and carry implications for both their practical deployment and policy decisions regarding their use.
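To make the automated CTRS pipeline concrete, here is a minimal sketch of rubric-based session scoring. The item names, the `Turn` structure, and the `judge` callable are illustrative assumptions, not the paper's actual implementation; in practice the judge would be an LLM prompted with the CTRS rubric for each item.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical subset of CTRS items; the paper's exact item set
# and scale normalization are not reproduced here.
CTRS_ITEMS = ["agenda_setting", "guided_discovery", "homework_assignment"]

@dataclass
class Turn:
    speaker: str  # "therapist" or "patient"
    text: str

def score_session(session: List[Turn],
                  judge: Callable[[str, List[Turn]], float]) -> float:
    """Average per-item judge scores over a multi-turn session.

    `judge(item, session)` is assumed to return a score in [0, 1],
    e.g. from an LLM judge given the rubric text for that item.
    """
    scores = [judge(item, session) for item in CTRS_ITEMS]
    return sum(scores) / len(scores)
```

Averaging per-item scores into a single session-level number mirrors how the abstract reports an "average CTRS" for trained models.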

Key Points

  • TherapyGym is a framework for evaluating and improving the clinical fidelity and safety of therapy chatbots.
  • Fidelity is scored with the CTRS over multi-turn sessions; safety is assessed with a multi-label annotation scheme covering therapy-specific risks.
  • TherapyJudgeBench, a validation set of 116 dialogues with 1,270 expert ratings, audits LLM judges against licensed clinicians to mitigate bias and unreliability.
  • TherapyGym doubles as a training harness, driving reinforcement learning with CTRS- and safety-based rewards over configurable patient simulations.
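A multi-label safety scheme, as described above, can be sketched as a small annotation type. The label names below are hypothetical placeholders inspired by the abstract's example risk (failing to address harm or abuse); the paper's full taxonomy is not reproduced here.

```python
from dataclasses import dataclass, field

# Hypothetical label set standing in for the paper's safety taxonomy.
SAFETY_LABELS = [
    "missed_harm_disclosure",
    "missed_abuse_disclosure",
    "inappropriate_advice",
]

@dataclass
class SafetyAnnotation:
    """One annotator's multi-label safety judgment for a session."""
    labels: set = field(default_factory=set)

    def flag(self, label: str) -> None:
        # Restricting flags to the fixed label set keeps annotations
        # comparable across annotators and sessions.
        if label not in SAFETY_LABELS:
            raise ValueError(f"unknown safety label: {label}")
        self.labels.add(label)

    @property
    def is_safe(self) -> bool:
        return not self.labels
```

Because the scheme is multi-label, a single session can carry several risk flags at once, which a single pass/fail safety score could not express.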

Merits

Strength in Clinical Evaluation

The use of the CTRS and multi-label annotation scheme provides a robust and clinically relevant evaluation of therapy chatbots, addressing a critical gap in the field.

Potential for Scalable Development

TherapyGym enables the scalable development of therapy chatbots that are faithful to evidence-based practice and safer in high-stakes use.
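The abstract states that CTRS- and safety-based rewards drive the RL training loop. One simple way such signals could be combined into a scalar reward is a penalty-shaped sum; this shaping and the penalty weight are assumptions for illustration, not the paper's actual reward formulation.

```python
def therapy_reward(ctrs_score: float,
                   safety_violations: int,
                   safety_penalty: float = 1.0) -> float:
    """Combine fidelity and safety into one scalar RL reward.

    Hypothetical shaping: a CTRS score in [0, 1] minus a fixed
    penalty per flagged safety violation, so any violation can
    outweigh fidelity gains when safety_penalty >= 1.
    """
    return ctrs_score - safety_penalty * safety_violations
```

Setting `safety_penalty` at or above the maximum fidelity score encodes the "high-stakes" priority: a policy cannot trade a safety violation for a better CTRS score.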

Demerits

Limited Generalizability

Because the evaluation is built around the CTRS, a CBT-specific instrument, the findings may generalize poorly to other therapy modalities or other types of chatbots.

Need for Further Validation

While TherapyJudgeBench mitigates judge bias, its 116 dialogues constitute a modest validation set, and further clinical validation is necessary to establish the reliability and accuracy of the framework at scale.
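Auditing an LLM judge against expert ratings, as TherapyJudgeBench enables, typically reduces to an agreement statistic. Below is a minimal, dependency-free Spearman rank correlation between judge and clinician scores; it omits tie correction for brevity and is an illustrative audit statistic, not the paper's stated calibration method.

```python
from typing import List

def spearman_rho(judge_scores: List[float],
                 expert_scores: List[float]) -> float:
    """Spearman rank correlation between judge and expert ratings.

    Rank both score lists, then compute the Pearson correlation
    of the ranks (no tie correction, for brevity).
    """
    def ranks(xs: List[float]) -> List[float]:
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rj, re = ranks(judge_scores), ranks(expert_scores)
    n = len(rj)
    mj, me = sum(rj) / n, sum(re) / n
    cov = sum((a - mj) * (b - me) for a, b in zip(rj, re))
    sd_j = sum((a - mj) ** 2 for a in rj) ** 0.5
    sd_e = sum((b - me) ** 2 for b in re) ** 0.5
    return cov / (sd_j * sd_e)
```

A rho near 1.0 indicates the judge ranks sessions the way clinicians do; low or negative values would flag the judge as unreliable for that rubric.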

Expert Commentary

The article makes a significant contribution by addressing a critical gap in how therapy chatbots are evaluated and developed. Grounding evaluation in the CTRS and a therapy-specific safety taxonomy yields a clinically relevant framework, and TherapyJudgeBench offers a principled way to audit LLM judges against licensed clinicians. That said, the CBT-centric design may limit generalizability to other modalities and chatbot types, and the framework will require further clinical validation before high-stakes deployment. If these limitations are addressed, the approach could shape both the practical development and the policy regulation of therapy chatbots.

Recommendations

  • Further research should explore applications of TherapyGym in adjacent areas of digital mental health, such as human-AI collaboration in therapy.
  • The development of TherapyGym-like frameworks should be encouraged to ensure the safe and effective use of therapy chatbots in high-stakes applications.
