Blending Human and LLM Expertise to Detect Hallucinations and Omissions in Mental Health Chatbot Responses

arXiv:2604.06216v1. Abstract: As LLM-powered chatbots are increasingly deployed in mental health services, detecting hallucinations and omissions has become critical for user safety. However, state-of-the-art LLM-as-a-judge methods often fail in high-risk healthcare contexts, where subtle errors can have serious consequences. We show that leading LLM judges achieve only 52% accuracy on mental health counseling data, with some hallucination detection approaches exhibiting near-zero recall. We identify the root cause as LLMs' inability to capture nuanced linguistic and therapeutic patterns recognized by domain experts. To address this, we propose a framework that integrates human expertise with LLMs to extract interpretable, domain-informed features across five analytical dimensions: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness. Experiments on a public mental health dataset and a new human-annotated dataset show that traditional machine learning models trained on these features achieve 0.717 F1 on our custom dataset and 0.849 F1 on a public benchmark for hallucination detection, with 0.59-0.64 F1 for omission detection across both datasets. Our results demonstrate that combining domain expertise with automated methods yields more reliable and transparent evaluation than black-box LLM judging in high-stakes mental health applications.

Executive Summary

This article critically examines the limitations of LLM-as-a-judge methods for detecting hallucinations and omissions in mental health chatbot responses, revealing only 52% accuracy and near-zero recall for some hallucination detection approaches. It attributes this failure to LLMs' inability to discern the nuanced linguistic and therapeutic patterns that domain experts recognize. The authors propose a hybrid framework that integrates human expertise to extract interpretable, domain-informed features across five analytical dimensions. Traditional machine learning models trained on these features markedly improve hallucination detection (0.717 F1 on the authors' custom dataset, 0.849 F1 on a public benchmark) and reach 0.59-0.64 F1 for omission detection, supporting a more reliable and transparent alternative to black-box LLM judging in high-stakes mental health applications.

Key Points

  • LLM-as-a-judge methods are largely inadequate for detecting hallucinations and omissions in mental health chatbots, achieving only 52% accuracy.
  • The core issue is LLMs' inability to capture subtle linguistic and therapeutic nuances critical in mental health contexts.
  • A novel framework is proposed, integrating human expertise to define five analytical dimensions for feature extraction: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness.
  • Traditional machine learning models trained on these human-informed features demonstrate significantly improved performance in detecting both hallucinations and omissions.
  • The study advocates for hybrid human-LLM approaches to ensure more reliable and transparent evaluation in high-stakes healthcare AI.
  • The research utilizes both a public mental health dataset and a newly human-annotated dataset for experimental validation.
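
To make the pipeline concrete, the sketch below illustrates the general idea of the framework: compute a few interpretable, expert-motivated features from a chatbot response, then feed them to a simple classifier. All feature names, heuristics, and weights here are illustrative assumptions for exposition, not the authors' actual implementation (which trains traditional ML models such as those described in the paper on expert-defined features).

```python
# Hypothetical sketch of the hybrid idea: extract interpretable,
# domain-informed features from a chatbot response, then score them with a
# lightweight classifier. Word lists, features, and weights are invented
# for illustration only.

HEDGE_WORDS = {"might", "may", "possibly", "perhaps", "could"}
ABSOLUTE_WORDS = {"always", "never", "guaranteed", "definitely", "cure"}

def extract_features(response: str, context: str) -> dict:
    """Map a (response, conversation context) pair to interpretable features."""
    words = response.lower().split()
    ctx_words = set(context.lower().split())
    n = max(len(words), 1)
    return {
        # linguistic uncertainty: share of hedging terms
        "hedge_ratio": sum(w in HEDGE_WORDS for w in words) / n,
        # professional appropriateness: overconfident absolutes are a red flag
        "absolute_ratio": sum(w in ABSOLUTE_WORDS for w in words) / n,
        # entity-verification proxy: share of response tokens grounded in context
        "context_overlap": sum(w in ctx_words for w in words) / n,
    }

def flag_hallucination(features: dict, threshold: float = 0.5) -> bool:
    """Toy linear scorer standing in for a trained model (e.g. a random
    forest); the weights are made up for illustration."""
    score = (0.6 * features["absolute_ratio"] * 10
             + 0.4 * (1.0 - features["context_overlap"]))
    return score > threshold
```

In practice the paper's features are defined with domain experts and the classifier is learned from annotated data; the point of the sketch is that every feature remains individually inspectable, unlike a black-box LLM verdict.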

Merits

Addresses a Critical Gap

Directly tackles the pressing and under-addressed problem of ensuring safety and reliability in increasingly deployed LLM-powered mental health applications.

Rigorous Problem Identification

Clearly identifies the fundamental limitations of 'LLM-as-a-judge' in high-stakes domains, providing empirical evidence (52% accuracy, near-zero recall) for its inadequacy.

Innovative Hybrid Approach

Proposes a well-structured and intuitive framework that effectively marries human domain expertise with automated methods, moving beyond purely algorithmic solutions.

Interpretable Features

The focus on interpretable, domain-informed features (e.g., professional appropriateness) significantly enhances transparency and explainability, crucial for healthcare.

Empirical Validation

Demonstrates substantial performance improvements with the proposed method across multiple datasets, providing strong evidence for its efficacy.

Ethical Implications

Emphasizes user safety and ethical considerations, aligning with responsible AI development in sensitive sectors.

Demerits

Scalability of Human Annotation

The reliance on human experts for feature definition and data annotation, while effective, raises questions about the scalability and cost of this approach for continuous monitoring or very large datasets.

Generalizability Across Therapeutic Modalities

While 'mental health counseling data' is mentioned, the article doesn't explicitly detail the specific therapeutic modalities or theoretical orientations represented, which could influence feature relevance.

Definition of 'Domain Expert'

The criteria for 'domain expert' are not explicitly detailed. The quality and consistency of human input are paramount, and variations could impact feature extraction robustness.

Computational Overhead

Extracting features across five dimensions and then training traditional ML models may introduce computational overhead compared to direct LLM judging, though the accuracy and transparency gains arguably justify the cost.

Nuance of 'Omission'

Detecting omissions is inherently more challenging and subjective than detecting hallucinations. Although the proposed method improves on LLM judges, its omission-detection F1 scores (0.59-0.64) leave considerable room for improvement in this particularly difficult area.

Expert Commentary

This article offers a timely and important intervention in the burgeoning field of AI in mental health. The authors' rigorous demonstration of LLM-as-a-judge's severe limitations, particularly its inability to grasp therapeutic nuance, is a stark warning against naive deployment. Their proposed hybrid framework, which leverages human expertise to distill interpretable features, represents a significant methodological advance. It rightly prioritizes transparency and domain specificity, both paramount in healthcare. The performance gains underscore that for high-stakes applications, a purely algorithmic approach to safety validation is insufficient and potentially dangerous. While the scalability of human input remains a practical challenge, the foundational principle that human domain knowledge must explicitly guide AI evaluation in sensitive contexts is compelling. This work sets a benchmark for responsible AI development in mental health, demanding a shift from black-box evaluation to explainable, human-informed assurance mechanisms.

Recommendations

  • Future research should explore semi-supervised or active learning techniques to reduce the reliance on extensive human annotation while maintaining the quality of domain-informed feature extraction.
  • Investigate the generalizability of the proposed framework across diverse therapeutic modalities (e.g., CBT, psychodynamic, humanistic) and cultural contexts to ensure broad applicability.
  • Develop standardized guidelines and training protocols for 'domain experts' involved in annotation and feature definition to ensure the consistency and reliability of human input.
  • Explore the integration of real-time monitoring capabilities, where the human-informed detection system can flag potentially problematic chatbot responses for human review before delivery.
  • Conduct a comprehensive cost-benefit analysis of implementing this human-in-the-loop framework versus the risks associated with undetected errors from purely LLM-based judging in clinical settings.
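
The real-time monitoring recommendation above can be pictured as a simple gating step in front of the chatbot. The following is a minimal sketch under assumed interfaces: the `risk_score` would come from a detector such as the one the paper trains, and the threshold is an invented placeholder that a deployment would calibrate.

```python
# Illustrative human-in-the-loop gate: deliver low-risk responses directly,
# hold high-risk ones for human review before they reach the user.
# The detector interface and threshold are assumptions for this sketch.

from dataclasses import dataclass

@dataclass
class Verdict:
    deliver: bool
    reason: str

def gate_response(response: str, risk_score: float,
                  review_threshold: float = 0.5) -> Verdict:
    """Route a chatbot response based on a detector's risk score in [0, 1]."""
    if risk_score >= review_threshold:
        return Verdict(deliver=False, reason="held for human review")
    return Verdict(deliver=True, reason="risk below threshold")
```

Such a gate trades latency for safety on flagged responses only, which matters in clinical settings where an undetected hallucination is far costlier than a short review delay.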

Sources

Original: arXiv - cs.CL