The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

arXiv:2603.09989v1 (announce type: cross)

Abstract: We introduce the System Hallucination Scale (SHS), a lightweight and human-centered measurement instrument for assessing hallucination-related behavior in large language models (LLMs). Inspired by established psychometric tools such as the System Usability Scale (SUS) and the System Causability Scale (SCS), SHS enables rapid, interpretable, and domain-agnostic evaluation of factual unreliability, incoherence, misleading presentation, and responsiveness to user guidance in model-generated text. SHS is explicitly not an automatic hallucination detector or benchmark metric; instead, it captures how hallucination phenomena manifest from a user perspective under realistic interaction conditions. A real-world evaluation with 210 participants demonstrates high clarity, coherent response behavior, and construct validity, supported by statistical analysis including internal consistency (Cronbach's alpha = 0.87) and significant inter-dimension correlations (p < 0.001). Comparative analysis with SUS and SCS reveals complementary measurement properties, supporting SHS as a practical tool for comparative analysis, iterative system development, and deployment monitoring.

Executive Summary

The System Hallucination Scale (SHS) is a human-centered instrument for evaluating hallucination-related behavior in large language models (LLMs), modeled on established psychometric tools such as the System Usability Scale (SUS) and the System Causability Scale (SCS). Unlike automatic detection tools, SHS measures how hallucinations appear to users under realistic interaction conditions, offering a lightweight, interpretable, and domain-agnostic evaluation. A real-world study with 210 participants supports its validity through high internal consistency (Cronbach's alpha = 0.87) and significant inter-dimension correlations (p < 0.001). This positions SHS as a practical complement to existing evaluation methods for comparative analysis, iterative development, and deployment monitoring: it captures subjective user experience without replacing benchmark metrics.
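The internal-consistency figure cited above is a Cronbach's alpha. As a minimal sketch of how that statistic is computed from questionnaire data, the Python below applies the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name `cronbach_alpha` and the `ratings` matrix are hypothetical illustrations; they are not the SHS items or the study's data, and the source does not describe how the authors implemented the calculation.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one per participant),
    each row holding that participant's k item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two participants")
    # Sample variance of each item across participants
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each participant's total score
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 5-point Likert ratings: 5 participants x 4 items (made-up data)
ratings = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(ratings)
print(f"alpha = {alpha:.2f}")
```

Values closer to 1 mean the items vary together across participants, which is the sense in which a reported alpha of 0.87 indicates that the scale's items measure a shared construct.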

Key Points

  • SHS is a user-centric, lightweight tool for evaluating hallucination-related behavior in LLMs
  • Inspired by SUS and SCS, it focuses on subjective user perception rather than automatic detection
  • Evaluated in a real-world study with 210 participants, showing high internal consistency (Cronbach's alpha = 0.87) and significant inter-dimension correlations (p < 0.001)
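To make "inter-dimension correlations" concrete, the sketch below computes a correlation between two sets of per-participant dimension scores and the t statistic used to test it. It assumes Pearson's r, which the source does not confirm, and the dimension names and scores are made up for illustration. A p-value would come from comparing the t statistic against a t distribution with n - 2 degrees of freedom, which is left out here to keep the example dependency-free.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two lists of equal length, at least 3 values each")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical per-participant scores on two dimensions (made-up data)
factual_scores = [4.0, 2.5, 4.5, 3.0, 1.5]
coherence_scores = [3.5, 2.0, 5.0, 3.5, 2.0]

r = pearson_r(factual_scores, coherence_scores)
t = t_statistic(r, len(factual_scores))
print(f"r = {r:.2f}, t = {t:.2f}")
```

With a sample of 210 participants, as in the study, even moderate correlations between dimensions can reach significance, so the size of r matters as much as the p-value when judging how closely the dimensions are related.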

Merits

User-Centric Design

SHS evaluates hallucination phenomena from the user’s perspective, under realistic interaction conditions rather than in isolated benchmark settings.

Complementary Value

By offering a distinct perspective from automatic hallucination detectors and benchmark metrics, SHS enhances the diversity of evaluation tools available to researchers and developers.

Demerits

Scope Limitation

SHS is not an automatic hallucination detector or benchmark metric; because it relies on subjective user ratings, it may be of limited use in technical or algorithmic evaluation scenarios.

Interpretive Dependence

Results are contingent on user interpretation and may vary across different user demographics or interaction contexts.

Expert Commentary

The SHS shifts the evaluation of AI hallucination from algorithmic accuracy to user perception. As AI systems are integrated into critical domains such as legal analysis and healthcare, understanding how users experience hallucinations matters as much as detecting them technically. The SHS offers a lightweight, interpretable, and human-centered measure that complements existing quantitative metrics without displacing them. The reported Cronbach’s alpha of 0.87 indicates high internal consistency, and the comparative analysis with SUS and SCS shows complementary measurement properties rather than redundancy with those scales. The instrument also enables comparative analysis across different AI systems without requiring technical access to model architecture, which is a practical operational advantage. The tool may become a standard reference in user-centered AI evaluation, particularly for developers seeking to improve system usability and reduce the risk of user misinterpretation. Its relevance extends beyond academia into applied AI ethics, where transparency and user trust are paramount.

Recommendations

  • Adopt SHS as a supplementary evaluation metric in AI development pipelines for user-experience validation.
  • Encourage interdisciplinary collaboration between AI researchers, UX designers, and ethics experts to refine SHS for domain-specific applications.
