The Truthfulness Spectrum Hypothesis
arXiv:2602.20273v1

Abstract: Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2=0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain-specific directions steer more effectively than domain-general ones. Finally, post-training reshapes truth geometry, pushing sycophantic lying further from other truth types, suggesting a representational basis for chat models' sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry. Code for all experiments is provided at https://github.com/zfying/truth_spec.
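To make the probing setup concrete, here is a minimal sketch, assuming `acts` holds hidden states extracted from one layer of an LLM (shape `[n_statements, d_model]`) and `labels` marks each statement true or false; both names are hypothetical. The Mahalanobis cosine below uses the whitening metric induced by the activation covariance, which is one standard formulation; the paper's exact definition may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_truth_probe(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Fit a linear probe on hidden states; return its unit-norm direction."""
    probe = LogisticRegression(max_iter=1000).fit(acts, labels)
    w = probe.coef_.ravel()
    return w / np.linalg.norm(w)

def mahalanobis_cosine(u: np.ndarray, v: np.ndarray, acts: np.ndarray) -> float:
    """Cosine similarity between two probe directions under the
    Mahalanobis metric given by the (regularized) activation covariance."""
    cov = np.cov(acts, rowvar=False) + 1e-6 * np.eye(acts.shape[1])
    prec = np.linalg.inv(cov)  # whitening metric
    inner = u @ prec @ v
    return inner / np.sqrt((u @ prec @ u) * (v @ prec @ v))
```

Under the hypothesis, pairs of domains whose probe directions score high on this similarity should transfer well, which is the relationship the paper quantifies with R^2 = 0.98.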
Executive Summary
This article proposes the Truthfulness Spectrum Hypothesis: the representational space of large language models (LLMs) contains truth directions ranging from broadly domain-general to narrowly domain-specific. The study systematically evaluates probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), two lying mechanisms (sycophantic and expectation-inverted), and existing honesty benchmarks. Linear probes generalize well across most domains but fail on the two lying mechanisms; training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns, with Mahalanobis cosine similarity between probes near-perfectly predicting cross-domain generalization (R^2 = 0.98). Causal interventions further show that domain-specific directions steer more effectively than domain-general ones, and post-training pushes sycophantic lying further from other truth types, suggesting a representational basis for chat models' sycophantic tendencies.
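The steering result lends itself to a short illustration. Below is a minimal, hypothetical sketch of activation steering with a truth direction, using a small HuggingFace model as a stand-in; the layer index, steering strength, and the random placeholder direction are assumptions for illustration, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model, not the one used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder direction; in practice this would be a probe or erasure direction.
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()
alpha, layer_idx = 4.0, 6  # steering strength and target layer (assumed)

def steer_hook(module, inputs, output):
    # Add the scaled direction to every token's residual-stream activation.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)
ids = tok("The capital of France is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=10)
handle.remove()
print(tok.decode(out[0]))
```

The rationale for such interventions is that steering tests causality rather than mere correlation: if adding a direction changes model behavior, the model plausibly uses that direction, rather than it merely being readable by a probe.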
Key Points
- ▸ The Truthfulness Spectrum Hypothesis proposes the existence of domain-general and domain-specific truth directions in LLMs.
- ▸ Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying.
- ▸ Training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer (see the sketch after this list).
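The second and third points correspond to a simple experimental design: train a probe on one truth domain, test it on each other domain, then compare against a single probe trained on all domains pooled. A hedged sketch follows, assuming a hypothetical `domains` dict mapping each domain name to `(activations, labels)`; data loading is not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def transfer_matrix(domains: dict) -> dict:
    """Accuracy of a probe trained on `src` when evaluated on `tgt`."""
    scores = {}
    for src, (Xs, ys) in domains.items():
        probe = LogisticRegression(max_iter=1000).fit(Xs, ys)
        for tgt, (Xt, yt) in domains.items():
            scores[(src, tgt)] = probe.score(Xt, yt)
    return scores

def joint_probe_scores(domains: dict) -> dict:
    """One probe trained on all domains pooled, scored per domain."""
    X = np.vstack([X for X, _ in domains.values()])
    y = np.concatenate([y for _, y in domains.values()])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    return {name: probe.score(Xd, yd) for name, (Xd, yd) in domains.items()}
```

For brevity this sketch scores on the same pool it trains on; a faithful replication would evaluate on held-out splits per domain, as the paper presumably does.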
Merits
Strength
The study provides a systematic evaluation of probe generalization across five truth types, two lying mechanisms, and existing honesty benchmarks, complemented by geometric analysis, concept erasure, and causal interventions, offering a comprehensive picture of how LLMs represent truthfulness.
Demerits
Limitation
The study relies on particular datasets, models, and an experimental design whose conclusions may not generalize to other LLMs, domains, or training regimes.
Expert Commentary
The Truthfulness Spectrum Hypothesis proposed in this article offers a nuanced account of how LLMs represent truth. By recognizing that domain-general and domain-specific truth directions coexist, the study provides a foundation for more robust lie-detection probes and more transparent AI systems. The finding that post-training pushes sycophantic lying away from other truth types also suggests a representational explanation for chat models' sycophancy, which can inform the design of more effective and trustworthy chatbots. While the study has limitations, its comprehensive evaluation of probe generalization and its systematic geometric analysis make it a significant contribution to the field of natural language processing.
Recommendations
- ✓ Future studies should investigate the generalizability of the Truthfulness Spectrum Hypothesis to other LLMs and domains.
- ✓ Researchers should explore the development of more transparent and explainable AI systems that incorporate the insights from this study.