
From Prerequisites to Predictions: Validating a Geometric Hallucination Taxonomy Through Controlled Induction


Matic Korun

arXiv:2603.00307v1 Announce Type: new Abstract: We test whether a geometric hallucination taxonomy -- classifying failures as center-drift (Type~1), wrong-well convergence (Type~2), or coverage gaps (Type~3) -- can distinguish hallucination types through controlled induction in GPT-2. Using a two-level statistical design with prompts ($N = 15$/group) as the unit of inference, we run each experiment 20 times with different generation seeds to quantify result stability. In static embeddings, Type~3 norm separation is robust (significant in 18/20 runs, Holm-corrected in 14/20, median $r = +0.61$). In contextual hidden states, the Type~3 norm effect direction is stable (19/20 runs) but underpowered at $N = 15$ (significant in 4/20, median $r = -0.28$). Types~1 and~2 do not separate in either space (${\leq}\,3/20$ runs). Token-level tests inflate significance by 4--16$\times$ through pseudoreplication -- a finding replicated across all 20 runs. The results establish coverage-gap hallucinations as the most geometrically distinctive failure mode, carried by magnitude rather than direction, and confirm the Type~1/2 non-separation as genuine at 124M parameters.

Executive Summary

The article tests the validity of a geometric hallucination taxonomy, which classifies failures as Type~1 (center-drift), Type~2 (wrong-well convergence), or Type~3 (coverage gaps), through controlled induction in GPT-2. Using a two-level statistical design with prompts (N = 15 per group) as the unit of inference, repeated across 20 generation seeds, the study quantifies how stable each geometric distinction is. Type~3 shows robust norm separation in static embeddings (median r = +0.61; significant in 18/20 runs, Holm-corrected in 14/20), confirming its geometric distinctiveness; in contextual hidden states the effect direction is stable (19/20 runs) but underpowered at N = 15 (significant in only 4/20 runs, median r = -0.28). Types~1 and~2 fail to separate in either space, suggesting genuine non-separability at 124M parameters. Token-level analyses inflate significance 4–16× through pseudoreplication, a critical methodological caveat replicated across all runs. The findings establish Type~3 as the most geometrically distinctive failure mode, carried by magnitude rather than direction, while clarifying the limits of Type~1/2 discrimination and informing future taxonomy development and evaluation protocols.
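The two-level design described above can be sketched in a few lines: token-level measurements are first collapsed to one summary per prompt, and the prompt-level summaries are then compared with a rank-based effect size. A minimal illustration (the toy data, function names, and group labels below are hypothetical, not the paper's):

```python
# Sketch of a two-level design: prompts, not tokens, are the unit of
# inference. All numbers here are made-up illustrative data.
from statistics import mean

def mann_whitney_u(a, b):
    """U statistic for group a vs b: count of pairs with a > b, ties as half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)

def rank_biserial(a, b):
    """Rank-biserial correlation r = 2U/(n1*n2) - 1, ranging over [-1, +1]."""
    return 2.0 * mann_whitney_u(a, b) / (len(a) * len(b)) - 1.0

# Level 1: collapse token-level norms to one summary per prompt
control_tokens = [[1.0, 1.2, 0.9], [1.1, 1.0], [0.8, 1.3, 1.1]]  # hypothetical
type3_tokens   = [[1.6, 1.8, 1.5], [1.7, 1.9], [1.4, 1.6, 1.8]]  # hypothetical

control = [mean(t) for t in control_tokens]  # N = number of prompts
type3   = [mean(t) for t in type3_tokens]

# Level 2: compare the prompt-level summaries
print(rank_biserial(type3, control))  # 1.0 (complete separation in toy data)
```

The sign and magnitude of r at the prompt level is what the paper's "median r" summaries report across the 20 seeded runs.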

Key Points

  • Type~3 coverage gaps show robust geometric distinction via norm separation in static embeddings
  • Type~1 and Type~2 fail to separate in either embedding space
  • Token-level pseudoreplication inflates significance, revealing a methodological risk

Merits

Statistical Robustness

The study employs a rigorous two-level design with repeated sampling to quantify stability, enhancing credibility of findings.
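The "Holm-corrected" counts in the abstract refer to the Holm-Bonferroni step-down procedure, which controls the family-wise error rate across the per-type tests without Bonferroni's full conservatism. A minimal sketch (the p-values below are made up for illustration):

```python
def holm_reject(pvals, alpha=0.05):
    """Holm-Bonferroni step-down: test p-values from smallest to largest
    against alpha/(m - k); stop at the first failure, since every larger
    p-value then fails as well."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break
    return reject

# three hypothetical per-type p-values from one run
print(holm_reject([0.010, 0.040, 0.030]))  # [True, False, False]
```

Note that 0.03 and 0.04 would each pass an uncorrected 0.05 threshold; the step-down thresholds (0.05/3, then 0.05/2) are what turn "significant in 18/20 runs" into the stricter "Holm-corrected in 14/20".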

Demerits

Power Limitation

The Type~3 analysis in contextual hidden states is underpowered at N = 15 prompts per group: the effect direction is stable (19/20 runs), but the test reaches significance in only 4/20 runs, limiting the ability to confirm the contextual effect.
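The power limitation can be checked with a quick simulation: with N = 15 per group and a true shift of about half a standard deviation (roughly the effect size a median rank-biserial r near 0.28 implies), a Mann-Whitney test rejects only a minority of the time. A sketch using the normal approximation to the U statistic (all parameters are illustrative assumptions, not the paper's data):

```python
import math
import random

def mann_whitney_z(a, b):
    """Normal-approximation z for the Mann-Whitney U statistic."""
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0
            for x in a for y in b)
    n1, n2 = len(a), len(b)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u - mu) / sigma

random.seed(0)
N, shift, runs = 15, 0.5, 2000  # assumed effect: 0.5 SD shift between groups
hits = sum(
    abs(mann_whitney_z([random.gauss(shift, 1) for _ in range(N)],
                       [random.gauss(0, 1) for _ in range(N)])) > 1.96
    for _ in range(runs)
)
# power lands well below the conventional 0.8 target at this N and effect size
print(f"approximate power at N=15: {hits / runs:.2f}")
```

Under these assumptions, significance in only a handful of 20 seeded runs is exactly what an underpowered but real effect would look like.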

Expert Commentary

This work represents a meaningful contribution to the empirical validation of hallucination taxonomies in large language models. The clear empirical differentiation of Type~3 hallucinations via geometric norm separation aligns with theoretical expectations and offers practical guidance for improving diagnostic accuracy in LLM failure analysis. However, the study’s underpowered contextual hidden state analysis—though acknowledged—requires careful interpretation; future work should expand sample sizes or use alternative inference methods to validate contextual effects with greater precision. Importantly, the acknowledgment of token-level pseudoreplication as a replicable artifact across runs demonstrates methodological transparency and rigor. This combination of empirical validation, methodological candor, and clear taxonomy delineation positions the paper as a benchmark for subsequent hallucination classification studies.

Recommendations

  • Adopt Type~3 as a primary indicator in hallucination diagnostic protocols
  • Design future studies with larger sample sizes or mixed-methods inference to better capture contextual effects
