How LLMs Cite and Why It Matters: A Cross-Model Audit of Reference Fabrication in AI-Assisted Academic Writing and Methods to Detect Phantom Citations
arXiv:2603.03299v1 Announce Type: new Abstract: Large language models (LLMs) have been noted to fabricate scholarly citations, yet the scope of this behavior across providers, domains, and prompting conditions remains poorly quantified. We present one of the largest citation hallucination audits to date, in which 10 commercially deployed LLMs were prompted across four academic domains, generating 69,557 citation instances verified against three scholarly databases (namely, CrossRef, OpenAlex, and Semantic Scholar). Our results show that the observed hallucination rates span a fivefold range (between 11.4% and 56.8%) and are strongly shaped by model, domain, and prompt framing. Our results also show that no model spontaneously generates citations when unprompted, which suggests that hallucination is prompt-induced rather than intrinsic. We identify two practical filters: 1) multi-model consensus (agreement by more than 3 LLMs on the same work yields 95.6% accuracy, a 5.8-fold improvement), and 2) within-prompt repetition (more than 2 replications yields 88.9% accuracy). In addition, we present findings on generational model tracking, which reveal that improvements are not guaranteed when deploying newer LLMs, and on capacity scaling, which appears to reduce hallucination within model families. Finally, a lightweight classifier trained solely on bibliographic string features is developed to distinguish hallucinated citations from verified ones, achieving AUC 0.876 in cross-validation and 0.834 in LOMO generalization (without querying any external database). This classifier offers a pre-screening tool deployable at inference time.
Executive Summary
This study presents a comprehensive audit of citation hallucination in ten commercially deployed large language models (LLMs) across four academic domains, generating over 69,000 citation instances verified against CrossRef, OpenAlex, and Semantic Scholar. The findings reveal significant variability in hallucination rates—ranging from 11.4% to 56.8%—influenced by model type, domain specificity, and prompting conditions. Crucially, the research demonstrates that citation fabrication is prompt-induced rather than intrinsic, as no model spontaneously generates citations unprompted. The authors identify two effective detection filters: multi-model consensus (agreement by >3 LLMs on the same work yields 95.6% accuracy) and within-prompt repetition (>2 replications yields 88.9% accuracy). Additionally, a lightweight classifier trained on bibliographic string features achieves strong classification performance (AUC 0.876 in cross-validation), offering a scalable pre-screening tool. These results have critical implications for academic integrity, content verification, and AI-assisted writing workflows. The study advances empirical understanding of AI citation behavior and provides actionable mitigation strategies.
Key Points
- ▸ Hallucination rates vary significantly across models, domains, and prompts (11.4% to 56.8%).
- ▸ Citation fabrication is prompt-dependent, not inherent to the model itself.
- ▸ Two practical detection filters—multi-model consensus and within-prompt repetition—are validated with high accuracy.
- ▸ A lightweight classifier using bibliographic string features achieves strong detection performance without external database queries.
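The two filters validated in the study lend themselves to a simple implementation. The sketch below is illustrative, not the authors' code: the normalized citation keys, function names, and exact thresholds are assumptions, chosen to match the reported cutoffs (agreement by more than 3 models; recurrence in more than 2 within-prompt replications).

```python
from collections import Counter

def consensus_filter(citations_by_model, min_models=4):
    """Multi-model consensus: keep citations that at least `min_models`
    distinct models independently produced (the paper reports 95.6%
    accuracy when more than 3 LLMs agree on the same work).

    `citations_by_model` maps a model name to an iterable of normalized
    citation keys, e.g. lowercased "title|year" strings (a hypothetical
    normalization, not specified by the paper).
    """
    counts = Counter()
    for model, cites in citations_by_model.items():
        for key in set(cites):  # de-duplicate within a single model
            counts[key] += 1
    return {key for key, n in counts.items() if n >= min_models}

def repetition_filter(replicated_outputs, min_repeats=3):
    """Within-prompt repetition: keep citations that recur in at least
    `min_repeats` independent generations for the same prompt (the
    paper reports 88.9% accuracy for more than 2 replications)."""
    counts = Counter()
    for output in replicated_outputs:
        for key in set(output):  # de-duplicate within one generation
            counts[key] += 1
    return {key for key, n in counts.items() if n >= min_repeats}
```

Both filters are purely frequency-based, so they add no database queries at screening time; their cost is the extra model calls needed to obtain multiple opinions or replications.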
Merits
Strength
The study’s scale and methodology—auditing ten LLMs across multiple domains and databases—provide robust, generalizable insights into citation hallucination.
Strength
The identification of actionable detection mechanisms (consensus and repetition filters) offers practical tools for academic institutions and content platforms to mitigate AI citation fraud.
Strength
The classifier’s performance (AUC 0.876) demonstrates viability as a pre-screening mechanism, enhancing efficiency in verification workflows.
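Because the classifier uses only bibliographic string features, its input pipeline can run entirely offline. The feature set below is a hypothetical stand-in (the paper does not enumerate its features here); it illustrates the kind of surface signals such a pre-screener could compute before any database lookup.

```python
import re

def citation_features(citation: str) -> dict:
    """Extract simple surface features from a bibliographic string.

    Illustrative only: feature names and definitions are assumptions,
    not the study's actual feature set.
    """
    tokens = citation.split()
    digits = sum(ch.isdigit() for ch in citation)
    has_year = bool(re.search(r"\b(?:19|20)\d{2}\b", citation))
    lower = citation.lower()
    return {
        "n_chars": len(citation),
        "n_tokens": len(tokens),
        "digit_ratio": digits / max(len(citation), 1),
        "has_plausible_year": has_year,
        "n_commas": citation.count(","),
        "has_doi": "doi.org" in lower or lower.startswith("10."),
    }
```

A lightweight model (e.g. logistic regression or gradient-boosted trees) trained on such feature vectors could then score each citation at inference time, flagging likely fabrications for database verification.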
Demerits
Limitation
While filters are effective, they are reactive—mitigating hallucination rather than preventing it at the source, raising questions about long-term systemic solutions.
Expert Commentary
This audit represents a significant moment in the intersection of AI and academic scholarship. The scale of the study—examining ten LLMs across diverse domains with quantified hallucination rates—provides unprecedented empirical clarity. Importantly, the finding that hallucination is prompt-induced rather than intrinsic reframes the discourse: rather than viewing AI as inherently deceptive, we must recognize that prompt design is the catalyst. The detection filters identified—multi-model consensus and within-prompt repetition—are not merely technical fixes; they suggest a meaningful shift in verification protocols. The lightweight classifier's success without external database access is particularly noteworthy: it enables real-time, scalable detection without sacrificing efficiency. Moreover, the observation that newer models do not guarantee improved accuracy corrects a common misconception about the ML lifecycle: performance gains are neither linear nor guaranteed. This work elevates the conversation from reactive patchwork solutions to proactive, principled governance of AI in scholarly writing. It should inform curriculum development, journal policies, and AI deployment frameworks across universities and academic publishers alike.
Recommendations
- ✓ Universities and academic publishers should integrate automated detection mechanisms (consensus filters and repetition detection) into writing platforms as standard verification protocols.
- ✓ Developers of LLMs should incorporate citation integrity checks as part of model evaluation benchmarks, particularly for academic-facing applications.
- ✓ Research institutions should fund longitudinal studies to monitor evolving hallucination behaviors as models evolve, ensuring detection systems remain adaptive.