Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

arXiv:2602.14080v1. Abstract: Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95--98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.

Executive Summary

The article 'Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality' introduces a behavioral framework for evaluating the factual accuracy of large language models (LLMs) at the level of individual facts. It distinguishes failures caused by missing knowledge (empty shelves) from failures caused by limited access to facts the model has encoded (lost keys). The study presents WikiProfile, a benchmark constructed via an automated pipeline, to profile factual knowledge in LLMs. Across 4 million responses from 13 models, frontier systems such as GPT-5 and Gemini-3 encode 95--98% of the benchmark's facts, yet recall remains the bottleneck, especially for long-tail facts and reverse questions, and inference-time thinking recovers a substantial fraction of these failures. The authors conclude that future improvements may depend more on better recall mechanisms than on further scaling.
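To make the profiling idea concrete, the sketch below assigns a per-fact profile from the outcomes of three behavioral probes. This is a minimal illustration in Python: the probe names, the ProbeResults fields, and the decision order are assumptions made for clarity, not the paper's exact protocol.

    from dataclasses import dataclass
    from enum import Enum, auto


    class FactProfile(Enum):
        """Per-fact profile in the spirit of the paper's framework (illustrative)."""
        NOT_ENCODED = auto()             # "empty shelf": the fact is not stored in the parameters
        ENCODED_NOT_RECALLED = auto()    # "lost key": encoded but inaccessible even with thinking
        RECALLED_WITH_THINKING = auto()  # accessible only with inference-time computation
        DIRECTLY_RECALLED = auto()       # accessible from a direct question


    @dataclass
    class ProbeResults:
        """Hypothetical outcomes of behavioral probes for one fact."""
        encoded: bool          # e.g. the model recognizes or verifies the fact when shown it
        direct_recall: bool    # correct answer to a direct question, no thinking
        thinking_recall: bool  # correct answer when allowed inference-time "thinking"


    def profile_fact(probes: ProbeResults) -> FactProfile:
        """Assign a profile from probe outcomes (assumed decision order, for illustration)."""
        if not probes.encoded:
            return FactProfile.NOT_ENCODED
        if probes.direct_recall:
            return FactProfile.DIRECTLY_RECALLED
        if probes.thinking_recall:
            return FactProfile.RECALLED_WITH_THINKING
        return FactProfile.ENCODED_NOT_RECALLED


    # Example: a fact the model encodes but only retrieves when allowed to think.
    print(profile_fact(ProbeResults(encoded=True, direct_recall=False, thinking_recall=True)))
    # FactProfile.RECALLED_WITH_THINKING

The useful property of such a profile is that the same wrong answer to a direct question is scored differently depending on the other probes, which is precisely the empty-shelves versus lost-keys distinction.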

Key Points

  • Introduction of a behavioral framework that profiles factual knowledge at the level of individual facts.
  • Distinction between missing knowledge (empty shelves) and limited access to encoded facts (lost keys).
  • Presentation of WikiProfile, a new benchmark built by an automated, web-search-grounded pipeline.
  • Finding that frontier models encode 95--98% of facts, yet recall remains the main bottleneck, especially for long-tail facts and reverse questions.
  • Evidence that inference-time thinking improves recall, suggesting future gains may come from better use of encoded knowledge rather than from scaling.

Merits

Innovative Framework

The article introduces a novel framework that provides a more nuanced understanding of factual errors in LLMs by distinguishing between missing knowledge and limited access to encoded facts.

Comprehensive Benchmark

WikiProfile, the benchmark introduced in the paper, is constructed by an automated pipeline in which a prompted LLM grounded in web search generates fact-level items, giving a scalable way to profile factual knowledge in LLMs; a rough sketch of such a pipeline follows below.
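The sketch below drafts a single grounded item for an entity. The helpers search_web and call_llm are assumed placeholder interfaces, not the authors' tooling, and the flow is a guess at the general shape of such a pipeline rather than a description of WikiProfile's actual construction.

    from dataclasses import dataclass


    @dataclass
    class FactItem:
        """One benchmark item: an atomic fact plus a question targeting it."""
        entity: str
        fact: str
        question: str
        answer: str
        evidence: str  # web-search snippet used to ground the item


    def search_web(query: str) -> str:
        """Placeholder for a web-search backend returning a grounding snippet (assumed)."""
        raise NotImplementedError


    def call_llm(prompt: str) -> str:
        """Placeholder for a prompted LLM used to draft items (assumed)."""
        raise NotImplementedError


    def build_item(entity: str) -> FactItem:
        """Draft one grounded benchmark item for an entity (illustrative flow only)."""
        evidence = search_web(f"{entity} notable facts")
        draft = call_llm(
            "From the evidence below, state one atomic fact about the entity, then a "
            "question whose only correct answer is that fact.\n"
            f"Entity: {entity}\nEvidence: {evidence}\n"
            "Reply in exactly three lines:\nFACT: ...\nQUESTION: ...\nANSWER: ..."
        )
        # Assumes the draft follows the three-line format; a real pipeline would
        # validate the output, verify it against the evidence, and deduplicate items.
        fact, question, answer = (line.split(": ", 1)[1] for line in draft.splitlines()[:3])
        return FactItem(entity=entity, fact=fact, question=question,
                        answer=answer, evidence=evidence)

Any pipeline of this shape inherits whatever its search backend and drafting model get wrong, which is the concern raised under Demerits below.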

Empirical Evidence

The study presents empirical evidence from 4 million responses across 13 LLMs, providing strong support for the conclusions drawn.

Demerits

Limited Scope

The study focuses primarily on factual recall and may not fully capture other aspects of LLM performance, such as reasoning or contextual understanding.

Benchmark Construction

The automated pipeline for constructing WikiProfile relies on web search and prompted LLMs, which may introduce biases or inaccuracies.

Generalizability

The findings are based on specific models and may not be generalizable to all types of LLMs or applications.

Expert Commentary

The article presents a rigorous, well-reasoned analysis of factual errors in LLMs. Its central contribution is the framework distinguishing missing knowledge from limited access to encoded facts, backed by the WikiProfile benchmark and by evidence from 4 million responses across 13 LLMs. The empty-shelves versus lost-keys distinction gives a more actionable picture of factual errors than aggregate accuracy, which matters for improving the reliability and trustworthiness of deployed systems. The suggestion that future improvements may depend more on better recall mechanisms than on further scaling is particularly insightful and challenges the current emphasis on model size. That said, the reliance on an automated, LLM- and web-search-based pipeline for benchmark construction, and the biases it may introduce, should be kept in mind when interpreting the headline numbers. Overall, the article makes a significant contribution to the field and offers valuable insights for both practitioners and policymakers.

Recommendations

  • Further research should explore the generalizability of the findings to different types of LLMs and applications.
  • Developers should consider integrating the proposed framework and benchmark into their evaluation and improvement processes.
  • Policymakers should consider the implications of recall failures in AI systems and promote the use of robust evaluation methods.
