Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

arXiv:2602.14080v1. Abstract: Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95--98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.

Executive Summary

The article 'Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality' introduces a behavioral framework for evaluating the factual accuracy of large language models (LLMs) at the level of individual facts. It distinguishes failures caused by missing knowledge (empty shelves) from failures caused by limited access to facts the model has encoded (lost keys). The study presents WikiProfile, a benchmark constructed via an automated pipeline, to profile factual knowledge in LLMs. Across 4 million responses from 13 models, frontier systems such as GPT-5 and Gemini-3 encode 95--98% of the benchmark's facts, yet recall remains the bottleneck, especially for long-tail facts and reverse questions, and inference-time thinking recovers a substantial fraction of these failures. The authors conclude that future improvements may depend more on better recall mechanisms than on further scaling.
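To make the profiling idea concrete, the sketch below assigns a per-fact profile from the outcomes of three behavioral probes. This is a minimal illustration in Python: the probe names, the ProbeResults fields, and the decision order are assumptions made for clarity, not the paper's exact protocol.

    from dataclasses import dataclass
    from enum import Enum, auto


    class FactProfile(Enum):
        """Per-fact profile in the spirit of the paper's framework (illustrative)."""
        NOT_ENCODED = auto()             # "empty shelf": the fact is not stored in the parameters
        ENCODED_NOT_RECALLED = auto()    # "lost key": encoded but inaccessible even with thinking
        RECALLED_WITH_THINKING = auto()  # accessible only with inference-time computation
        DIRECTLY_RECALLED = auto()       # accessible from a direct question


    @dataclass
    class ProbeResults:
        """Hypothetical outcomes of behavioral probes for one fact."""
        encoded: bool          # e.g. the model recognizes or verifies the fact when shown it
        direct_recall: bool    # correct answer to a direct question, no thinking
        thinking_recall: bool  # correct answer when allowed inference-time "thinking"


    def profile_fact(probes: ProbeResults) -> FactProfile:
        """Assign a profile from probe outcomes (assumed decision order, for illustration)."""
        if not probes.encoded:
            return FactProfile.NOT_ENCODED
        if probes.direct_recall:
            return FactProfile.DIRECTLY_RECALLED
        if probes.thinking_recall:
            return FactProfile.RECALLED_WITH_THINKING
        return FactProfile.ENCODED_NOT_RECALLED


    # Example: a fact the model encodes but only retrieves when allowed to think.
    print(profile_fact(ProbeResults(encoded=True, direct_recall=False, thinking_recall=True)))
    # FactProfile.RECALLED_WITH_THINKING

The useful property of such a profile is that the same wrong answer to a direct question is scored differently depending on the other probes, which is precisely the empty-shelves versus lost-keys distinction.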

Key Points

  • Introduction of a behavioral framework that profiles factual knowledge at the level of individual facts.
  • Distinction between missing knowledge (empty shelves) and limited access to encoded facts (lost keys).
  • Presentation of WikiProfile, a new benchmark built by an automated, web-search-grounded pipeline.
  • Finding that frontier models encode 95--98% of facts, yet recall remains the main bottleneck, especially for long-tail facts and reverse questions.
  • Evidence that inference-time thinking improves recall, suggesting future gains may come from better use of encoded knowledge rather than from scaling.

Merits

Innovative Framework

The article introduces a novel framework that provides a more nuanced understanding of factual errors in LLMs by distinguishing between missing knowledge and limited access to encoded facts.

Comprehensive Benchmark

WikiProfile, the benchmark introduced in the paper, is constructed by an automated pipeline in which a prompted LLM grounded in web search generates fact-level items, giving a scalable way to profile factual knowledge in LLMs; a rough sketch of such a pipeline follows below.
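The sketch below drafts a single grounded item for an entity. The helpers search_web and call_llm are assumed placeholder interfaces, not the authors' tooling, and the flow is a guess at the general shape of such a pipeline rather than a description of WikiProfile's actual construction.

    from dataclasses import dataclass


    @dataclass
    class FactItem:
        """One benchmark item: an atomic fact plus a question targeting it."""
        entity: str
        fact: str
        question: str
        answer: str
        evidence: str  # web-search snippet used to ground the item


    def search_web(query: str) -> str:
        """Placeholder for a web-search backend returning a grounding snippet (assumed)."""
        raise NotImplementedError


    def call_llm(prompt: str) -> str:
        """Placeholder for a prompted LLM used to draft items (assumed)."""
        raise NotImplementedError


    def build_item(entity: str) -> FactItem:
        """Draft one grounded benchmark item for an entity (illustrative flow only)."""
        evidence = search_web(f"{entity} notable facts")
        draft = call_llm(
            "From the evidence below, state one atomic fact about the entity, then a "
            "question whose only correct answer is that fact.\n"
            f"Entity: {entity}\nEvidence: {evidence}\n"
            "Reply in exactly three lines:\nFACT: ...\nQUESTION: ...\nANSWER: ..."
        )
        # Assumes the draft follows the three-line format; a real pipeline would
        # validate the output, verify it against the evidence, and deduplicate items.
        fact, question, answer = (line.split(": ", 1)[1] for line in draft.splitlines()[:3])
        return FactItem(entity=entity, fact=fact, question=question,
                        answer=answer, evidence=evidence)

Any pipeline of this shape inherits whatever its search backend and drafting model get wrong, which is the concern raised under Demerits below.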

Empirical Evidence

The study presents empirical evidence from 4 million responses across 13 LLMs, providing strong support for the conclusions drawn.

Demerits

Limited Scope

The study focuses primarily on factual recall and may not fully capture other aspects of LLM performance, such as reasoning or contextual understanding.

Benchmark Construction

The automated pipeline for constructing WikiProfile relies on web search and prompted LLMs, which may introduce biases or inaccuracies.

Generalizability

The findings are based on specific models and may not be generalizable to all types of LLMs or applications.

Expert Commentary

The article presents a rigorous, well-reasoned analysis of factual errors in LLMs. Its central contribution is the framework distinguishing missing knowledge from limited access to encoded facts, backed by the WikiProfile benchmark and by evidence from 4 million responses across 13 LLMs. The empty-shelves versus lost-keys distinction gives a more actionable picture of factual errors than aggregate accuracy, which matters for improving the reliability and trustworthiness of deployed systems. The suggestion that future improvements may depend more on better recall mechanisms than on further scaling is particularly insightful and challenges the current emphasis on model size. That said, the reliance on an automated, LLM- and web-search-based pipeline for benchmark construction, and the biases it may introduce, should be kept in mind when interpreting the headline numbers. Overall, the article makes a significant contribution to the field and offers valuable insights for both practitioners and policymakers.

Recommendations

  • Further research should explore the generalizability of the findings to different types of LLMs and applications.
  • Developers should consider integrating the proposed framework and benchmark into their evaluation and improvement processes.
  • Policymakers should consider the implications of recall failures in AI systems and promote the use of robust evaluation methods.
