
LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs


arXiv:2602.16902v1 Announce Type: new Abstract: We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard are available at https://llmwikirace.github.io.
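
Concretely, the task reduces to a step-by-step navigation loop: at each turn the model sees the current page's outgoing links and must pick one, repeating until it reaches the target or exhausts a step budget. The sketch below is a minimal, illustrative version of such a loop, not the authors' actual harness; `get_outgoing_links` and `choose_next_link` are hypothetical callables standing in for a Wikipedia link lookup and an LLM link-selection prompt, respectively.

```python
from typing import Callable

def play_wikirace(
    source: str,
    target: str,
    get_outgoing_links: Callable[[str], list[str]],   # hypothetical: links on a page
    choose_next_link: Callable[[str, str, list[str]], str],  # hypothetical: LLM picks a link
    max_steps: int = 30,
) -> dict:
    """Navigate from `source` to `target` one hyperlink at a time."""
    current = source
    trajectory = [current]
    for _ in range(max_steps):
        if current == target:
            break
        links = get_outgoing_links(current)
        current = choose_next_link(current, target, links)
        trajectory.append(current)
    return {
        "success": current == target,
        "steps": len(trajectory) - 1,
        "trajectory": trajectory,
    }
```

Logging the full trajectory of visited pages is what makes the paper's failure analysis possible, since repeated visits to the same page can then be detected directly.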

Executive Summary

This article presents LLM-WikiRace, a benchmark for evaluating the planning, reasoning, and world knowledge capabilities of large language models (LLMs). The benchmark requires models to navigate Wikipedia hyperlinks to reach a target page from a given source, testing their ability to plan ahead and reason about how real-world concepts are connected. The authors evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5. While these models demonstrate superhuman performance on the easy level, performance drops sharply on the hard level, where the best-performing model, Gemini-3, succeeds in only 23% of games. The analysis shows that world knowledge is necessary but not sufficient: beyond a certain threshold, planning and long-horizon reasoning become the dominant factors, and even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-WikiRace thus offers a simple but revealing tool for probing the limitations of current LLMs and identifying directions for future research.
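
The trajectory-level findings (looping instead of recovering after a failed plan) can be made concrete with a small aggregation over logged games. The sketch below assumes each game record carries a difficulty label, a success flag, and the visited-page trajectory; it illustrates one plausible way to compute per-difficulty success and loop rates, and is not the authors' analysis code.

```python
from collections import Counter

def summarize_trajectories(results: list[dict]) -> dict:
    """Aggregate logged games into per-difficulty success and loop rates.

    Each record is assumed to look like:
    {"difficulty": "easy" | "hard", "success": bool, "trajectory": [page, ...]}.
    A game counts as containing a loop if any page is visited more than once.
    """
    summary = {}
    for difficulty in ("easy", "hard"):
        games = [r for r in results if r["difficulty"] == difficulty]
        if not games:
            continue
        success_rate = sum(r["success"] for r in games) / len(games)
        looped = sum(
            1 for r in games
            if max(Counter(r["trajectory"]).values()) > 1
        )
        summary[difficulty] = {
            "success_rate": success_rate,
            "loop_rate": looped / len(games),
        }
    return summary
```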

Key Points

  • LLM-WikiRace is a benchmark for evaluating the planning, reasoning, and world knowledge capabilities of LLMs.
  • The benchmark requires models to navigate Wikipedia hyperlinks to reach a target page from a given source.
  • While LLMs demonstrate superhuman performance on the easy level, their performance drops sharply on the hard difficulty level.

Merits

Strength

The LLM-WikiRace benchmark provides a valuable tool for evaluating the limitations of current LLMs and identifying areas for future research. It highlights the importance of world knowledge and planning capabilities in LLMs, as well as the need for improved long-horizon reasoning and replanning abilities.

Demerits

Limitation

The benchmark may not be representative of real-world planning scenarios: navigation is confined to Wikipedia's hyperlink graph, which is cleaner and more uniformly structured than the data and action spaces encountered in most real-world tasks.

Expert Commentary

The LLM-WikiRace benchmark presents a novel and valuable approach to evaluating the performance of LLMs. By showing that world knowledge alone is not enough, it underscores the need for improved long-horizon reasoning and replanning abilities. Although its reliance on Wikipedia's hyperlink graph limits how representative it is of real-world scenarios, it gives developers and researchers a useful tool for evaluating LLMs and identifying areas for improvement. The findings of this study may also inform policy decisions related to the development and deployment of LLMs, emphasizing the need for these models to possess strong planning and reasoning capabilities.

Recommendations

  • Developers and researchers should consider using the LLM-WikiRace benchmark to evaluate the performance of LLMs and identify areas for improvement.
  • The findings of this study should inform policy decisions related to the development and deployment of LLMs, emphasizing the need for these models to possess strong planning and reasoning capabilities.
