
Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?


Berry Gerrits

arXiv:2602.15867v1 Announce Type: cross Abstract: In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure game first released in 1977. The game's dialogue-based structure provides a controlled environment for assessing how LLM-based chatbots interpret natural language descriptions and generate appropriate action sequences to succeed in the game. We test the performance of leading proprietary models - ChatGPT, Claude, and Gemini - under both minimal and detailed instructions, measuring game progress through achieved scores as the primary metric. Our results reveal that all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points. Notably, providing detailed game instructions offers no improvement, nor does enabling "extended thinking". Qualitative analysis of the models' reasoning processes reveals fundamental limitations: repeated unsuccessful actions suggesting an inability to reflect on one's own thinking, inconsistent persistence of strategies, and failure to learn from previous attempts despite access to conversation history. These findings suggest substantial limitations in current LLMs' metacognitive abilities and problem-solving capabilities within the domain of text-based games, raising questions about the nature and extent of their reasoning capabilities.

Executive Summary

The article evaluates the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) by assessing their performance in the 1977 text-based adventure game Zork. The study tests leading proprietary models—ChatGPT, Claude, and Gemini—under varying instruction conditions, measuring game progress through achieved scores. Results indicate that all models achieve less than 10% completion, with the best-performing model, Claude Opus 4.5, reaching only approximately 75 out of 350 possible points. Detailed instructions and extended thinking do not improve performance. Qualitative analysis reveals fundamental limitations in LLMs' metacognitive abilities, including repeated unsuccessful actions, inconsistent strategy persistence, and failure to learn from previous attempts. These findings raise questions about the nature and extent of LLMs' reasoning capabilities within text-based problem-solving domains.
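The evaluation setup the paper describes can be pictured as a simple loop: the game emits text, the model replies with a command, and progress is measured by the score the game reports. The sketch below illustrates that loop only; `toy_game` and `scripted_model` are stand-ins invented for this illustration, not the authors' harness or a real Zork interpreter or LLM API.

```python
def toy_game():
    """A two-room stand-in for a Zork interpreter.

    Returns (observe, act): observe() yields the current room text,
    score, and a done flag; act(command) applies one player command.
    """
    state = {"room": "field", "score": 0, "done": False}

    def observe():
        if state["room"] == "field":
            return "West of House. There is a small mailbox here.", state["score"], state["done"]
        return "You are inside the house.", state["score"], state["done"]

    def act(command):
        # Award points for the one command that makes progress.
        if command == "enter house" and state["room"] == "field":
            state["room"] = "house"
            state["score"] += 10
            state["done"] = True

    return observe, act


def scripted_model(transcript):
    """Stand-in for an LLM call: the real system would send the full
    transcript to a chat model and parse a command from its reply."""
    return "enter house"


def play(observe, act, model, max_turns=20):
    """Feed game text to the model, apply its commands, return the final score."""
    transcript = []
    for _ in range(max_turns):
        text, score, done = observe()
        transcript.append(text)
        if done:
            return score
        command = model(transcript)
        transcript.append(f"> {command}")
        act(command)
    return observe()[1]
```

Under this framing, the paper's headline result is that even the strongest model's loop stalls early: the score plateaus well below the 350-point maximum regardless of how much instruction is prepended to the transcript.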

Key Points

  • LLMs tested include ChatGPT, Claude, and Gemini.
  • All models achieve less than 10% completion in Zork.
  • Detailed instructions and extended thinking do not improve performance.
  • Qualitative analysis reveals fundamental limitations in metacognitive abilities.
  • Findings question the reasoning capabilities of LLMs in text-based problem-solving.

Merits

Controlled Environment

The use of Zork as a controlled environment for assessing LLM performance is a strength, providing a standardized metric for evaluating problem-solving and reasoning capabilities.

Comprehensive Testing

The study tests multiple leading LLMs under varying conditions, offering a broad perspective on their capabilities and limitations.

Demerits

Limited Scope

The study focuses solely on text-based adventure games, which may not fully represent the broader range of problem-solving scenarios LLMs might encounter.

Quantitative Metrics

Relying primarily on game scores as a metric may not capture the nuanced aspects of reasoning and problem-solving capabilities.

Expert Commentary

The study provides a valuable contribution to the understanding of LLMs' problem-solving and reasoning capabilities. Using Zork as a controlled environment is a novel approach that offers insight into how LLMs interpret natural language descriptions and generate action sequences. However, the study's limited scope and reliance on quantitative metrics may not fully capture the complexity of LLMs' reasoning abilities. The findings highlight significant limitations in LLMs' metacognitive abilities, which are crucial for effective problem-solving, and raise important questions about the current state of LLMs and the further development needed to enhance these capabilities. The limitations also carry practical weight for the responsible deployment of LLMs, particularly in applications requiring complex reasoning and decision-making.

Recommendations

  • Further research should explore a broader range of problem-solving scenarios to provide a more comprehensive assessment of LLMs' capabilities.
  • Developers should prioritize enhancing LLMs' metacognitive abilities through improved training methodologies and algorithms.
