Agentic LLM Planning via Step-Wise PDDL Simulation: An Empirical Characterisation
arXiv:2603.06064v1

Abstract: Task planning, the problem of sequencing actions to reach a goal from an initial state, is a core capability requirement for autonomous robotic systems. Whether large language models (LLMs) can serve as viable planners alongside classical symbolic methods remains an open question. We present PyPDDLEngine, an open-source Planning Domain Definition Language (PDDL) simulation engine that exposes planning operations as LLM tool calls through a Model Context Protocol (MCP) interface. Rather than committing to a complete action sequence upfront, the LLM acts as an interactive search policy that selects one action at a time, observes each resulting state, and can reset and retry. We evaluate four approaches on 102 International Planning Competition (IPC) Blocksworld instances under a uniform 180-second budget: Fast Downward lama-first and seq-sat-lama-2011 as classical baselines, direct LLM planning (Claude Haiku 4.5), and agentic LLM planning via PyPDDLEngine. Fast Downward achieves 85.3% success. The direct and agentic LLM approaches achieve 63.7% and 66.7%, respectively, a consistent but modest three-percentage-point advantage for the agentic approach at $5.7\times$ higher token cost per solution. Across most co-solved difficulty blocks, both LLM approaches produce shorter plans than seq-sat-lama-2011 despite its iterative quality improvement, a result consistent with training-data recall rather than generalisable planning. These results suggest that agentic gains depend on the nature of environmental feedback. Coding agents benefit from externally grounded signals such as compiler errors and test failures, whereas PDDL step feedback is self-assessed, leaving the agent to evaluate its own progress without external verification.
Executive Summary
This study investigates the viability of large language models (LLMs) as autonomous task planners, evaluated through PDDL-based simulation. Using PyPDDLEngine, the authors cast the LLM as an interactive search policy that selects one action at a time, observes the resulting state, and resets on failure. Evaluated against classical planners (Fast Downward) and direct LLM planning, agentic LLM planning via PyPDDLEngine achieves 66.7% success on Blocksworld instances, modestly outperforming direct LLM planning (63.7%) at 5.7× the token cost per solution. Notably, both LLM approaches yield shorter plans than seq-sat-lama-2011 on most co-solved difficulty blocks, suggesting reliance on training-data recall rather than generalizable planning reasoning. The findings underscore that agentic gains are context-dependent, hinging on the availability and nature of externally grounded feedback signals.
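The interactive loop described above can be sketched in a few lines. This is a hedged illustration, not PyPDDLEngine's actual API: `Action`, `StepEngine`, and `agentic_plan` are hypothetical names, and the `choose` callback stands in for the LLM policy that would pick an action from the observed state.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a step-wise PDDL simulator; the real
# PyPDDLEngine class and method names may differ.
@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset      # facts that must hold before applying
    add: frozenset      # facts made true by the action
    delete: frozenset   # facts made false by the action

class StepEngine:
    """Minimal step-wise simulator: apply one action, observe the new state."""
    def __init__(self, init, goal, actions):
        self._init = frozenset(init)
        self.goal = frozenset(goal)
        self.actions = {a.name: a for a in actions}
        self.reset()

    def reset(self):
        self.state = self._init
        return self.state

    def applicable(self):
        return [a.name for a in self.actions.values() if a.pre <= self.state]

    def step(self, name):
        a = self.actions[name]
        if not a.pre <= self.state:
            raise ValueError(f"preconditions of {name} not satisfied")
        self.state = (self.state - a.delete) | a.add
        return self.state

    def solved(self):
        return self.goal <= self.state

def agentic_plan(engine, choose, max_steps=20, max_retries=3):
    """Interactive search: `choose` plays the role of the LLM, picking one
    applicable action at a time; dead ends trigger a reset and retry."""
    for _ in range(max_retries):
        engine.reset()
        plan = []
        for _ in range(max_steps):
            if engine.solved():
                return plan
            options = engine.applicable()
            if not options:
                break  # dead end: reset and retry
            name = choose(engine.state, options)
            engine.step(name)
            plan.append(name)
        if engine.solved():
            return plan
    return None  # budget exhausted

# Toy two-step problem standing in for a Blocksworld instance.
go_ab = Action("go-ab", frozenset({"at-a"}), frozenset({"at-b"}), frozenset({"at-a"}))
go_bc = Action("go-bc", frozenset({"at-b"}), frozenset({"at-c"}), frozenset({"at-b"}))
engine = StepEngine({"at-a"}, {"at-c"}, [go_ab, go_bc])
plan = agentic_plan(engine, lambda state, opts: sorted(opts)[0])
# plan == ["go-ab", "go-bc"]
```

The key design point the paper exploits is that `choose` sees the *current* state after every step, rather than having to emit a full plan in one shot.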
Key Points
- ▸ Agentic LLM planning via PyPDDLEngine trails the classical Fast Downward baselines (66.7% vs. 85.3% success) while modestly outperforming direct LLM planning, at a higher token cost.
- ▸ LLM approaches produce shorter plans than classical iterative planners on most co-solved tasks, indicating potential reliance on training-data recall.
- ▸ Agentic planning gains are contingent on environmental feedback; PDDL step feedback lacks external verification, limiting generalizability.
Merits
Empirical Rigor
The study conducts a comprehensive evaluation across 102 IPC Blocksworld instances under a uniform 180-second budget, providing quantifiable comparative metrics across four approaches.
Novelty of Integration
The use of a Model Context Protocol (MCP) to expose PDDL operations as LLM tool calls represents a novel architectural bridge between symbolic planning and LLMs.
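The shape of that bridge can be sketched as a tool-dispatch layer: each planning operation becomes a named tool the LLM can invoke with JSON arguments. The tool names, payloads, and `PlannerToolServer` class below are illustrative assumptions, not PyPDDLEngine's actual MCP surface; a real MCP server would register each method as a tool with a declared schema.

```python
import json

# Hedged sketch of exposing planning operations as tool calls.
# Tool names and argument shapes are assumptions for illustration.
class PlannerToolServer:
    def __init__(self, init, goal, actions):
        self._init = frozenset(init)
        self.goal = frozenset(goal)
        # action name -> (preconditions, add effects, delete effects)
        self.actions = actions
        self.state = self._init

    # Each public method corresponds to one tool the LLM can call.
    def reset(self):
        self.state = self._init
        return sorted(self.state)

    def get_state(self):
        return {"facts": sorted(self.state),
                "goal_reached": self.goal <= self.state}

    def apply_action(self, name):
        pre, add, delete = self.actions[name]
        if not pre <= self.state:
            return {"error": f"preconditions of {name} not satisfied"}
        self.state = (self.state - delete) | add
        return self.get_state()

def handle_tool_call(server, request_json):
    """Dispatch one JSON tool call: {"tool": ..., "args": {...}}."""
    req = json.loads(request_json)
    return getattr(server, req["tool"])(**req.get("args", {}))

# One-action toy: stacking a held block a onto a clear block b.
acts = {"stack-a-b": (frozenset({"holding-a", "clear-b"}),
                      frozenset({"on-a-b"}),
                      frozenset({"holding-a", "clear-b"}))}
srv = PlannerToolServer({"holding-a", "clear-b"}, {"on-a-b"}, acts)
out = handle_tool_call(srv, '{"tool": "apply_action", "args": {"name": "stack-a-b"}}')
# out == {"facts": ["on-a-b"], "goal_reached": True}
```

Returning structured state after every call is what lets the model behave as a search policy rather than a one-shot plan generator.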
Demerits
Cost Disadvantage
Agentic LLM planning incurs a 5.7x higher token cost per solution compared to direct LLM planning, raising scalability concerns for real-world deployment.
Generalizability Concern
Results suggest LLM planning is data-recall-dependent rather than inductive; this limits applicability to novel or unseen domains.
Expert Commentary
The paper makes a valuable contribution to the evolving landscape of AI planning by empirically characterizing the limitations and affordances of agentic LLM-based planning. The decision to frame LLM planning as an interactive search policy—rather than a deterministic sequence generator—is conceptually sound and aligns with the nature of LLMs as probabilistic, iterative responders. However, the observed reliance on training-data recall raises deeper questions: Is the agentic advantage a function of memorized solution templates or emergent reasoning? The absence of a control for domain novelty further limits interpretability. Moreover, the token cost disparity introduces a practical barrier to adoption in resource-constrained robotic platforms. While the MCP interface is technically elegant, its long-term viability depends on whether LLM capabilities evolve to support persistent state awareness beyond iterative reset cycles. Finally, the contrast between self-assessed PDDL feedback and externally validated compiler signals illuminates a fundamental design tension: autonomy versus accountability. This work does not resolve these tensions but maps them with empirical clarity, setting a benchmark for future hybrid planning research.
Recommendations
- ✓ Future work should incorporate domain-novelty metrics to isolate the impact of training data recall from generalizable reasoning.
- ✓ Develop external verification interfaces (e.g., lightweight state-diff APIs) that ground agentic feedback while containing token cost.
- ✓ Explore ensemble architectures that combine agentic LLM selection with classical heuristic pruning to balance cost and success rate.
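The state-diff verification idea in the second recommendation could be quite small in practice: an externally computed summary of what each step changed and how far the goal remains. The two functions below are a hypothetical sketch of such an interface; the names `state_diff` and `goal_progress` are illustrative, not from the paper.

```python
# Hedged sketch of a lightweight state-diff verification API.
# Facts are modeled as plain strings; function names are illustrative.

def state_diff(before, after):
    """Externally computed report of which facts a step gained and lost,
    so the agent need not self-assess its own progress."""
    before, after = frozenset(before), frozenset(after)
    return {"added": sorted(after - before), "removed": sorted(before - after)}

def goal_progress(state, goal):
    """Report which goal facts are satisfied and which are still missing."""
    goal = frozenset(goal)
    sat = goal & frozenset(state)
    return {"satisfied": sorted(sat), "missing": sorted(goal - sat)}

# Example: one step swaps the stacking order of two blocks.
diff = state_diff({"on-b-a", "clear-b"}, {"on-a-b", "clear-b"})
# diff == {"added": ["on-a-b"], "removed": ["on-b-a"]}
progress = goal_progress({"on-a-b"}, {"on-a-b", "clear-a"})
# progress == {"satisfied": ["on-a-b"], "missing": ["clear-a"]}
```

Because both reports are computed by the engine rather than the model, they play the role that compiler errors and test failures play for coding agents: an externally grounded signal against which the agent can be held accountable.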