Traversal-as-Policy: Log-Distilled Gated Behavior Trees as Externalized, Verifiable Policies for Safe, Robust, and Efficient Agents

arXiv:2603.05517v1 Announce Type: cross Abstract: Autonomous LLM agents fail because long-horizon policy remains implicit in model weights and transcripts, while safety is retrofitted post hoc. We propose Traversal-as-Policy: distill sandboxed OpenHands execution logs into a single executable Gated Behavior Tree (GBT) and treat tree traversal -- rather than unconstrained generation -- as the control policy whenever a task is in coverage. Each node encodes a state-conditioned action macro mined and merge-checked from successful trajectories; macros implicated by unsafe traces attach deterministic pre-execution gates over structured tool context and bounded history, updated under experience-grounded monotonicity so previously rejected unsafe contexts cannot be re-admitted. At runtime, a lightweight traverser matches the base model's intent to child macros, executes one macro at a time under global and node-local gating, and when stalled performs risk-aware shortest-path recovery to a feasible success leaf; the visited path forms a compact spine memory that replaces transcript replay. Evaluated in a unified OpenHands sandbox on 15+ software, web, reasoning, and safety/security benchmarks, GBT improves success while driving violations toward zero and reducing cost. On SWE-bench Verified (Protocol A, 500 issues), GBT-SE raises success from 34.6% to 73.6%, reduces violations from 2.8% to 0.2%, and cuts token/character usage from 208k/820k to 126k/490k; with the same distilled tree, 8B executors more than double success on SWE-bench Verified (14.0% to 58.8%) and WebArena (9.1% to 37.3%).
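The "risk-aware shortest-path recovery" mentioned in the abstract can be pictured as a shortest-path search in which every edge carries a step cost plus a risk penalty. The sketch below is an illustration only. The graph encoding, the `risk_weight` parameter, the `blocked` set, and the function name `recover_path` are all assumptions rather than the paper's definitions, and Dijkstra's algorithm stands in for whatever search the authors actually use.

```python
import heapq

def recover_path(graph, start, success_leaves, blocked=frozenset(), risk_weight=10.0):
    """Return (cost, path) to the cheapest reachable success leaf, or None.

    graph: dict mapping node -> list of (child, step_cost, risk) tuples.
    blocked: nodes treated as infeasible (e.g. a gate currently rejects them).
    Edge weight = step_cost + risk_weight * risk  (hypothetical weighting).
    """
    frontier = [(0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        if node in success_leaves:
            return cost, path
        for child, step_cost, risk in graph.get(node, []):
            if child in blocked:
                continue
            new_cost = cost + step_cost + risk_weight * risk
            if new_cost < best.get(child, float("inf")):
                best[child] = new_cost
                heapq.heappush(frontier, (new_cost, child, path + [child]))
    return None

# Toy graph: a short but risky route and a longer zero-risk route.
graph = {
    "stalled": [("patch_direct", 1.0, 0.5), ("run_tests", 1.0, 0.0)],
    "run_tests": [("patch_checked", 1.0, 0.0)],
    "patch_direct": [("done", 1.0, 0.0)],
    "patch_checked": [("done", 1.0, 0.0)],
}
result = recover_path(graph, "stalled", {"done"})
print(result)
```

With the default weight the search prefers the longer zero-risk route through `run_tests`; setting `risk_weight=0.0` makes it fall back to the plain shortest path through `patch_direct`.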

Executive Summary

The article 'Traversal-as-Policy: Log-Distilled Gated Behavior Trees as Externalized, Verifiable Policies for Safe, Robust, and Efficient Agents' proposes moving the control policy of autonomous Large Language Model (LLM) agents out of model weights and transcripts and into an explicit, inspectable structure. The method, Traversal-as-Policy, distills sandboxed OpenHands execution logs into a single executable Gated Behavior Tree (GBT) and, for tasks in coverage, treats tree traversal rather than unconstrained generation as the control policy. Each node holds a state-conditioned action macro mined from successful trajectories. Macros implicated in unsafe traces carry deterministic pre-execution gates, which are updated under experience-grounded monotonicity so that previously rejected unsafe contexts are never re-admitted. Evaluated in a unified OpenHands sandbox on more than 15 benchmarks, the approach reports higher success with fewer violations and lower cost; on SWE-bench Verified, success rises from 34.6% to 73.6% while violations fall from 2.8% to 0.2%.
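To make the idea of traversal as the control policy concrete, here is a minimal sketch of a gated tree and a traverser that records the visited path (the "spine"). Every name in it (`Node`, `traverse`, `make_picker`, `mark`) is hypothetical, and the intent matcher is reduced to a scripted list of child names; the paper's macro mining, merge checks, and intent matching are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Context = Dict[str, str]
Gate = Callable[[Context], bool]  # returns True when the context is allowed

@dataclass
class Node:
    name: str
    macro: Callable[[Context], None]          # the action macro for this node
    gates: List[Gate] = field(default_factory=list)      # node-local gates
    children: List["Node"] = field(default_factory=list)

def traverse(root, pick_child, ctx, global_gates=()):
    """Execute one macro at a time; return the visited path (the spine)."""
    spine = []
    node = root
    while node is not None:
        # Global and node-local gates are checked before the macro runs.
        if not all(gate(ctx) for gate in list(global_gates) + node.gates):
            break
        node.macro(ctx)
        spine.append(node.name)
        node = pick_child(node, ctx) if node.children else None
    return spine

def make_picker(intents):
    """Stand-in for intent matching: follow a scripted list of child names."""
    remaining = iter(intents)
    def pick(node, ctx):
        want = next(remaining, None)
        return next((c for c in node.children if c.name == want), None)
    return pick

def mark(name):
    """Toy macro that only records that it ran."""
    def macro(ctx):
        ctx[name] = "done"
    return macro

run_tests = Node("run_tests", mark("run_tests"))
edit_file = Node("edit_file", mark("edit_file"), children=[run_tests])
delete_ws = Node("delete_workspace", mark("delete_workspace"),
                 gates=[lambda ctx: ctx.get("confirmed") == "yes"])
root = Node("inspect_repo", mark("inspect_repo"), children=[edit_file, delete_ws])

spine = traverse(root, make_picker(["edit_file", "run_tests"]), {})
blocked_ctx = {}
blocked_spine = traverse(root, make_picker(["delete_workspace"]), blocked_ctx)
print(spine, blocked_spine)
```

In the second call the gate on `delete_workspace` rejects the context, so the macro never runs and the spine stops at the root. The spine is just a list of node names, which is why it is far smaller than a full transcript.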

Key Points

  • Traversal-as-Policy moves an agent's long-horizon policy out of model weights and transcripts and into an explicit, executable tree.
  • The method distills sandboxed OpenHands execution logs into a single executable GBT.
  • The GBT combines state-conditioned action macros, deterministic pre-execution gates, and experience-grounded monotonicity.
  • On SWE-bench Verified (Protocol A, 500 issues), success rises from 34.6% to 73.6%, violations fall from 2.8% to 0.2%, and token/character usage drops from 208k/820k to 126k/490k.
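For a sense of scale, the figures reported in the abstract can be turned into relative changes with a few lines of arithmetic. The numbers are copied from the abstract; the helper `relative_change` and the dictionary names are only illustrative.

```python
def relative_change(before, after):
    """Fractional change from before to after (negative means a reduction)."""
    return (after - before) / before

# SWE-bench Verified (Protocol A), values as reported in the abstract.
reported = {
    "success_rate_pct": (34.6, 73.6),
    "violation_rate_pct": (2.8, 0.2),
    "tokens_k": (208, 126),
    "chars_k": (820, 490),
}
for name, (before, after) in reported.items():
    print(f"{name}: {before} -> {after} ({relative_change(before, after):+.1%})")

# 8B executors using the same distilled tree: ratio of after to before.
ratios = {
    "swe_bench_verified": 58.8 / 14.0,
    "webarena": 37.3 / 9.1,
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

Token and character usage both fall by roughly 40%, and both 8B ratios are above 2, in line with the abstract's "more than double" wording.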

Merits

Improved Safety and Robustness

The GBT's incorporation of deterministic pre-execution gates and experience-grounded monotonicity improves safety and robustness by preventing previously rejected unsafe contexts from being re-admitted.
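One way to read "experience-grounded monotonicity" is that the set of rejected contexts can only grow. The sketch below illustrates that property under strong simplifying assumptions: the class `MonotoneGate` is hypothetical, contexts are flat dictionaries matched exactly, and the paper's gates over structured tool context and bounded history are presumably richer predicates.

```python
class MonotoneGate:
    """Pre-execution gate whose set of rejected contexts can only grow."""

    def __init__(self):
        self._rejected = set()

    @staticmethod
    def _key(context):
        # Canonical, hashable form of a context (assumed to be a flat dict).
        return frozenset(context.items())

    def reject(self, context):
        """Record a context observed in an unsafe trace."""
        self._rejected.add(self._key(context))

    def allows(self, context):
        """Deterministic check run before a macro executes."""
        return self._key(context) not in self._rejected

    def update(self, new_rejections):
        """Merge further unsafe contexts; there is deliberately no removal method."""
        for context in new_rejections:
            self.reject(context)

gate = MonotoneGate()
unsafe = {"tool": "bash", "command": "rm -rf /"}
gate.reject(unsafe)

# A later update adds rejections but cannot re-admit the earlier one.
gate.update([{"tool": "bash", "command": "chmod -R 777 /"}])
print(gate.allows(unsafe))
print(gate.allows({"tool": "bash", "command": "ls"}))
```

Because the class exposes no way to shrink the rejected set, any sequence of updates keeps earlier rejections in force, which is the property the summary describes.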

Efficient Policy Control

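Because the traverser executes one macro at a time and keeps only the visited path as a compact spine memory instead of replaying the full transcript, the approach uses fewer resources: on SWE-bench Verified, token/character usage falls from 208k/820k to 126k/490k.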

Scalability and Flexibility

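The same distilled tree can be reused with smaller models: with it, 8B executors more than double their success on SWE-bench Verified (14.0% to 58.8%) and WebArena (9.1% to 37.3%). The evaluation also spans more than 15 software, web, reasoning, and safety/security benchmarks within one OpenHands sandbox.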

Demerits

Complexity and Overhead

The GBT's design and implementation may introduce additional complexity and overhead, which could impact performance and scalability.

Dependence on Log Quality

The effectiveness of the GBT depends on the quality and quantity of the logs used for distillation, which may be challenging to obtain in practice.

Limited Generalizability

The GBT's performance and effectiveness may be limited to the specific benchmarking environments and tasks used in the study.

Expert Commentary

The article presents an intriguing approach to controlling autonomous LLM agents: it targets two weaknesses the authors identify in current agents, namely policy that stays implicit in weights and transcripts, and safety that is retrofitted after the fact. The study is broad in scope, but open questions remain about the complexity and overhead of the GBT, its dependence on log quality, and how far the results generalize beyond the benchmarks studied. Even so, the reported gains in safety, robustness, and efficiency make the GBT a promising direction for further research.

Recommendations

  • Future studies should aim to further investigate the GBT's limitations, including its complexity and overhead, dependence on log quality, and limited generalizability.
  • The GBT's design and implementation should be explored in more diverse domains and tasks to assess its scalability and applicability.
