Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks
arXiv:2602.19008v1 Announce Type: new Abstract: Why do language agents fail on tasks they are capable of solving? We argue that many such failures are reliability failures caused by stochastic drift from a task's latent solution structure, not capability failures. Every well-defined tool-use task imposes a canonical solution path (i.e., a convergent set of tool invocations shared across successful runs) and agent success depends critically on whether a trajectory stays within this path's operating envelope. We establish this causally using a natural experiment that holds model capability and task difficulty fixed by construction. We analyze trajectories from the Toolathlon benchmark: 22 frontier models each attempt 108 real-world tool-use tasks across 3 independent runs, yielding 515 model$\times$task units where the same model succeeds on some runs and fails on others due to LLM sampling stochasticity alone. Within these units, successful runs adhere significantly more closely to the canonical solution path than failed runs ($+$0.060 Jaccard, $p<0.0001$, $n=488$ units, 95% CI [+0.043, +0.077]). This result survives six robustness checks including cross-model-family leave-one-out validation. Critically, the causal mechanism is gradual and self-reinforcing: the adherence gap is statistically indistinguishable from zero through the first 50% of the trajectory, ruling out early-branching selection bias, and each off-canonical tool call raises the probability that the next call is also off-canonical by 22.7 percentage points ($\hat{\beta}=+0.227$, $p<0.0001$), more than doubling the baseline rate. These findings imply that agent reliability cannot be improved by capability scaling alone, but offer a highly actionable intervention: a simple monitor that restarts the bottom tercile of runs based on mid-trajectory canonical adherence lifts success rates by $+$8.8 percentage points among intervened runs.
Executive Summary
This study investigates reliability failures of language agents on long-horizon tool-use tasks, arguing that many such failures stem from stochastic drift away from a task's latent solution structure rather than from a lack of capability. Using the Toolathlon benchmark as a natural experiment, the researchers analyze trajectories from 22 frontier models across 108 real-world tool-use tasks, isolating 515 model×task units in which the same model both succeeds and fails due to sampling stochasticity alone. Within these units, successful runs adhere significantly more closely to the canonical solution path than failed runs (+0.060 Jaccard), and the deviation is self-reinforcing: each off-canonical tool call raises the probability that the next call is also off-canonical by 22.7 percentage points. The authors conclude that agent reliability cannot be improved by capability scaling alone, and show that a simple monitor that restarts the bottom tercile of runs by mid-trajectory canonical adherence lifts success rates by +8.8 percentage points among intervened runs.
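The core quantities are straightforward to illustrate. The sketch below (Python, with hypothetical tool names; the paper's exact construction may differ) shows one plausible way to derive a canonical path as the set of tools shared across successful runs, and to score a run's Jaccard adherence to that path.

```python
from collections import Counter

def canonical_path(successful_runs, threshold=1.0):
    """Tools invoked in at least `threshold` fraction of successful runs.
    `successful_runs` is a list of runs, each a list of tool-call names."""
    counts = Counter(tool for run in successful_runs for tool in set(run))
    n = len(successful_runs)
    return {tool for tool, c in counts.items() if c / n >= threshold}

def jaccard_adherence(run, canonical):
    """Jaccard similarity between a run's tool-call set and the canonical path."""
    tools = set(run)
    union = tools | canonical
    return len(tools & canonical) / len(union) if union else 1.0
```

For example, if two successful runs share only `search` and `read`, a run that calls `search`, `read`, and an off-canonical `write` scores 2/3.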
Key Points
- ▸ Language agents often fail on tasks they are capable of solving; these are reliability failures caused by stochastic drift from the task's canonical solution path, not capability failures.
- ▸ Successful runs adhere more closely to the canonical solution path than failed runs.
- ▸ Deviation is self-reinforcing: each off-canonical tool call raises the probability that the next call is also off-canonical by 22.7 percentage points, more than doubling the baseline rate.
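The self-reinforcement effect can be summarized as a first-order Markov contrast: how much more likely is an off-canonical call after an off-canonical call than after an on-canonical one? The sketch below (hypothetical data; the paper estimates the analogous effect with a regression coefficient, $\hat{\beta}=+0.227$) makes the quantity concrete.

```python
def self_reinforcement(runs, canonical):
    """Return P(next call off-canonical | current off-canonical)
    minus P(next call off-canonical | current on-canonical),
    pooled over all adjacent tool-call pairs in `runs`."""
    off_after_off = off_after_on = n_off = n_on = 0
    for run in runs:
        flags = [tool not in canonical for tool in run]
        for prev, curr in zip(flags, flags[1:]):
            if prev:
                n_off += 1
                off_after_off += curr
            else:
                n_on += 1
                off_after_on += curr
    p_given_off = off_after_off / n_off if n_off else 0.0
    p_given_on = off_after_on / n_on if n_on else 0.0
    return p_given_off - p_given_on
```

A positive value indicates drift that compounds: once a trajectory leaves the canonical path, it tends to stay off it.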
Merits
Strength in Causal Mechanism Identification
The study establishes a causal link between canonical path deviation and agent failure using a natural experiment that holds model capability and task difficulty fixed by construction, backed by six robustness checks including cross-model-family leave-one-out validation.
Methodological Innovation
The study's use of the Toolathlon benchmark and mid-trajectory canonical adherence monitoring represents a novel approach to evaluating agent reliability.
Demerits
Limitation in Generalizability
The study's findings may not generalize to all types of tasks or agent architectures, and further research is needed to explore these limitations.
Dependence on Benchmark Data
The study's results are heavily dependent on the quality and representativeness of the Toolathlon benchmark data.
Expert Commentary
This study represents a significant contribution to the field of artificial intelligence safety and robustness, highlighting the importance of reliability-focused interventions in improving agent performance. The natural experiment design and robustness checks strengthen the causal credibility of the results. However, the study's dependence on a single benchmark and open questions about generalizability must be taken into account when interpreting the results. The findings have practical implications for building more reliable agent systems, and policy implications for their safe and responsible deployment.
Recommendations
- ✓ Future research should explore the generalizability of the study's findings to different types of tasks and agent architectures.
- ✓ The development of more robust and reliable artificial intelligence systems should prioritize reliability-focused interventions, such as monitoring and restarting runs based on mid-trajectory canonical adherence.
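The proposed intervention is simple enough to sketch. Assuming each run exposes a mid-trajectory adherence score (e.g. Jaccard adherence computed over the first half of its tool calls; the paper's monitor may differ in detail), flagging the bottom tercile for restart reduces to a rank-and-cut:

```python
def flag_for_restart(mid_adherence):
    """Flag the bottom tercile of runs by mid-trajectory canonical
    adherence. `mid_adherence` is a list of scores, one per run;
    returns the indices of runs to restart, lowest-scoring first."""
    ranked = sorted(range(len(mid_adherence)), key=lambda i: mid_adherence[i])
    return ranked[: len(mid_adherence) // 3]
```

Because the adherence gap only opens in the second half of trajectories, a mid-trajectory checkpoint is the earliest point at which such a monitor has signal to act on.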