
Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation

Hongliu Cao, Ilias Driouich, Eoin Thomas

Abstract (arXiv:2603.03116v1): Large Language Model (LLM)-based agents are increasingly adopted in high-stakes settings, but current benchmarks evaluate mainly whether a task was completed, not how. We introduce Procedure-Aware Evaluation (PAE), a framework that formalizes agent procedures as structured observations and exposes consistency relationships between what agents observe, communicate, and execute. PAE evaluates agents along complementary axes (Utility, Efficiency, Interaction Quality, Procedural Integrity) and applies multi-dimensional gating that categorically disqualifies corrupt outcomes. Evaluating state-of-the-art LLM agents on tau-bench yields findings at the axis, compliance, and benchmark levels. At the axis level, the dimensions capture non-redundant failure modes: utility masks reliability gaps, speed does not imply precision, and conciseness does not predict intent adherence. At the procedural compliance level, 27-78% of benchmark-reported successes are corrupt successes concealing violations across interaction and integrity. Furthermore, gating substantially collapses the Pass^4 rate and affects model rankings. The analysis of corrupt success cases reveals distinctive per-model failure signatures: GPT-5 spreads errors across policy, execution, and intent dimensions; Kimi-K2-Thinking concentrates 78% of violations in policy faithfulness and compliance; and Mistral-Large-3 is dominated by faithfulness failures. At the benchmark level, our analysis exposes structural flaws in the benchmark design, including task scope gaps, contradictory reward signals, and simulator artifacts that produce accidental successes.
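
The key mechanism in the abstract is multi-dimensional gating: a trial only counts as a success if the procedural axes are also satisfied, so a benchmark reward of 1 can still be disqualified as a corrupt success. The sketch below illustrates that idea in Python. The axis names follow the paper, but the 0-1 score scale, the threshold, and the choice of which axes act as gates are illustrative assumptions rather than the paper's actual scoring rules.

```python
from dataclasses import dataclass

@dataclass
class AxisScores:
    """One trial's scores on the four PAE axes (axis names from the paper;
    the 0-1 scale and the 1.0 threshold below are assumptions)."""
    utility: float               # did the final state match the task goal?
    efficiency: float            # e.g. a normalized step/latency budget
    interaction_quality: float   # consistency of what the agent communicated
    procedural_integrity: float  # consistency of what it observed and executed

def gated_success(s: AxisScores, threshold: float = 1.0) -> bool:
    # Multi-dimensional gating: a completed task only counts as a success if
    # the gated axes also clear their thresholds; otherwise it is disqualified.
    return (s.utility >= threshold
            and s.interaction_quality >= threshold
            and s.procedural_integrity >= threshold)

def is_corrupt_success(benchmark_reward: float, s: AxisScores) -> bool:
    # "Corrupt success": the benchmark reward says the task was completed,
    # but a gated axis was violated along the way.
    return benchmark_reward >= 1.0 and not gated_success(s)

# Example trial: the task reward is 1, but a procedural step was violated.
trial = AxisScores(utility=1.0, efficiency=0.8,
                   interaction_quality=1.0, procedural_integrity=0.0)
print(is_corrupt_success(benchmark_reward=1.0, s=trial))  # True
```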

Executive Summary

The paper introduces Procedure-Aware Evaluation (PAE), a framework that assesses Large Language Model (LLM)-based agents beyond task completion. PAE scores agents on utility, efficiency, interaction quality, and procedural integrity, revealing corrupt successes (task completions that conceal procedural violations) and non-redundant failure modes. The framework also exposes structural flaws in benchmark design and distinctive per-model failure signatures, underscoring the need for more comprehensive evaluation methods.
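
The abstract notes that gating "substantially collapses the Pass^4 rate". To see why, consider tau-bench's pass^k metric: the probability that all k of k i.i.d. trials on a task succeed, averaged over tasks. Once corrupt successes are disqualified, per-task success counts drop, and pass^k falls much faster than the plain success rate. The snippet below applies the standard unbiased estimator C(c, k) / C(n, k); the trial counts are hypothetical illustrative numbers, not results from the paper.

```python
from math import comb

def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Unbiased pass^k estimator: for a task with c successful trials out of
    n, the chance that k randomly chosen trials all succeed is C(c, k)/C(n, k);
    average this over tasks."""
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Hypothetical counts over 4 trials on 4 tasks (not numbers from the paper).
raw_successes = [4, 3, 4, 2]    # successes as counted by the benchmark reward
gated_successes = [2, 1, 4, 0]  # successes left after disqualifying corrupt ones

print(f"raw   Pass^4: {pass_hat_k(raw_successes, 4, 4):.2f}")    # 0.50
print(f"gated Pass^4: {pass_hat_k(gated_successes, 4, 4):.2f}")  # 0.25
```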

Key Points

  • Introduction of Procedure-Aware Evaluation (PAE) framework
  • Evaluation of LLM agents on tau-bench reveals corrupt successes and failure modes
  • Exposure of structural flaws in benchmark design and per-model failure signatures

Merits

Comprehensive Evaluation

PAE provides a more nuanced picture of LLM agent performance, moving beyond simple task-completion metrics.

Exposure of Corrupt Successes

The framework surfaces corrupt successes that would otherwise be reported as clean completions, which has significant implications for high-stakes settings.

Demerits

Limited Scope

The study evaluates a single benchmark (tau-bench), so its findings may not generalize to other domains or tasks.

Expert Commentary

The introduction of PAE marks a significant step forward in the evaluation of LLM agents, highlighting the importance of considering procedural integrity and corrupt successes. The framework's ability to expose distinctive per-model failure signatures and structural flaws in benchmark design has important implications for AI development and deployment. However, further research is needed to generalize the findings to other domains and tasks, and to develop more effective and comprehensive evaluation methods.

Recommendations

  • Develop evaluation frameworks that score agents across multiple dimensions (utility, efficiency, interaction quality, procedural integrity) rather than task completion alone
  • Prioritize transparency and accountability in AI development and deployment, with explicit attention to procedural integrity and the detection of corrupt successes

Sources

  • arXiv:2603.03116v1, "Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation," Hongliu Cao, Ilias Driouich, Eoin Thomas