
WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

arXiv:2602.17990v1

Abstract: LLM-based systems increasingly generate structured workflows for complex tasks. In practice, automatic evaluation of these workflows is difficult, because metric scores are often not calibrated, and score changes do not directly communicate the severity of workflow degradation. We introduce WorkflowPerturb, a controlled benchmark for studying workflow evaluation metrics. It works by applying realistic, controlled perturbations to golden workflows. WorkflowPerturb contains 4,973 golden workflows and 44,757 perturbed variants across three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each applied at severity levels of 10%, 30%, and 50%. We benchmark multiple metric families and analyze their sensitivity and calibration using expected score trajectories and residuals. Our results characterize systematic differences across metric families and support severity-aware interpretation of workflow evaluation scores.
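To make the perturbation setup in the abstract concrete, here is a minimal sketch of a Missing Steps perturbation applied at the three stated severity levels. It is an illustration under assumptions, not the authors' procedure: the abstract does not say how workflows are represented or how steps are selected, so the list-of-strings representation, the random selection, the rounding rule, and the function name `drop_steps` are all hypothetical.

```python
import random

SEVERITIES = (0.10, 0.30, 0.50)  # severity levels named in the abstract


def drop_steps(steps, severity, seed=0):
    """Remove roughly `severity` of the steps, preserving the original order.

    Assumptions (not from the paper): steps are chosen uniformly at random,
    at least one step is removed, and at least one step is kept.
    """
    if len(steps) < 2:
        raise ValueError("need at least two steps to perturb")
    rng = random.Random(seed)  # fixed seed so the variants are reproducible
    n_drop = min(len(steps) - 1, max(1, round(severity * len(steps))))
    dropped = set(rng.sample(range(len(steps)), n_drop))
    return [step for i, step in enumerate(steps) if i not in dropped]


# Toy 10-step golden workflow (hypothetical data).
golden = [f"step {i}" for i in range(1, 11)]
variants = {sev: drop_steps(golden, sev) for sev in SEVERITIES}

for sev, variant in variants.items():
    print(f"severity {sev:.0%}: {len(variant)} of {len(golden)} steps kept")
```

Compressed Steps and Description Changes would need their own perturbation functions; the abstract gives too little detail to sketch them without guessing.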

Madhav Kanda, Pedro Las-Casas, Alok Gautam Kumbhare, Rodrigo Fonseca, Sharad Agarwal · March 7, 2026

#cs.AI

Executive Summary

WorkflowPerturb is a controlled benchmark for studying metrics that evaluate LLM-generated multi-agent workflows. It applies three types of perturbation (Missing Steps, Compressed Steps, and Description Changes) to golden workflows at severity levels of 10%, 30%, and 50%, then analyzes how sensitive and how well calibrated each metric family is. The benchmark contains 4,973 golden workflows and 44,757 perturbed variants, which is consistent with nine variants per golden workflow (three perturbation types at three severity levels, and 4,973 × 9 = 44,757). The results characterize systematic differences across metric families and support severity-aware interpretation of workflow evaluation scores.

Key Points

▸ Introduction of WorkflowPerturb, a controlled benchmark for workflow evaluation metrics
▸ Application of realistic perturbations to golden workflows to analyze metric sensitivity and calibration
▸ Characterization of systematic differences across metric families
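The calibration analysis mentioned in the points above compares observed metric scores against an expected score trajectory and inspects the residuals. The sketch below shows the general idea only. The linear expected trajectory (score = 1 - severity), the toy `step_recall` metric, and the sample data are assumptions made for illustration; the abstract does not specify which trajectories or metrics the paper uses.

```python
def step_recall(golden, candidate):
    """Toy metric: fraction of golden steps that still appear in the candidate."""
    kept = set(candidate)
    return sum(step in kept for step in golden) / len(golden)


def residuals(observed, expected=lambda severity: 1.0 - severity):
    """Observed minus expected score at each severity level.

    Assumption: the default expected trajectory is linear in severity.
    """
    return {sev: score - expected(sev) for sev, score in observed.items()}


# Toy data (hypothetical): a 10-step golden workflow with trailing steps removed.
golden = [f"step {i}" for i in range(1, 11)]
candidates = {0.1: golden[:9], 0.3: golden[:7], 0.5: golden[:5]}

observed = {sev: step_recall(golden, cand) for sev, cand in candidates.items()}
res = residuals(observed)

for sev in sorted(res):
    print(f"severity {sev:.0%}: observed={observed[sev]:.2f} residual={res[sev]:+.2f}")
```

Under these assumptions, residuals near zero mean the metric tracks the expected trajectory, while consistently positive residuals would suggest the metric under-reacts to degradation and consistently negative ones that it over-reacts.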

Merits

Comprehensive Dataset

The dataset contains 4,973 golden workflows and 44,757 perturbed variants, making it a valuable resource for workflow evaluation research.

Demerits

Limited Generalizability

The study covers three perturbation types (Missing Steps, Compressed Steps, and Description Changes) at three severity levels (10%, 30%, and 50%), which may not be representative of all possible workflow degradation scenarios.

Expert Commentary

WorkflowPerturb is a step forward in building calibrated stress tests for workflow evaluation metrics. By providing a large dataset and characterizing differences across metric families, the study helps researchers interpret metric scores in terms of degradation severity and develop more reliable workflow evaluation methods. However, further research is needed to address the limited generalizability of the findings and to explore how well WorkflowPerturb applies to diverse workflow domains.

Recommendations

✓ Future studies should investigate the applicability of WorkflowPerturb to various workflow domains and tasks.
✓ Researchers should explore the development of more sophisticated perturbation methods to simulate real-world workflow degradation scenarios.

Sources

arXiv - cs.AI

WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

Related Articles

ConstitutionGPT: An AI-Powered Multilingual Legal Assistance System for Indian Citizens

AI Copyright Infringement: Navigating the Legal Risks of AI-Generated Content

The Rhetoric of Machine Learning

Busemann energy-based attention for emotion analysis in Poincaré discs
