Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation
arXiv:2602.12544v1 Announce Type: new Abstract: We present a scalable pipeline for automatically generating high-quality training data for web agents. In particular, a major challenge in identifying high-quality training instances is trajectory evaluation - quantifying how much progress was made towards task completion. We introduce a novel constraint-based evaluation framework that provides fine-grained assessment of progress towards task completion. This enables us to leverage partially successful trajectories, which significantly expands the amount of usable training data. We evaluate our method on a new benchmark we propose called BookingArena, which consists of complex booking tasks across 20 popular websites, and demonstrate that our distilled student model outperforms open-source approaches and matches or exceeds commercial systems, while being a significantly smaller model. Our work addresses the challenge of efficiently creating diverse, realistic web interaction datasets and provides a systematic evaluation methodology for complex structured web tasks.
Executive Summary
The article presents an innovative pipeline for generating high-quality training data for web agents, addressing the critical challenge of trajectory evaluation in task completion. The authors introduce a constraint-based evaluation framework that enables fine-grained assessment of progress, allowing the use of partially successful trajectories and significantly expanding the usable training data. The proposed method is evaluated on a new benchmark, BookingArena, which consists of complex booking tasks across 20 popular websites. The distilled student model demonstrates superior performance compared to open-source approaches and matches or exceeds commercial systems, despite being a smaller model. This work contributes to the development of diverse, realistic web interaction datasets and provides a systematic evaluation methodology for complex structured web tasks.
Key Points
- Introduction of a scalable pipeline for automatic generation of high-quality training data for web agents.
- Novel constraint-based evaluation framework for fine-grained assessment of task completion progress.
- Expansion of usable training data by leveraging partially successful trajectories.
- Evaluation on a new benchmark, BookingArena, consisting of complex booking tasks across 20 popular websites.
- Distilled student model outperforms open-source approaches and matches or exceeds commercial systems.
Merits
Innovative Evaluation Framework
The constraint-based evaluation framework provides a novel approach to assessing task completion progress, enabling the use of partially successful trajectories and significantly expanding the training data.
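The abstract does not specify the paper's constraint language or scoring rule, but the core idea of a constraint-based partial-credit evaluator can be illustrated with a minimal sketch. All names, the example task, and the "fraction of constraints satisfied" scoring rule below are assumptions for illustration, not the authors' actual implementation:

```python
# Hypothetical sketch of constraint-based trajectory evaluation:
# decompose a task into verifiable constraints, then score a trajectory
# by the fraction of constraints its end state satisfies. Scores in
# (0, 1) identify partially successful trajectories that a binary
# success/failure judge would discard.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    """One verifiable condition the final state should satisfy."""
    name: str
    check: Callable[[dict], bool]

def progress_score(end_state: dict, constraints: list[Constraint]) -> float:
    """Return the fraction of constraints satisfied (1.0 = full success)."""
    if not constraints:
        return 0.0
    satisfied = sum(c.check(end_state) for c in constraints)
    return satisfied / len(constraints)

# Example: a booking task decomposed into three constraints.
task = [
    Constraint("destination", lambda s: s.get("destination") == "Tokyo"),
    Constraint("date", lambda s: s.get("depart") == "2025-03-01"),
    Constraint("passengers", lambda s: s.get("passengers") == 2),
]
# An agent run that got the destination and date right but not the
# passenger count still earns partial credit (2/3) and can be kept
# as training data rather than thrown away.
state = {"destination": "Tokyo", "depart": "2025-03-01", "passengers": 1}
print(progress_score(state, task))
```

Under this reading, the framework's value is that the per-constraint breakdown both ranks trajectories for data selection and pinpoints which sub-goal an agent missed.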
Scalability
The pipeline is designed to scale, enabling efficient automatic generation of large volumes of high-quality training data without manual annotation of each trajectory.
Performance
The distilled student model demonstrates superior performance compared to existing open-source approaches and matches or exceeds commercial systems, despite being a smaller model.
Demerits
Benchmark Limitation
The new benchmark, BookingArena, is limited to complex booking tasks across 20 popular websites, which may not fully represent the diversity of web interaction tasks.
Generalizability
The generalizability of the proposed method to other types of web tasks and domains remains to be thoroughly evaluated.
Data Quality
The quality of the automatically generated training data may vary, and the effectiveness of the constraint-based evaluation framework in ensuring high-quality data needs further validation.
Expert Commentary
The article marks a significant advance in web agent training by tackling the central difficulty of trajectory evaluation. Its constraint-based framework replaces binary success/failure judgments with fine-grained progress measurement, a novel contribution that allows partially successful trajectories to be salvaged as training data. The scalability of the pipeline and the strong performance of the distilled student model are particularly noteworthy, demonstrating real headroom for improving web agent training. However, generalizability to web tasks and domains beyond booking remains to be thoroughly evaluated, and both the quality of the automatically generated data and the framework's ability to guarantee that quality need further validation. Overall, this work offers a systematic, scalable approach to generating training data and a principled evaluation methodology for complex structured web tasks, with important implications for the practical deployment of web agents.
Recommendations
- Further validation of the constraint-based evaluation framework to ensure the quality of automatically generated training data.
- Thorough evaluation of the generalizability of the proposed method to other types of web tasks and domains.