Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation
arXiv:2602.12544v1 Announce Type: new Abstract: We present a scalable pipeline for automatically generating high-quality training data for web agents. In particular, a major challenge in identifying high-quality training instances is trajectory evaluation - quantifying how much progress was made towards task completion. We introduce a novel constraint-based evaluation framework that provides fine-grained assessment of progress towards task completion. This enables us to leverage partially successful trajectories, which significantly expands the amount of usable training data. We evaluate our method on a new benchmark we propose called BookingArena, which consists of complex booking tasks across 20 popular websites, and demonstrate that our distilled student model outperforms open-source approaches and matches or exceeds commercial systems, while being a significantly smaller model. Our work addresses the challenge of efficiently creating diverse, realistic web interaction datasets and provides a systematic evaluation methodology for complex structured web tasks.
Executive Summary
The article presents an innovative pipeline for generating high-quality training data for web agents, addressing the critical challenge of trajectory evaluation in task completion. The authors introduce a constraint-based evaluation framework that enables fine-grained assessment of progress, allowing the use of partially successful trajectories and significantly expanding the usable training data. The proposed method is evaluated on a new benchmark, BookingArena, which consists of complex booking tasks across 20 popular websites. The distilled student model demonstrates superior performance compared to open-source approaches and matches or exceeds commercial systems, despite being a smaller model. This work contributes to the development of diverse, realistic web interaction datasets and provides a systematic evaluation methodology for complex structured web tasks.
Key Points
- Introduction of a scalable pipeline for automatic generation of high-quality training data for web agents.
- Novel constraint-based evaluation framework for fine-grained assessment of task completion progress.
- Expansion of usable training data by leveraging partially successful trajectories.
- Evaluation on a new benchmark, BookingArena, consisting of complex booking tasks across 20 popular websites.
- Distilled student model outperforms open-source approaches and matches or exceeds commercial systems.
Merits
Innovative Evaluation Framework
The constraint-based evaluation framework provides a novel approach to assessing task completion progress, enabling the use of partially successful trajectories and significantly expanding the training data.
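The abstract does not specify the paper's constraint language or scoring rule, but the core idea of a constraint-based partial-credit evaluator can be illustrated with a minimal sketch. All names, the example task, and the "fraction of constraints satisfied" scoring rule below are assumptions for illustration, not the authors' actual implementation:

```python
# Hypothetical sketch of constraint-based trajectory evaluation:
# decompose a task into verifiable constraints, then score a trajectory
# by the fraction of constraints its end state satisfies. Scores in
# (0, 1) identify partially successful trajectories that a binary
# success/failure judge would discard.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    """One verifiable condition the final state should satisfy."""
    name: str
    check: Callable[[dict], bool]

def progress_score(end_state: dict, constraints: list[Constraint]) -> float:
    """Return the fraction of constraints satisfied (1.0 = full success)."""
    if not constraints:
        return 0.0
    satisfied = sum(c.check(end_state) for c in constraints)
    return satisfied / len(constraints)

# Example: a booking task decomposed into three constraints.
task = [
    Constraint("destination", lambda s: s.get("destination") == "Tokyo"),
    Constraint("date", lambda s: s.get("depart") == "2025-03-01"),
    Constraint("passengers", lambda s: s.get("passengers") == 2),
]
# An agent run that got the destination and date right but not the
# passenger count still earns partial credit (2/3) and can be kept
# as training data rather than thrown away.
state = {"destination": "Tokyo", "depart": "2025-03-01", "passengers": 1}
print(progress_score(state, task))
```

Under this reading, the framework's value is that the per-constraint breakdown both ranks trajectories for data selection and pinpoints which sub-goal an agent missed.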
Scalability
The pipeline is designed to scale, enabling efficient automatic generation of large volumes of high-quality training data without manual annotation of each trajectory.
Performance
The distilled student model demonstrates superior performance compared to existing open-source approaches and matches or exceeds commercial systems, despite being a smaller model.
Demerits
Benchmark Limitation
The new benchmark, BookingArena, is limited to complex booking tasks across 20 popular websites, which may not fully represent the diversity of web interaction tasks.
Generalizability
The generalizability of the proposed method to other types of web tasks and domains remains to be thoroughly evaluated.
Data Quality
The quality of the automatically generated training data may vary, and the effectiveness of the constraint-based evaluation framework in ensuring high-quality data needs further validation.
Expert Commentary
The article marks a significant advance in web agent training by tackling the central difficulty of trajectory evaluation. Its constraint-based framework replaces binary success/failure judgments with fine-grained progress measurement, a novel contribution that allows partially successful trajectories to be salvaged as training data. The scalability of the pipeline and the strong performance of the distilled student model are particularly noteworthy, demonstrating real headroom for improving web agent training. However, generalizability to web tasks and domains beyond booking remains to be thoroughly evaluated, and both the quality of the automatically generated data and the framework's ability to guarantee that quality need further validation. Overall, this work offers a systematic, scalable approach to generating training data and a principled evaluation methodology for complex structured web tasks, with important implications for the practical deployment of web agents.
Recommendations
- Further validation of the constraint-based evaluation framework to ensure the quality of automatically generated training data.
- Thorough evaluation of the generalizability of the proposed method to other types of web tasks and domains.