TimeWarp: Evaluating Web Agents by Revisiting the Past
arXiv:2603.04949v1
Abstract: The improvement of web agents on current benchmarks raises the question: Do today's agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout. TimeWarp consists of three web environments, each with six UI versions spanning different eras of the internet, paired with a set of complex, realistic tasks requiring different forms of web navigation. Our experiments reveal web agents' vulnerability to changes and the limitations of behavior cloning (BC) on single-version trajectories. To address this, we propose TimeTraj, a simple yet effective algorithm that uses plan distillation to collect trajectories across multiple versions. By training agents on teacher rollouts using our BC-variant, we achieve substantial performance gains: $20.4\%\rightarrow37.7\%$ for Qwen-3 4B and $0\%\rightarrow27.0\%$ for Llama-3.1 8B models. We hope our work helps researchers study generalization across web designs and unlock a new paradigm for collecting plans rather than trajectories, thereby improving the robustness of web agents.
Executive Summary
This article introduces TimeWarp, a benchmark that evaluates how well web agents adapt as the web changes. TimeWarp comprises three containerized web environments, each offered in six UI versions that span different eras of the internet, and pairs them with complex, realistic navigation tasks. The experiments show that agents are vulnerable to UI changes and that behavior cloning (BC) on single-version trajectories has limits. To address this, the authors propose TimeTraj, an algorithm that uses plan distillation to collect trajectories across multiple UI versions. Training on the resulting teacher rollouts with the authors' BC variant raises performance from 20.4% to 37.7% for Qwen-3 4B and from 0% to 27.0% for Llama-3.1 8B. The authors also suggest a shift from collecting trajectories to collecting plans.
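To put the reported gains in perspective, the short sketch below computes the absolute gain (in percentage points) and the relative gain for the two before/after figures quoted in the abstract. The function name `gain_summary` and its output keys are illustrative choices, not anything taken from the paper.

```python
# Before/after figures (percent) as quoted in the abstract.
REPORTED = {
    "Qwen-3 4B": (20.4, 37.7),
    "Llama-3.1 8B": (0.0, 27.0),
}


def gain_summary(before: float, after: float) -> dict:
    """Return the absolute gain in percentage points and the relative gain.

    The relative gain is undefined when the starting value is zero,
    so it is reported as None in that case.
    """
    absolute = round(after - before, 1)
    relative = round(after / before, 2) if before > 0 else None
    return {"absolute_pp": absolute, "relative_x": relative}


for model, (before, after) in REPORTED.items():
    print(model, gain_summary(before, after))
```

The Llama-3.1 8B case shows why both views matter: starting from 0%, a relative gain cannot be computed, so only the 27.0-point absolute gain is meaningful.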
Key Points
- ▸ TimeWarp benchmark simulates evolving web environments through containerized UI, design, and layout variations.
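- ▸ Experiments show that web agents are vulnerable to UI changes and that behavior cloning (BC) on single-version trajectories has limits.
- ▸ The TimeTraj algorithm uses plan distillation to collect trajectories across multiple UI versions, which improves generalization across web designs.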
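The benchmark layout described in the abstract (three environments, six UI versions each) can be pictured as an evaluation grid. The sketch below is a minimal illustration under stated assumptions: the environment and version names are placeholders (the abstract does not name them), the result records are made up, and the helper functions are hypothetical rather than part of TimeWarp.

```python
from collections import defaultdict
from itertools import product

# Placeholder names: the abstract does not name the environments or versions.
ENVIRONMENTS = ["env_a", "env_b", "env_c"]
VERSIONS = [f"v{i}" for i in range(1, 7)]


def evaluation_grid():
    """Return every (environment, version) pair: 3 environments x 6 versions."""
    return list(product(ENVIRONMENTS, VERSIONS))


def success_rate_by_version(results):
    """Aggregate (environment, version, success) records into a rate per version."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for _env, version, success in results:
        totals[version] += 1
        wins[version] += int(success)
    return {version: wins[version] / totals[version] for version in totals}


# Made-up outcomes, for illustration only.
toy_results = [
    ("env_a", "v1", True),
    ("env_b", "v1", False),
    ("env_a", "v6", True),
    ("env_b", "v6", True),
]
print(len(evaluation_grid()))
print(success_rate_by_version(toy_results))
```

Reporting a rate per version, instead of one pooled number, is what makes a drop on older or newer designs visible.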
Merits
Strength in Evaluating Web Agents
TimeWarp provides a controlled evaluation framework: three containerized web environments, each in six UI versions, so that agents can be assessed on realistic tasks across different web designs.
Innovative Approach to Plan Distillation
TimeTraj's plan distillation method offers a promising solution for enhancing web agents' robustness and generalization across web designs.
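The abstract says only that TimeTraj "uses plan distillation to collect trajectories across multiple versions"; it does not spell out the procedure. The sketch below is therefore one possible reading, not the authors' algorithm: a single high-level plan is re-executed in every UI version, and only successful rollouts are kept as training data. The `Trajectory` type, `collect_across_versions`, the success filter, and the toy executor are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    task: str
    version: str
    actions: List[str]


def collect_across_versions(
    task: str,
    plan: List[str],
    versions: List[str],
    execute: Callable[[List[str], str], Tuple[List[str], bool]],
) -> List[Trajectory]:
    """Re-execute one high-level plan in each UI version.

    Assumption: rollouts that fail are discarded, so only successful
    version-specific trajectories are kept for training.
    """
    kept = []
    for version in versions:
        actions, success = execute(plan, version)
        if success:
            kept.append(Trajectory(task, version, actions))
    return kept


def toy_execute(plan: List[str], version: str) -> Tuple[List[str], bool]:
    """Stand-in executor: pretends version "v1" has no search box."""
    actions = [f"{version}:{step}" for step in plan]
    success = not (version == "v1" and "use search box" in plan)
    return actions, success


plan = ["open forum", "use search box"]
trajectories = collect_across_versions("find a post", plan, ["v1", "v2", "v3"], toy_execute)
print([t.version for t in trajectories])
```

The point of the sketch is the shape of the data: one plan yields several version-specific trajectories, which is consistent with the abstract's suggestion of collecting plans rather than trajectories.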
Demerits
Limited Generalizability
The findings might not apply directly to all web agents and tasks: the reported gains come from two models (Qwen-3 4B and Llama-3.1 8B) evaluated on TimeWarp's three environments, and may be specific to them.
Emulated Web Design Variations
The UI versions are emulated in containerized environments, so they might not accurately reflect how real-world websites change over time.
Expert Commentary
The study is a timely contribution to the evaluation of web agents. By introducing TimeWarp and TimeTraj, the authors draw attention to how well web agents adapt to changing web environments. The findings suggest that a shift from collecting trajectories to collecting plans may help in building more robust web agents. The study may also have implications for web development, accessibility, and policy-making, and further research is warranted to explore how TimeWarp and TimeTraj apply in real-world settings.
Recommendations
- ✓ Future research should investigate the applicability of TimeWarp and TimeTraj to a broader range of web agents and tasks.
- ✓ Developers and stakeholders should consider incorporating TimeWarp or similar evaluation frameworks into their development pipelines to ensure the adaptability of web agents to changing web environments.
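As a rough idea of what such a pipeline check could look like, the sketch below flags UI versions on which an agent's success rate falls too far below a reference version. The function name, the tolerance value, and the rates are hypothetical; nothing here is part of TimeWarp itself.

```python
def versions_below_tolerance(rates, reference, max_drop):
    """Return versions whose success rate is more than max_drop below the reference."""
    baseline = rates[reference]
    return sorted(
        version
        for version, rate in rates.items()
        if version != reference and baseline - rate > max_drop
    )


# Made-up per-version success rates, for illustration only.
toy_rates = {"v6": 0.40, "v5": 0.35, "v1": 0.10}
flagged = versions_below_tolerance(toy_rates, reference="v6", max_drop=0.1)
print(flagged)
```

A pipeline could fail the build when the returned list is non-empty, which would surface robustness regressions before an agent is deployed.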