MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
arXiv:2602.22638v1

Abstract: Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic Information Retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench .
Executive Summary
This study introduces MobilityBench, a scalable benchmark for evaluating route-planning agents in real-world mobility scenarios. The benchmark is built from large-scale, anonymized user queries collected from Amap and pairs them with a deterministic API-replay sandbox that makes evaluation reproducible. The authors propose a multi-dimensional evaluation protocol and evaluate multiple LLM-based route-planning agents, finding that models handle basic retrieval and routing well but falter on preference-constrained planning, which highlights the need for improvement in personalized mobility applications. The study contributes to the development of effective route-planning systems and underscores the importance of systematic evaluation in real-world mobility settings.
Key Points
- ▸ MobilityBench is a scalable benchmark for evaluating route-planning agents in real-world mobility scenarios
- ▸ The benchmark utilizes large-scale, anonymized user queries collected from Amap
- ▸ A deterministic API-replay sandbox is designed to enable reproducible evaluation
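The paper does not publish the sandbox's internals, but the idea of a deterministic API-replay sandbox can be illustrated with a minimal sketch: responses from the live mapping service are recorded once, keyed by a canonical hash of the endpoint and its parameters, and later replayed verbatim so every evaluation run sees identical tool outputs. All class and endpoint names below (`ReplaySandbox`, `"route/driving"`) are hypothetical, not taken from the MobilityBench toolkit.

```python
import hashlib
import json


class ReplaySandbox:
    """Hypothetical sketch of a deterministic API-replay sandbox:
    serves previously recorded responses keyed by a hash of the
    endpoint and its parameters, eliminating live-service variance."""

    def __init__(self):
        self._cache = {}

    @staticmethod
    def _key(endpoint, params):
        # Canonical JSON (sorted keys) keeps the hash stable
        # regardless of the order the caller passes parameters in.
        payload = json.dumps({"endpoint": endpoint, "params": params},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, endpoint, params, response):
        """Store a live response for later deterministic replay."""
        self._cache[self._key(endpoint, params)] = response

    def call(self, endpoint, params):
        """Replay a recorded response; fail loudly on unseen requests."""
        key = self._key(endpoint, params)
        if key not in self._cache:
            raise KeyError(f"No recorded response for {endpoint}")
        return self._cache[key]


# Record once from the live service, then replay during evaluation.
sandbox = ReplaySandbox()
sandbox.record("route/driving", {"origin": "A", "dest": "B"},
               {"distance_km": 12.4, "duration_min": 23})
# Same request with a different param order hits the same cache entry.
result = sandbox.call("route/driving", {"dest": "B", "origin": "A"})
```

Failing loudly on unseen requests (rather than falling back to the live API) is what preserves reproducibility: the agent can only observe the frozen snapshot.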
Merits
Strength in Methodology
The study employs a robust methodology for evaluating route-planning agents in real-world mobility scenarios, enhancing the reproducibility and reliability of the results
Comprehensive Evaluation Protocol
The authors propose a multi-dimensional evaluation protocol that assesses various aspects of route-planning agents, providing a more comprehensive understanding of their performance
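The abstract names the protocol's dimensions (outcome validity as the primary signal, plus instruction understanding, planning, tool use, and efficiency) but not their scoring formulas. A minimal sketch of how such per-dimension scores might be reported side by side, with illustrative metrics invented here (all field names and formulas are assumptions, not the paper's definitions):

```python
def score_agent(run):
    """Hypothetical per-dimension scoring: outcome validity is the
    primary signal; the other dimensions are reported alongside it
    rather than collapsed into a single number."""
    return {
        # Did the agent produce a valid route? (primary outcome)
        "outcome_validity": float(run["route_valid"]),
        # Illustrative proxies for the complementary dimensions.
        "instruction_understanding": run["intent_match"],
        "planning": run["plan_quality"],
        "tool_use": run["correct_calls"] / max(run["total_calls"], 1),
        "efficiency": min(1.0, run["budget_steps"] / max(run["steps"], 1)),
    }


# One evaluation run with made-up measurements.
run = {"route_valid": True, "intent_match": 0.9, "plan_quality": 0.8,
       "correct_calls": 7, "total_calls": 8,
       "budget_steps": 10, "steps": 12}
scores = score_agent(run)
```

Keeping the dimensions separate, as the protocol does, lets an analysis distinguish an agent that misreads the user's intent from one that understands it but wastes tool calls getting there.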
Demerits
Limited Scope
The study focuses on LLM-based route-planning agents and may not be applicable to other types of route-planning systems
Data Dependence
The results may be dependent on the quality and diversity of the data used to construct the MobilityBench benchmark
Expert Commentary
The study introduces MobilityBench, a significant contribution to the field of route-planning research. The benchmark's scalability and reproducibility enable a more comprehensive evaluation of route-planning agents, which is essential for developing effective systems. The study's findings highlight the limitations of current LLM-based route-planning agents and underscore the need for further research in this area. The implications of this study are far-reaching, with the potential to inform policy decisions and improve transportation systems worldwide.
Recommendations
- ✓ Future research should focus on developing more effective LLM-based route-planning agents that cater to diverse user needs and preferences
- ✓ Transportation agencies and urban planners should consider incorporating systematic evaluation protocols into their decision-making processes to ensure the development of effective transportation systems