MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
arXiv:2602.22638v1

Abstract: Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic Information Retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench .
Executive Summary
This study introduces MobilityBench, a scalable benchmark for evaluating route-planning agents in real-world mobility scenarios. The benchmark is built from large-scale, anonymized user queries collected from Amap and pairs them with a deterministic API-replay sandbox that makes evaluation reproducible. The authors propose a multi-dimensional evaluation protocol and evaluate multiple LLM-based route-planning agents, finding that models handle basic retrieval and routing well but falter on preference-constrained planning, which highlights the need for improvement in personalized mobility applications. The study contributes to the development of effective route-planning systems and underscores the importance of systematic evaluation in real-world mobility settings.
Key Points
- ▸ MobilityBench is a scalable benchmark for evaluating route-planning agents in real-world mobility scenarios
- ▸ The benchmark utilizes large-scale, anonymized user queries collected from Amap
- ▸ A deterministic API-replay sandbox is designed to enable reproducible evaluation
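The paper does not publish the sandbox's internals, but the idea of a deterministic API-replay sandbox can be illustrated with a minimal sketch: responses from the live mapping service are recorded once, keyed by a canonical hash of the endpoint and its parameters, and later replayed verbatim so every evaluation run sees identical tool outputs. All class and endpoint names below (`ReplaySandbox`, `"route/driving"`) are hypothetical, not taken from the MobilityBench toolkit.

```python
import hashlib
import json


class ReplaySandbox:
    """Hypothetical sketch of a deterministic API-replay sandbox:
    serves previously recorded responses keyed by a hash of the
    endpoint and its parameters, eliminating live-service variance."""

    def __init__(self):
        self._cache = {}

    @staticmethod
    def _key(endpoint, params):
        # Canonical JSON (sorted keys) keeps the hash stable
        # regardless of the order the caller passes parameters in.
        payload = json.dumps({"endpoint": endpoint, "params": params},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, endpoint, params, response):
        """Store a live response for later deterministic replay."""
        self._cache[self._key(endpoint, params)] = response

    def call(self, endpoint, params):
        """Replay a recorded response; fail loudly on unseen requests."""
        key = self._key(endpoint, params)
        if key not in self._cache:
            raise KeyError(f"No recorded response for {endpoint}")
        return self._cache[key]


# Record once from the live service, then replay during evaluation.
sandbox = ReplaySandbox()
sandbox.record("route/driving", {"origin": "A", "dest": "B"},
               {"distance_km": 12.4, "duration_min": 23})
# Same request with a different param order hits the same cache entry.
result = sandbox.call("route/driving", {"dest": "B", "origin": "A"})
```

Failing loudly on unseen requests (rather than falling back to the live API) is what preserves reproducibility: the agent can only observe the frozen snapshot.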
Merits
Strength in Methodology
The study employs a robust methodology for evaluating route-planning agents in real-world mobility scenarios, enhancing the reproducibility and reliability of the results
Comprehensive Evaluation Protocol
The authors propose a multi-dimensional evaluation protocol that assesses various aspects of route-planning agents, providing a more comprehensive understanding of their performance
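The abstract names the protocol's dimensions (outcome validity as the primary signal, plus instruction understanding, planning, tool use, and efficiency) but not their scoring formulas. A minimal sketch of how such per-dimension scores might be reported side by side, with illustrative metrics invented here (all field names and formulas are assumptions, not the paper's definitions):

```python
def score_agent(run):
    """Hypothetical per-dimension scoring: outcome validity is the
    primary signal; the other dimensions are reported alongside it
    rather than collapsed into a single number."""
    return {
        # Did the agent produce a valid route? (primary outcome)
        "outcome_validity": float(run["route_valid"]),
        # Illustrative proxies for the complementary dimensions.
        "instruction_understanding": run["intent_match"],
        "planning": run["plan_quality"],
        "tool_use": run["correct_calls"] / max(run["total_calls"], 1),
        "efficiency": min(1.0, run["budget_steps"] / max(run["steps"], 1)),
    }


# One evaluation run with made-up measurements.
run = {"route_valid": True, "intent_match": 0.9, "plan_quality": 0.8,
       "correct_calls": 7, "total_calls": 8,
       "budget_steps": 10, "steps": 12}
scores = score_agent(run)
```

Keeping the dimensions separate, as the protocol does, lets an analysis distinguish an agent that misreads the user's intent from one that understands it but wastes tool calls getting there.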
Demerits
Limited Scope
The study focuses on LLM-based route-planning agents and may not be applicable to other types of route-planning systems
Data Dependence
The results may be dependent on the quality and diversity of the data used to construct the MobilityBench benchmark
Expert Commentary
The study introduces MobilityBench, a significant contribution to the field of route-planning research. The benchmark's scalability and reproducibility enable a more comprehensive evaluation of route-planning agents, which is essential for developing effective systems. The study's findings highlight the limitations of current LLM-based route-planning agents and underscore the need for further research in this area. The implications of this study are far-reaching, with the potential to inform policy decisions and improve transportation systems worldwide.
Recommendations
- ✓ Future research should focus on developing more effective LLM-based route-planning agents that cater to diverse user needs and preferences
- ✓ Transportation agencies and urban planners should consider incorporating systematic evaluation protocols into their decision-making processes to ensure the development of effective transportation systems