AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
arXiv:2602.14257v1 Announce Type: new Abstract: While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed around the real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents' capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench provides a realistic benchmark for evaluating and improving advertising and marketing agents; the leaderboard and code can be found at https://github.com/Emanual20/adbench-leaderboard.
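For readers unfamiliar with the metric, Pass@k is the probability that at least one of k sampled attempts solves a task. The abstract does not say how AD-Bench estimates it; below is a minimal sketch assuming the standard unbiased estimator, with made-up per-task attempt counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n attempts of which c succeeded, is correct.
    (math.comb returns 0 when k exceeds n - c, so the edge case is handled.)"""
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical aggregation over a benchmark: (n_attempts, n_correct) per task.
results = [(5, 3), (5, 0), (5, 5)]
for k in (1, 3):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {score:.3f}")
```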
Executive Summary
The article introduces AD-Bench, a benchmark designed to evaluate the performance of Large Language Model (LLM) agents in real-world advertising and marketing analytics. Unlike existing benchmarks that rely on idealized simulations, AD-Bench is built from real user requests and requires multi-round, multi-tool interactions. The benchmark categorizes tasks into three difficulty levels and provides expert-verified reference answers and reference tool-call trajectories. Experiments with Gemini-3-Pro reveal significant performance gaps, particularly in complex scenarios (overall Pass@1 of 68.0% falls to 49.4% on L3 tasks), highlighting the need for further improvement in LLM agents for specialized domains.
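The abstract reports a trajectory coverage of 70.1% on L3 without defining the metric; one plausible reading is the fraction of the expert's reference tool calls that appear somewhere in the agent's trajectory. A minimal sketch under that assumption (the function name, matching rule, and tool names are all hypothetical, not AD-Bench's actual implementation):

```python
def trajectory_coverage(reference: list[str], agent: list[str]) -> float:
    """Fraction of reference tool calls covered by the agent's trajectory.

    Hypothetical metric: matches on tool name only, ignoring call order
    and arguments; the benchmark's real matching rule may be stricter.
    """
    if not reference:
        return 1.0
    agent_calls = set(agent)
    covered = sum(1 for call in reference if call in agent_calls)
    return covered / len(reference)

# Example with made-up tool names:
ref = ["fetch_campaign_stats", "segment_audience", "compute_roi"]
got = ["fetch_campaign_stats", "compute_roi", "plot_trend"]
print(trajectory_coverage(ref, got))  # 0.667: 2 of 3 reference calls covered
```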
Key Points
- ▸ AD-Bench is designed to evaluate LLM agents in real-world advertising and marketing analytics.
- ▸ The benchmark includes real user requests and categorizes tasks into three difficulty levels (L1-L3); a sketch of a possible task schema follows this list.
- ▸ Experiments show that state-of-the-art models like Gemini-3-Pro have substantial capability gaps in complex scenarios.
- ▸ AD-Bench provides a realistic benchmark for evaluating and improving advertising and marketing agents.
- ▸ The benchmark is available publicly, including a leaderboard and code.
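To make the benchmark's structure concrete, here is a hypothetical sketch of what a single AD-Bench item could look like, inferred only from the abstract (all field names and values are assumptions, not the released format):

```python
from dataclasses import dataclass, field

@dataclass
class ADBenchTask:
    """Hypothetical shape of one benchmark item, inferred from the abstract;
    the released dataset's actual schema may differ."""
    request: str                  # real user marketing analysis request
    difficulty: str               # "L1", "L2", or "L3"
    reference_answer: str         # expert-verified, verifiable answer
    reference_trajectory: list[str] = field(default_factory=list)  # expert tool-call sequence

task = ADBenchTask(
    request="Which of my campaigns had the best ROI last quarter?",
    difficulty="L2",
    reference_answer="Campaign B, with an ROI of 3.4.",
    reference_trajectory=["fetch_campaign_stats", "compute_roi", "rank_campaigns"],
)
```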
Merits
Real-World Relevance
AD-Bench addresses a critical gap in evaluating LLM agents by focusing on real-world business requirements in advertising and marketing analytics.
Comprehensive Evaluation
The benchmark categorizes tasks into three difficulty levels, providing a nuanced assessment of agents' capabilities.
Transparency and Accessibility
The availability of the leaderboard and code ensures transparency and encourages further research and improvement.
Demerits
Limited Scope
The benchmark is specific to advertising and marketing analytics, which may limit its applicability to other domains.
Performance Gaps
The significant performance drops in complex scenarios indicate that current LLM agents are not yet fully capable in specialized domains.
Dependence on Expert Input
The benchmark relies on domain experts for verifiable reference answers, which could introduce bias or subjectivity.
Expert Commentary
The introduction of AD-Bench marks a significant step forward in the evaluation of LLM agents for real-world applications. The benchmark's focus on real-world business requirements and its comprehensive evaluation framework provide a robust tool for assessing the capabilities of AI agents in specialized domains. However, the performance gaps identified in complex scenarios underscore the need for continued research and development. The reliance on domain experts for reference answers, while ensuring accuracy, also introduces potential biases that need to be addressed. Overall, AD-Bench sets a new standard for evaluating AI agents in advertising and marketing analytics, and its public availability will undoubtedly foster further advancements in the field.
Recommendations
- ✓ Expand the benchmark to include a broader range of domains beyond advertising and marketing analytics.
- ✓ Develop methods to mitigate potential biases introduced by domain experts in the reference answers.
- ✓ Encourage further research to address the performance gaps identified in complex scenarios, particularly in multi-round, multi-tool interactions.