AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
arXiv:2602.14257v1 Announce Type: new Abstract: While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed around the real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents' capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench provides a realistic benchmark for evaluating and improving advertising and marketing agents; the leaderboard and code can be found at https://github.com/Emanual20/adbench-leaderboard.
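For readers unfamiliar with the metric, Pass@k is the probability that at least one of k sampled attempts solves a task. The abstract does not say how AD-Bench estimates it; below is a minimal sketch assuming the standard unbiased estimator, with made-up per-task attempt counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n attempts of which c succeeded, is correct.
    (math.comb returns 0 when k exceeds n - c, so the edge case is handled.)"""
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical aggregation over a benchmark: (n_attempts, n_correct) per task.
results = [(5, 3), (5, 0), (5, 5)]
for k in (1, 3):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {score:.3f}")
```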
Executive Summary
The article introduces AD-Bench, a benchmark designed to evaluate the performance of Large Language Model (LLM) agents in real-world advertising and marketing analytics. Unlike existing benchmarks that rely on idealized simulations, AD-Bench is built from real user requests and requires multi-round, multi-tool interactions. The benchmark categorizes tasks into three difficulty levels and provides expert-verified reference answers and reference tool-call trajectories. Experiments with Gemini-3-Pro reveal significant performance gaps, particularly in complex scenarios (overall Pass@1 of 68.0% falls to 49.4% on L3 tasks), highlighting the need for further improvement in LLM agents for specialized domains.
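The abstract reports a trajectory coverage of 70.1% on L3 without defining the metric; one plausible reading is the fraction of the expert's reference tool calls that appear somewhere in the agent's trajectory. A minimal sketch under that assumption (the function name, matching rule, and tool names are all hypothetical, not AD-Bench's actual implementation):

```python
def trajectory_coverage(reference: list[str], agent: list[str]) -> float:
    """Fraction of reference tool calls covered by the agent's trajectory.

    Hypothetical metric: matches on tool name only, ignoring call order
    and arguments; the benchmark's real matching rule may be stricter.
    """
    if not reference:
        return 1.0
    agent_calls = set(agent)
    covered = sum(1 for call in reference if call in agent_calls)
    return covered / len(reference)

# Example with made-up tool names:
ref = ["fetch_campaign_stats", "segment_audience", "compute_roi"]
got = ["fetch_campaign_stats", "compute_roi", "plot_trend"]
print(trajectory_coverage(ref, got))  # 0.667: 2 of 3 reference calls covered
```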
Key Points
- ▸ AD-Bench is designed to evaluate LLM agents in real-world advertising and marketing analytics.
- ▸ The benchmark includes real user requests and categorizes tasks into three difficulty levels (L1-L3); a sketch of a possible task schema follows this list.
- ▸ Experiments show that state-of-the-art models like Gemini-3-Pro have substantial capability gaps in complex scenarios.
- ▸ AD-Bench provides a realistic benchmark for evaluating and improving advertising and marketing agents.
- ▸ The benchmark is available publicly, including a leaderboard and code.
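To make the benchmark's structure concrete, here is a hypothetical sketch of what a single AD-Bench item could look like, inferred only from the abstract (all field names and values are assumptions, not the released format):

```python
from dataclasses import dataclass, field

@dataclass
class ADBenchTask:
    """Hypothetical shape of one benchmark item, inferred from the abstract;
    the released dataset's actual schema may differ."""
    request: str                  # real user marketing analysis request
    difficulty: str               # "L1", "L2", or "L3"
    reference_answer: str         # expert-verified, verifiable answer
    reference_trajectory: list[str] = field(default_factory=list)  # expert tool-call sequence

task = ADBenchTask(
    request="Which of my campaigns had the best ROI last quarter?",
    difficulty="L2",
    reference_answer="Campaign B, with an ROI of 3.4.",
    reference_trajectory=["fetch_campaign_stats", "compute_roi", "rank_campaigns"],
)
```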
Merits
Real-World Relevance
AD-Bench addresses a critical gap in evaluating LLM agents by focusing on real-world business requirements in advertising and marketing analytics.
Comprehensive Evaluation
The benchmark categorizes tasks into three difficulty levels, providing a nuanced assessment of agents' capabilities.
Transparency and Accessibility
The availability of the leaderboard and code ensures transparency and encourages further research and improvement.
Demerits
Limited Scope
The benchmark is specific to advertising and marketing analytics, which may limit its applicability to other domains.
Performance Gaps
The significant performance drops in complex scenarios indicate that current LLM agents are not yet fully capable in specialized domains.
Dependence on Expert Input
The benchmark relies on domain experts for verifiable reference answers, which could introduce bias or subjectivity.
Expert Commentary
The introduction of AD-Bench marks a significant step forward in the evaluation of LLM agents for real-world applications. The benchmark's focus on real-world business requirements and its comprehensive evaluation framework provide a robust tool for assessing the capabilities of AI agents in specialized domains. However, the performance gaps identified in complex scenarios underscore the need for continued research and development. The reliance on domain experts for reference answers, while ensuring accuracy, also introduces potential biases that need to be addressed. Overall, AD-Bench sets a new standard for evaluating AI agents in advertising and marketing analytics, and its public availability will undoubtedly foster further advancements in the field.
Recommendations
- ✓ Expand the benchmark to include a broader range of domains beyond advertising and marketing analytics.
- ✓ Develop methods to mitigate potential biases introduced by domain experts in the reference answers.
- ✓ Encourage further research to address the performance gaps identified in complex scenarios, particularly in multi-round, multi-tool interactions.