DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent
arXiv:2603.01152v1 Announce Type: new Abstract: Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale, challenging dataset specifically designed for deep-research scenarios, built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9,000 questions spanning three difficulty levels from L1 to L3, (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent, and (3) verifiable answers. Furthermore, we develop an open-source training framework, DeepResearch-R1, that supports (1) multi-turn web interactions, (2) different reinforcement learning (RL) approaches, and (3) different reward models, such as rule-based outcome rewards and LLM-as-judge feedback. Finally, empirical results demonstrate that agents trained on DeepResearch-9K under our DeepResearch-R1 achieve state-of-the-art results on challenging deep-research benchmarks. We release the DeepResearch-9K dataset at https://huggingface.co/datasets/artillerywu/DeepResearch-9K and the code of DeepResearch-R1 at https://github.com/Applied-Machine-Learning-Lab/DeepResearch-R1.
Executive Summary
The article introduces DeepResearch-9K, a large-scale dataset for deep-research agents, and DeepResearch-R1, an open-source training framework. The dataset consists of 9,000 questions across three difficulty levels, high-quality search trajectories with reasoning chains, and verifiable answers, while the framework supports multi-turn web interactions, multiple reinforcement learning approaches, and different reward models. The authors demonstrate state-of-the-art results on challenging deep-research benchmarks and release both the dataset and the framework for public use.
Key Points
- ▸ Introduction of the DeepResearch-9K dataset with 9,000 questions and high-quality search trajectories
- ▸ Development of DeepResearch-R1, an open-source training framework for deep-research agents
- ▸ Empirical results demonstrating state-of-the-art performance on challenging deep-research benchmarks
Merits
Comprehensive Dataset
DeepResearch-9K provides a large-scale and challenging dataset for deep-research agents, addressing the lack of real-world difficulty in existing datasets.
Flexible Training Framework
DeepResearch-R1 supports various reinforcement learning approaches and reward models, allowing for flexible and adaptable training of deep-research agents.
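Of the reward models the framework supports, the rule-based outcome reward is the simplest to illustrate. A common formulation for verifiable-answer QA is a normalized exact-match check between the agent's final answer and the gold answer; the sketch below is a minimal illustration of that idea, not the paper's actual implementation, whose details are not given here:

```python
import re
import string

def rule_based_outcome_reward(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized prediction matches the gold answer, else 0.0.

    Normalization (lowercasing, punctuation and article removal, whitespace
    collapsing) follows common QA exact-match conventions; the dataset's
    verifiable answers may use a different matching rule.
    """
    def normalize(text: str) -> str:
        text = text.lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
        return " ".join(text.split())  # collapse whitespace

    return 1.0 if normalize(prediction) == normalize(gold) else 0.0
```

A binary outcome reward like this is cheap and unambiguous, which is why it is often paired with LLM-as-judge feedback: the rule-based signal anchors training on verifiable answers, while the judge handles answers that are correct but phrased differently.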
Demerits
Limited Generalizability
The dataset and framework may not generalize well to other domains or tasks, potentially limiting their applicability and transferability.
Expert Commentary
The introduction of DeepResearch-9K and DeepResearch-R1 marks a significant step forward in the development of deep-research agents. The comprehensive dataset and flexible training framework provide a solid foundation for advancing the capabilities of these agents. However, further research is needed to address the limitations and challenges associated with explainability, transparency, and generalizability. The implications of this work are far-reaching, with applications in various domains where deep-research agents can provide valuable insights and support.
Recommendations
- ✓ Future research should focus on improving the explainability and transparency of deep-research agents
- ✓ The development of more diverse and generalizable datasets and frameworks is necessary to advance the field