DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent
arXiv:2603.01152v1 Announce Type: new Abstract: Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale, challenging dataset specifically designed for deep-research scenarios, built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9,000 questions spanning three difficulty levels from L1 to L3, (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent, and (3) verifiable answers. Furthermore, we develop an open-source training framework, DeepResearch-R1, that supports (1) multi-turn web interactions, (2) different reinforcement learning (RL) approaches, and (3) different reward models, such as rule-based outcome rewards and LLM-as-judge feedback. Finally, empirical results demonstrate that agents trained on DeepResearch-9K under our DeepResearch-R1 achieve state-of-the-art results on challenging deep-research benchmarks. We release the DeepResearch-9K dataset at https://huggingface.co/datasets/artillerywu/DeepResearch-9K and the code of DeepResearch-R1 at https://github.com/Applied-Machine-Learning-Lab/DeepResearch-R1.
Executive Summary
The article introduces DeepResearch-9K, a large-scale dataset for deep-research agents, and DeepResearch-R1, an open-source training framework. The dataset consists of 9,000 questions across three difficulty levels, high-quality search trajectories with reasoning chains, and verifiable answers, while the framework supports multi-turn web interactions, multiple reinforcement learning approaches, and different reward models. The authors demonstrate state-of-the-art results on challenging deep-research benchmarks and release both the dataset and the framework for public use.
Key Points
- ▸ Introduction of the DeepResearch-9K dataset with 9,000 questions and high-quality search trajectories
- ▸ Development of DeepResearch-R1, an open-source training framework for deep-research agents
- ▸ Empirical results demonstrating state-of-the-art performance on challenging deep-research benchmarks
Merits
Comprehensive Dataset
DeepResearch-9K provides a large-scale and challenging dataset for deep-research agents, addressing the lack of real-world difficulty in existing datasets.
Flexible Training Framework
DeepResearch-R1 supports various reinforcement learning approaches and reward models, allowing for flexible and adaptable training of deep-research agents.
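Of the reward models the framework supports, the rule-based outcome reward is the simplest to illustrate. A common formulation for verifiable-answer QA is a normalized exact-match check between the agent's final answer and the gold answer; the sketch below is a minimal illustration of that idea, not the paper's actual implementation, whose details are not given here:

```python
import re
import string

def rule_based_outcome_reward(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized prediction matches the gold answer, else 0.0.

    Normalization (lowercasing, punctuation and article removal, whitespace
    collapsing) follows common QA exact-match conventions; the dataset's
    verifiable answers may use a different matching rule.
    """
    def normalize(text: str) -> str:
        text = text.lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
        return " ".join(text.split())  # collapse whitespace

    return 1.0 if normalize(prediction) == normalize(gold) else 0.0
```

A binary outcome reward like this is cheap and unambiguous, which is why it is often paired with LLM-as-judge feedback: the rule-based signal anchors training on verifiable answers, while the judge handles answers that are correct but phrased differently.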
Demerits
Limited Generalizability
The dataset and framework may not generalize well to other domains or tasks, potentially limiting their applicability and transferability.
Expert Commentary
The introduction of DeepResearch-9K and DeepResearch-R1 marks a significant step forward in the development of deep-research agents. The comprehensive dataset and flexible training framework provide a solid foundation for advancing the capabilities of these agents. However, further research is needed to address the limitations and challenges associated with explainability, transparency, and generalizability. The implications of this work are far-reaching, with applications in various domains where deep-research agents can provide valuable insights and support.
Recommendations
- ✓ Future research should focus on improving the explainability and transparency of deep-research agents
- ✓ The development of more diverse and generalizable datasets and frameworks is necessary to advance the field