How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

arXiv:2602.19526v1 Announce Type: new

Abstract: Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the EM due to training collapse driven by answer avoidance; this can be mitigated by incorporating action-level penalties, ultimately surpassing EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among policy optimization methods. Building on these insights, we then introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.

Executive Summary

This article presents a systematic study of training Deep Research agents with reinforcement learning (RL) in the Search-R1 paradigm. The authors decouple the RL process into three dimensions: prompt template, reward function, and policy optimization. They find that the Fast Thinking template outperforms the Slow Thinking template, that the F1-based reward underperforms the exact-match (EM) reward due to training collapse driven by answer avoidance (though action-level penalties mitigate this and ultimately let F1 surpass EM), and that REINFORCE outperforms Proximal Policy Optimization (PPO) while requiring fewer search actions. Building on these findings, the authors introduce Search-R1++, a strong baseline that improves on Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and from 0.289 to 0.331 (Qwen2.5-3B). The study contributes to more principled and reliable RL training strategies for Deep Research systems.
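The reward-function finding is concrete enough to sketch. Below is a minimal, hypothetical illustration of the three reward variants discussed (exact match, token-level F1, and F1 with an action-level penalty). The function names, normalization, and penalty form are assumptions for illustration, not the paper's exact implementation.

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # Simplified normalization; real QA scoring also strips punctuation/articles.
    return text.lower().split()

def em_reward(pred: str, gold: str) -> float:
    # Exact match: 1.0 only when the normalized answers agree.
    return float(normalize(pred) == normalize(gold))

def f1_reward(pred: str, gold: str) -> float:
    # Token-level F1: partial credit for overlapping tokens.
    pred_toks, gold_toks = normalize(pred), normalize(gold)
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def penalized_f1_reward(pred: str, gold: str, num_searches: int,
                        answered: bool, penalty: float = 0.1) -> float:
    # Hypothetical action-level penalty: discourage answer avoidance
    # (searching indefinitely without committing to an answer) by docking
    # each search action and penalizing the no-answer case outright.
    if not answered:
        return -penalty
    return f1_reward(pred, gold) - penalty * num_searches
```

The intuition matches the paper's diagnosis: plain F1 lets the policy collect partial credit while avoiding decisive answers, whereas a per-action cost makes endless searching strictly unprofitable.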

Key Points

  • The Fast Thinking prompt template yields better performance and stability than the Slow Thinking template used in prior work
  • The F1-based reward underperforms the exact-match (EM) reward due to answer avoidance, but surpasses EM once action-level penalties are added
  • REINFORCE outperforms PPO while requiring fewer search actions, and GRPO shows the poorest stability of the policy optimization methods studied
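To make the REINFORCE result concrete, here is a minimal sketch of the critic-free REINFORCE update with a running-mean baseline, run on a toy three-armed bandit. The environment, seed, and hyperparameters are illustrative assumptions and do not reflect the Search-R1 setup.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                       # policy parameters over 3 actions
true_reward = np.array([0.1, 0.9, 0.3])    # arm 1 is best
lr, baseline = 0.2, 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)
    r = true_reward[a] + rng.normal(0, 0.1)    # noisy scalar reward
    baseline = 0.9 * baseline + 0.1 * r        # running-mean baseline
    # REINFORCE: grad of log pi(a) w.r.t. logits is one_hot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += lr * (r - baseline) * grad_logp
```

Unlike PPO, there is no learned value network; a simple running-mean baseline reduces variance, which is one plausible reason a lighter-weight method can be more stable in this setting.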

Merits

Strength in Methodology

The study employs a systematic approach, decoupling the RL process into three dimensions and providing a comprehensive evaluation of different prompt templates, reward functions, and policy optimization methods.

Insights into RL Training

The study provides valuable insights into the role of RL in training deep research agents, highlighting the importance of prompt templates, reward functions, and policy optimization methods in achieving better performance and stability.

Demerits

Limited Dataset

The study is limited to a specific dataset, which may not be representative of all deep research tasks and may not generalize well to other domains.

Dependence on Hyperparameters

The performance of the model is heavily dependent on the choice of hyperparameters, which can be challenging to optimize and may require further research.

Expert Commentary

This study makes a significant contribution to Deep Research by systematically evaluating prompt templates, reward functions, and policy optimization methods in RL training. The findings show how each of these factors shapes performance and stability in Deep Research systems. The study is not without limitations, however: further work is needed on hyperparameter tuning and on generalization beyond the evaluated datasets. Nevertheless, its conclusions have clear implications for designing more effective and reliable RL training strategies.

Recommendations

  • Future studies should investigate the transferability of the authors' findings to other deep research tasks and domains.
  • Researchers should explore the development of more explainable RL training strategies that can provide insights into the role of different prompt templates, reward functions, and policy optimization methods in achieving better performance and stability.
