Improving Search Agent with One Line of Code
arXiv:2603.10069v1 Announce Type: new Abstract: Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to autonomously interact with external tools in multi-turn information-seeking processes. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose **S**earch **A**gent **P**olicy **O**ptimization (**SAPO**), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities, where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only a one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves a **+10.6% absolute improvement** (+31.5% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).
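To make the failure mode concrete, here is a minimal PyTorch sketch of how one might monitor the token-level importance sampling ratios whose collapse signals ISDD. The tensor names, the mask convention, and the 0.5 alarm threshold are illustrative assumptions, not from the paper.

```python
import torch

def mean_importance_ratio(logp_new: torch.Tensor,
                          logp_old: torch.Tensor,
                          mask: torch.Tensor) -> float:
    """Mean token-level ratio r_t = pi_theta(a_t|s_t) / pi_old(a_t|s_t).

    Under ISDD this statistic declines precipitously: once most ratios
    fall below the clip range, gradient updates are nullified and
    training fails irreversibly.
    """
    ratio = torch.exp(logp_new - logp_old)         # (batch, seq_len)
    return (ratio * mask).sum().item() / mask.sum().item()

# Hypothetical usage inside a GRPO training loop:
# if mean_importance_ratio(logp_new, logp_old, response_mask) < 0.5:
#     print("warning: importance ratios collapsing -- possible ISDD")
```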
Executive Summary
The article proposes SAPO, a one-line code modification to standard GRPO that addresses Importance Sampling Distribution Drift (ISDD), a critical training instability in Tool-based Agentic Reinforcement Learning (TARL). SAPO stabilizes training via a conditional token-level KL constraint, selectively penalizing the KL divergence between the current and old policies. The result is a +10.6% absolute (+31.5% relative) improvement over Search-R1 across seven QA benchmarks, with consistent gains across model scales and families. Because the change is a single line, SAPO is immediately deployable and has clear potential to improve search agents in multi-turn information-seeking processes.
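A minimal sketch of what the "one line" might look like inside a GRPO-style token loss, assuming the conditional KL term is estimated per token as log pi_old − log pi_theta and gated on positive-advantage, low-probability tokens. The gate threshold `low_prob`, the coefficient `beta`, and all tensor names are assumptions for illustration; the authors' released code may differ.

```python
import torch

def sapo_token_loss(logp_new, logp_old, adv, mask,
                    clip_eps: float = 0.2,
                    beta: float = 0.1,
                    low_prob: float = 0.5):
    """GRPO-style clipped surrogate loss plus SAPO's conditional KL penalty.

    All tensors are (batch, seq_len); `logp_old` comes from the rollout
    policy and carries no gradient; `adv` holds group-relative advantages
    broadcast over response tokens.
    """
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    loss = -surrogate

    # Conditional constraint: penalize drift only on positive-advantage
    # tokens whose current probability has fallen too low, pulling
    # pi_theta back toward pi_old there while leaving gradient flow on
    # all other tokens untouched.
    drifted = (adv > 0) & (logp_new.exp() < low_prob)
    loss = loss + beta * drifted.float() * (logp_old - logp_new)  # the added line

    return (loss * mask).sum() / mask.sum()
```

Concentrating the penalty on these drifted tokens is, per the abstract, what prevents distribution drift without the gradient-nullifying effect of hard clipping.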
Key Points
- ▸ SAPO addresses the critical training instability of ISDD in TARL.
- ▸ SAPO stabilizes training via a conditional token-level KL constraint.
- ▸ SAPO achieves a +10.6% absolute (+31.5% relative) improvement over Search-R1 across seven QA benchmarks.
Merits
- ▸ Strength: immediate deployability and ease of implementation.
- ▸ Improved performance: consistent gains across varying model scales and families.
- ▸ Robustness: stabilizes training via a conditional token-level KL constraint.
Demerits
- ▸ Limitation: the proposed solution may not generalize to other TARL algorithms.
- ▸ Computational cost: implementation assumes access to significant computational resources.
Expert Commentary
The article presents a well-structured and well-executed solution to a critical training instability in TARL. SAPO delivers a significant improvement in search-agent performance while remaining immediately deployable. Its generalizability to other TARL algorithms and its dependence on computational resources remain open questions; nevertheless, SAPO is a significant contribution to the field, with clear potential to improve search agents in multi-turn information-seeking processes.
Recommendations
- ✓ Further research is needed to investigate the generalizability of SAPO to other TARL algorithms.
- ✓ Implementation of SAPO in real-world scenarios is necessary to assess its practical feasibility.