Improving Search Agent with One Line of Code
arXiv:2603.10069v1 Announce Type: new Abstract: Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to autonomously interact with external tools in multi-turn information-seeking processes. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose **S**earch **A**gent **P**olicy **O**ptimization (**SAPO**), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities, where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only a one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves a **+10.6% absolute improvement** (+31.5% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).
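To make the failure mode concrete, here is a minimal PyTorch sketch of how one might monitor the token-level importance sampling ratios whose collapse signals ISDD. The tensor names, the mask convention, and the 0.5 alarm threshold are illustrative assumptions, not from the paper.

```python
import torch

def mean_importance_ratio(logp_new: torch.Tensor,
                          logp_old: torch.Tensor,
                          mask: torch.Tensor) -> float:
    """Mean token-level ratio r_t = pi_theta(a_t|s_t) / pi_old(a_t|s_t).

    Under ISDD this statistic declines precipitously: once most ratios
    fall below the clip range, gradient updates are nullified and
    training fails irreversibly.
    """
    ratio = torch.exp(logp_new - logp_old)         # (batch, seq_len)
    return (ratio * mask).sum().item() / mask.sum().item()

# Hypothetical usage inside a GRPO training loop:
# if mean_importance_ratio(logp_new, logp_old, response_mask) < 0.5:
#     print("warning: importance ratios collapsing -- possible ISDD")
```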
Executive Summary
The article proposes SAPO, a one-line code modification to standard GRPO that addresses Importance Sampling Distribution Drift (ISDD), a critical training instability in Tool-based Agentic Reinforcement Learning (TARL). SAPO stabilizes training via a conditional token-level KL constraint, selectively penalizing the KL divergence between the current and old policies. The result is a +10.6% absolute (+31.5% relative) improvement over Search-R1 across seven QA benchmarks, with consistent gains across model scales and families. Because the change is a single line, SAPO is immediately deployable and has clear potential to improve search agents in multi-turn information-seeking processes.
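A minimal sketch of what the "one line" might look like inside a GRPO-style token loss, assuming the conditional KL term is estimated per token as log pi_old − log pi_theta and gated on positive-advantage, low-probability tokens. The gate threshold `low_prob`, the coefficient `beta`, and all tensor names are assumptions for illustration; the authors' released code may differ.

```python
import torch

def sapo_token_loss(logp_new, logp_old, adv, mask,
                    clip_eps: float = 0.2,
                    beta: float = 0.1,
                    low_prob: float = 0.5):
    """GRPO-style clipped surrogate loss plus SAPO's conditional KL penalty.

    All tensors are (batch, seq_len); `logp_old` comes from the rollout
    policy and carries no gradient; `adv` holds group-relative advantages
    broadcast over response tokens.
    """
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    loss = -surrogate

    # Conditional constraint: penalize drift only on positive-advantage
    # tokens whose current probability has fallen too low, pulling
    # pi_theta back toward pi_old there while leaving gradient flow on
    # all other tokens untouched.
    drifted = (adv > 0) & (logp_new.exp() < low_prob)
    loss = loss + beta * drifted.float() * (logp_old - logp_new)  # the added line

    return (loss * mask).sum() / mask.sum()
```

Concentrating the penalty on these drifted tokens is, per the abstract, what prevents distribution drift without the gradient-nullifying effect of hard clipping.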
Key Points
- ▸ SAPO addresses the critical training instability of ISDD in TARL.
- ▸ SAPO stabilizes training via a conditional token-level KL constraint.
- ▸ SAPO achieves a +10.6% absolute (+31.5% relative) improvement over Search-R1 across seven QA benchmarks.
Merits
- ▸ Strength: immediate deployability and ease of implementation.
- ▸ Improved performance: consistent gains across varying model scales and families.
- ▸ Robustness: stabilizes training via a conditional token-level KL constraint.
Demerits
- ▸ Limitation: the proposed solution may not generalize to other TARL algorithms.
- ▸ Computational cost: implementation assumes access to significant computational resources.
Expert Commentary
The article presents a well-structured and well-executed solution to a critical training instability in TARL. SAPO delivers a significant improvement in search-agent performance while remaining immediately deployable. Its generalizability to other TARL algorithms and its dependence on computational resources remain open questions; nevertheless, SAPO is a significant contribution to the field, with clear potential to improve search agents in multi-turn information-seeking processes.
Recommendations
- ✓ Further research is needed to investigate the generalizability of SAPO to other TARL algorithms.
- ✓ Implementation of SAPO in real-world scenarios is necessary to assess its practical feasibility.