
RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

arXiv:2603.03078v1 Announce Type: new Abstract: Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model-based (LLM) agents. These works can empower LLM agents to tackle complex tasks via multi-step, tool-integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a pure on-policy paradigm for exploration, restricting exploration to the agent's self-generated outputs and preventing the discovery of new reasoning perspectives for further improvement. While recent efforts incorporate auxiliary off-policy signals to enhance exploration, they typically utilize full off-policy trajectories for trajectory-level policy estimation, overlooking the fine-grained, step-level exploratory dynamics within agentic rollout. In this paper, we revisit exploration in Agentic RL and propose Retrieval-Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. To achieve this, we decompose the Agentic RL training process into two phases: (i) Hybrid-policy Agentic Rollout, and (ii) Retrieval-aware Policy Optimization. Specifically, we propose a Hybrid-policy Agentic Rollout strategy, which allows the agents to continuously reason over the retrieved off-policy step-level traces. It dynamically extends the reasoning receptive field of agents, enabling broader exploration conditioned on external behaviors. Subsequently, we introduce the Retrieval-aware Policy Optimization mechanism, which calibrates the policy gradient estimation with retrieval reward and importance shaping, stabilizing training and prioritizing retrieval-illuminating exploration. Extensive experiments show that RAPO achieves a +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while delivering 1.2x faster training efficiency.

Executive Summary

This study proposes Retrieval-Augmented Policy Optimization (RAPO), a reinforcement learning framework that expands exploration for large language model-based (LLM) agents. RAPO addresses a key limitation of existing Agentic RL methods, their purely on-policy exploration, by introducing retrieval into the training loop. The framework decomposes training into two phases: Hybrid-policy Agentic Rollout and Retrieval-aware Policy Optimization. The first phase lets agents reason over retrieved off-policy step-level traces, dynamically extending their reasoning receptive field and enabling broader exploration; the second calibrates policy gradient estimation with a retrieval reward and importance shaping. RAPO outperforms existing methods by 5.0% on average across fourteen datasets spanning three agentic reasoning tasks, while training roughly 1.2x faster. This study contributes to the development of more effective and efficient LLM agents for complex tasks.
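The abstract does not include pseudocode, but the hybrid-policy rollout idea, interleaving the agent's own reasoning steps with retrieved off-policy step traces, can be illustrated with a minimal sketch. The `policy_step` and `retrieve_step` callables, the `mix_prob` mixing probability, and the string-based context are all illustrative assumptions, not the authors' actual interfaces:

```python
import random

def hybrid_rollout(policy_step, retrieve_step, task, max_steps=8, mix_prob=0.3):
    """Hypothetical sketch of a hybrid-policy agentic rollout.

    policy_step(context) generates the agent's next on-policy reasoning
    step; retrieve_step(context) fetches a relevant off-policy step-level
    trace. Both are assumed interfaces for illustration only.
    """
    trajectory = []
    context = task
    for _ in range(max_steps):
        # Occasionally condition the agent on a retrieved off-policy step
        # instead of relying purely on its own self-generated outputs.
        if random.random() < mix_prob:
            step = retrieve_step(context)   # off-policy exploration
            source = "retrieved"
        else:
            step = policy_step(context)     # on-policy exploration
            source = "on_policy"
        trajectory.append((source, step))
        context = context + " " + step      # extend the reasoning context
    return trajectory
```

In this toy version the retrieved step is simply appended to the context, standing in for the paper's idea of broadening exploration conditioned on external behaviors.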

Key Points

  • RAPO introduces retrieval to enhance exploration in Agentic RL
  • The framework consists of two phases: Hybrid-policy Agentic Rollout and Retrieval-aware Policy Optimization
  • RAPO achieves significant improvements in performance and training efficiency
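The second phase, Retrieval-aware Policy Optimization, is described as calibrating the policy gradient estimate with a retrieval reward and importance shaping. A hypothetical per-step sketch of that calibration follows; the function name, the clipping scheme, and all constants are assumptions for illustration, not the paper's actual formulation:

```python
def shaped_advantage(base_advantage, is_retrieved, importance_ratio,
                     retrieval_bonus=0.1, clip=0.2):
    """Hypothetical sketch of retrieval-aware advantage calibration.

    Retrieved off-policy steps receive a small retrieval-reward bonus
    (prioritizing retrieval-illuminating exploration) and a clipped
    importance ratio (importance shaping) so off-policy corrections
    cannot destabilize the gradient estimate.
    """
    if is_retrieved:
        # Importance shaping: bound the off-policy correction term.
        ratio = max(1.0 - clip, min(1.0 + clip, importance_ratio))
        # Retrieval reward: mildly favor retrieval-driven steps.
        return ratio * (base_advantage + retrieval_bonus)
    # On-policy steps keep their ordinary advantage.
    return base_advantage
```

For example, an on-policy step with advantage 1.0 is left unchanged, while a retrieved step with an extreme importance ratio of 5.0 is clipped to 1.2 before scaling the bonus-augmented advantage.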

Merits

Strength in Exploration

RAPO's retrieval-augmented exploration lets agents condition on external off-policy traces, helping them discover new reasoning perspectives and tackle complex tasks more effectively.

Improved Training Efficiency

RAPO's reported 1.2x training speedup lowers the cost of experimentation, enabling researchers to explore more complex tasks and scenarios in LLM agent development.

Demerits

Limited to LLM Agents

RAPO's focus on LLM agents may limit its applicability to other agent types, such as robotic or physical agents.

Dependence on Retrieval

RAPO's reliance on retrieval may add computational overhead at every rollout step and require significant resources to build, maintain, and query the store of off-policy traces.

Expert Commentary

The study's focus on exploration in Agentic RL is timely and relevant, given the increasing complexity of tasks that LLM agents are expected to perform. RAPO's innovative approach to retrieval and policy optimization has the potential to significantly impact the field of LLM agent development. However, further research is needed to fully understand the limitations and dependencies of RAPO. Additionally, its applicability to other agent types and its potential impact on real-world applications require careful consideration.

Recommendations

  • Future research should explore the application of RAPO to other agent types and its potential impact on real-world scenarios
  • Developers should investigate the computational overhead and resources required for RAPO, and explore techniques to mitigate them
