Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
arXiv:2602.20722v1 Announce Type: new Abstract: Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee on policy improvement. Extensive experiments demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve.
Executive Summary
This paper proposes a novel off-policy reinforcement learning framework, Batch Adaptation Policy Optimization (BAPO), to enhance data efficiency in large language model post-training. BAPO dynamically selects training batches, re-evaluating difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee on policy improvement. In extensive experiments, BAPO outperforms traditional on-policy frameworks, resolving 40.7% of problems that base models consistently fail to solve. These results suggest a practical route for large language models to learn from hard problems more efficiently during post-training.
Key Points
- ▸ BAPO is an off-policy reinforcement learning framework that improves data efficiency in large language model post-training.
- ▸ BAPO dynamically selects training batches, re-evaluating historically difficult samples and reusing high-quality ones.
- ▸ BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks, and resolves 40.7% of problems that base models consistently fail to solve.
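The batch-selection idea described above can be sketched in code. This is a minimal illustrative sketch, not the paper's actual implementation: the class names (`Sample`, `ExperienceBuffer`), the reward thresholds, and the reuse ratio are all assumptions. It shows one plausible way a buffer could re-evaluate historically difficult samples while reusing mixed-outcome (informative) ones, which avoids the reward-homogeneity problem where all-correct or all-wrong rollout groups yield near-zero advantage signal:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Sample:
    """A prompt with its stored rollout rewards (verifiable, in [0, 1])."""
    prompt: str
    rewards: list = field(default_factory=list)

    @property
    def mean_reward(self) -> float:
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

class ExperienceBuffer:
    """Illustrative buffer: reuse informative samples, re-evaluate hard ones.

    Thresholds are hypothetical, not taken from the paper.
    """

    def __init__(self, hard_threshold=0.2, easy_threshold=0.9):
        self.samples = []
        self.hard_threshold = hard_threshold  # at or below: "historically difficult"
        self.easy_threshold = easy_threshold  # at or above: reward-homogeneous, skip

    def add(self, sample: Sample) -> None:
        self.samples.append(sample)

    def compose_batch(self, batch_size: int, reuse_ratio=0.5, rng=random):
        # Mixed-outcome samples carry useful advantage signal; reuse them.
        informative = [s for s in self.samples
                       if self.hard_threshold < s.mean_reward < self.easy_threshold]
        # Difficult samples get re-evaluated under the current policy.
        difficult = [s for s in self.samples
                     if s.mean_reward <= self.hard_threshold]
        n_reuse = min(int(batch_size * reuse_ratio), len(informative))
        batch = rng.sample(informative, n_reuse)
        batch += rng.sample(difficult, min(batch_size - n_reuse, len(difficult)))
        return batch
```

Under this sketch, a sample whose rollouts are all correct is dropped from batch composition entirely, which is one way to read the paper's claim of reducing experience waste.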
Merits
Strength in Addressing Experience Waste
BAPO directly addresses the experience-waste issue of traditional on-policy frameworks, substantially improving data efficiency in large language model post-training.
Demerits
Potential Overreliance on Historical Samples
BAPO's reliance on historical samples may limit its ability to adapt to rapidly changing environments or new tasks.
Expert Commentary
BAPO marks a notable advance in reinforcement learning for large language models. By addressing the experience-waste and reward-homogeneity issues of traditional on-policy frameworks, it lets models extract more learning signal from difficult samples rather than discarding them after a single pass. Its reliance on historical samples may pose limitations, since stored rollouts can become stale as the policy shifts, but the data-efficiency gains reported here are substantial. As researchers refine BAPO, further advances in post-training efficiency for complex reasoning tasks are likely to follow.
Recommendations
- ✓ Future research should focus on adapting BAPO to real-world applications, where complex tasks and decisions are often involved.
- ✓ Researchers should also explore ways to mitigate the limitations of BAPO, such as its potential overreliance on historical samples.