Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
arXiv:2602.20722v1 Announce Type: new Abstract: Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinders learning …
Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun
23 views