Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance
arXiv:2602.24110v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large …
Yanwei Ren, Haotian Zhang, Likang Xiao, Xikai Zhang, Jiaxing Huang, Jiayan Qiu, Baosheng Yu, Quan Chen, Liu Liu
3 views