Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

arXiv:2602.24110v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision suffers from a critical limitation: it penalizes trajectories that are largely correct but fail due to a few missteps as heavily as completely erroneous ones. This coarse feedback signal causes the model to discard valuable, largely correct rollouts, degrading rollout diversity and prematurely narrowing the exploration space. While Process Reward Models have demonstrated efficacy in providing reliable step-wise verification for test-time scaling, naively integrating these signals into RLVR as dense rewards proves ineffective. Prior methods introduce off-policy guided whole-trajectory replacement, which often falls outside the policy model's distribution; they still fail to utilize the largely correct rollouts generated by the model itself and thus do not effectively mitigate the narrowing of the exploration space. To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained, step-wise off-policy rectification. By applying precise refinement to partially correct rollouts, our method effectively salvages partially correct trajectories and increases the diversity score by 13.5%, thereby sustaining a broad exploration space. Extensive experiments demonstrate that our approach establishes new state-of-the-art results, achieving an average accuracy of 46.6% on math reasoning and exhibiting robust generalization with 53.4% accuracy on out-of-distribution reasoning tasks.

Executive Summary

This article proposes SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework for enhancing the exploration capabilities of Large Reasoning Models in the context of Reinforcement Learning from Verifiable Rewards (RLVR). SCOPE addresses the limitation of standard outcome-based supervision by utilizing Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applying fine-grained, step-wise off-policy rectification. The approach effectively salvages partially correct trajectories and increases diversity score, sustaining a broad exploration space. Experimental results demonstrate the efficacy of SCOPE, achieving state-of-the-art results in math reasoning and out-of-distribution reasoning tasks.

Key Points

  • SCOPE addresses the limitation of standard outcome-based supervision in RLVR
  • SCOPE utilizes Process Reward Models to refine partially correct rollouts
  • SCOPE increases diversity score and sustains a broad exploration space
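The core salvage mechanism described above, keeping the verified prefix of a rollout and rectifying only from the first erroneous step onward, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the PRM score threshold of 0.5, and the `correction_fn` guidance interface are all assumptions.

```python
def first_erroneous_step(step_scores, threshold=0.5):
    """Return the index of the first step whose PRM score falls below
    the threshold, or None if every step passes verification.
    (Hypothetical interface; the real PRM scoring is model-specific.)"""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None

def salvage_rollout(steps, step_scores, correction_fn, threshold=0.5):
    """Keep the on-policy prefix verified by the PRM and regenerate only
    the suffix, starting at the first erroneous step."""
    idx = first_erroneous_step(step_scores, threshold)
    if idx is None:
        return steps  # fully correct rollout: nothing to salvage
    prefix = steps[:idx]
    # Off-policy guidance supplies a corrected continuation conditioned
    # on the verified prefix (e.g., from a stronger guide model).
    return prefix + correction_fn(prefix)

# Toy usage: the PRM flags step 2, so only the suffix is replaced.
steps = ["parse problem", "set up equation", "algebra slip", "wrong answer"]
scores = [0.9, 0.8, 0.2, 0.1]
fixed = salvage_rollout(steps, scores, lambda p: ["correct algebra", "answer"])
print(fixed)  # ['parse problem', 'set up equation', 'correct algebra', 'answer']
```

The key design point is that the correction is step-wise rather than whole-trajectory: the model's own verified prefix stays in-distribution, which is what lets SCOPE preserve rollout diversity rather than replacing trajectories wholesale.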

Merits

Strength in addressing exploration degradation

SCOPE effectively salvages partially correct trajectories, preventing premature narrowing of the exploration space

Robustness and generalization

SCOPE achieves state-of-the-art results in math reasoning and demonstrates robust generalization with high accuracy on out-of-distribution reasoning tasks

Demerits

Limited scalability

The effectiveness of SCOPE may be limited by the computational resources required for Process Reward Models and fine-grained off-policy rectification

Dependence on Process Reward Models

SCOPE relies on Process Reward Models, which may not be readily available or easily trained for all domains

Expert Commentary

The article presents a novel and compelling approach to addressing the exploration degradation issue in RLVR. SCOPE's utilization of Process Reward Models and fine-grained off-policy rectification demonstrates a deep understanding of the challenges in RLVR and the importance of refining partially correct rollouts. The experimental results are impressive, and the approach has significant implications for the development of more robust and generalizable reasoning systems. However, the article could benefit from a more detailed discussion of the computational resources required for SCOPE and the potential limitations of Process Reward Models.

Recommendations

  • Future research should focus on scaling SCOPE to larger and more complex domains
  • Investigation into the transferability of SCOPE to other domains and applications is warranted
