CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
arXiv:2603.10101v1

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR relies solely on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides more robust cross-trajectory regularization than the single-path supervision of vanilla RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform gains in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.
Executive Summary
The article introduces CLIPO, a Contrastive Learning mechanism for Policy Optimization designed to generalize Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). CLIPO addresses a key limitation of RLVR, its reliance on final answers alone as rewards, by optimizing a contrastive loss over successful rollouts, thereby capturing the invariant structure shared across correct reasoning paths. This cross-trajectory regularization mitigates step-level reasoning inconsistencies and hallucinatory artifacts, improving the generalization and robustness of LLMs. Experiments show CLIPO consistently outperforming RLVR baselines across diverse reasoning benchmarks.
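The abstract does not specify the exact form of CLIPO's contrastive loss. As a minimal sketch of one plausible instantiation, the snippet below computes an InfoNCE-style loss over pooled embeddings of successful rollouts, treating rollouts that solve the same prompt as positives and all others as negatives; the function name, the grouping-by-prompt scheme, and the temperature are illustrative assumptions, not details from the paper:

```python
import numpy as np

def info_nce_loss(embs, labels, tau=0.1):
    """Illustrative InfoNCE-style contrastive loss over rollout embeddings.

    embs:   (N, d) array, one row per successful rollout's pooled hidden state.
    labels: (N,) ints; rollouts solving the same prompt share a label
            (positives), rollouts for other prompts act as negatives.
    tau:    softmax temperature (assumed hyperparameter).
    """
    # Cosine similarity matrix, temperature-scaled.
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = embs @ embs.T / tau
    np.fill_diagonal(sim, -np.inf)  # exclude trivial self-pairs

    # Row-wise log-softmax over all other rollouts as candidates.
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    # Average negative log-probability of each rollout's positives.
    losses = []
    for i in range(len(embs)):
        pos = (labels == labels[i]) & (np.arange(len(embs)) != i)
        if pos.any():
            losses.append(-logp[i, pos].mean())
    return float(np.mean(losses))
```

Pulling same-prompt successful rollouts together in embedding space while pushing other rollouts away is one concrete way to encourage the "invariant structure shared across correct reasoning paths" described in the abstract.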
Key Points
- ▸ CLIPO incorporates Contrastive Learning into Policy Optimization to improve RLVR
- ▸ The approach captures the invariant structure of correct reasoning paths
- ▸ CLIPO mitigates step-level reasoning inconsistencies and hallucinatory artifacts
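Read together, these points suggest a training objective in which the contrastive term regularizes the standard RLVR policy loss. The abstract gives no weighting or exact form, so the following is a hedged sketch in which $\lambda$ is an assumed hyperparameter:

```latex
\mathcal{L}_{\text{CLIPO}}(\theta)
  = \mathcal{L}_{\text{RLVR}}(\theta)
  + \lambda \, \mathcal{L}_{\text{contrast}}(\theta),
\qquad
\mathcal{L}_{\text{contrast}}
  = -\frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}}
    \log \frac{\exp\!\big(\mathrm{sim}(z_i, z_{i^+}) / \tau\big)}
              {\sum_{j \neq i} \exp\!\big(\mathrm{sim}(z_i, z_j) / \tau\big)}
```

Here $\mathcal{S}$ denotes the set of successful rollouts, $z_i$ a representation of rollout $i$, $i^{+}$ a positive (another successful rollout for the same prompt), $\mathrm{sim}$ a similarity function, and $\tau$ a temperature; all symbols are assumptions for illustration.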
Merits
Improved Generalization
CLIPO's contrastive learning mechanism enables LLMs to generalize better across diverse reasoning benchmarks
Robustness
The approach effectively suppresses hallucinatory artifacts, leading to more robust policy optimization
Demerits
Computational Complexity
The incorporation of contrastive learning may increase computational requirements
Limited Exploration
CLIPO's focus on successful rollouts might limit exploration of alternative reasoning paths
Expert Commentary
The introduction of CLIPO marks a meaningful advance in RLVR, addressing the long-standing problem of hallucination and answer-copying in LLMs trained on outcome-only rewards. By leveraging contrastive learning across successful trajectories, CLIPO offers a more robust and generalizable approach to policy optimization, and the reported results support its effectiveness across reasoning benchmarks. Further research is needed, however, to characterize its computational overhead and to determine whether restricting the contrastive signal to successful rollouts curtails exploration of alternative reasoning paths.
Recommendations
- ✓ Further investigation into the computational requirements and potential optimizations of CLIPO
- ✓ Exploration of CLIPO's applications in high-stakes decision-making domains, such as healthcare and finance