CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
arXiv:2603.10101v1

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR relies solely on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides more robust cross-trajectory regularization than the single-path supervision of vanilla RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform gains in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.
Executive Summary
The article introduces CLIPO, a Contrastive Learning mechanism for Policy Optimization designed to generalize Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models (LLMs). CLIPO addresses a key limitation of RLVR, its reliance on final answers alone as rewards, by optimizing a contrastive loss over successful rollouts, thereby capturing the invariant structure shared across correct reasoning paths. This cross-trajectory regularization mitigates step-level reasoning inconsistencies and hallucinatory artifacts, improving the generalization and robustness of LLMs. Experiments show CLIPO consistently outperforming RLVR baselines across diverse reasoning benchmarks.
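The abstract does not specify the exact form of CLIPO's contrastive loss. As a minimal sketch of one plausible instantiation, the snippet below computes an InfoNCE-style loss over pooled embeddings of successful rollouts, treating rollouts that solve the same prompt as positives and all others as negatives; the function name, the grouping-by-prompt scheme, and the temperature are illustrative assumptions, not details from the paper:

```python
import numpy as np

def info_nce_loss(embs, labels, tau=0.1):
    """Illustrative InfoNCE-style contrastive loss over rollout embeddings.

    embs:   (N, d) array, one row per successful rollout's pooled hidden state.
    labels: (N,) ints; rollouts solving the same prompt share a label
            (positives), rollouts for other prompts act as negatives.
    tau:    softmax temperature (assumed hyperparameter).
    """
    # Cosine similarity matrix, temperature-scaled.
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = embs @ embs.T / tau
    np.fill_diagonal(sim, -np.inf)  # exclude trivial self-pairs

    # Row-wise log-softmax over all other rollouts as candidates.
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    # Average negative log-probability of each rollout's positives.
    losses = []
    for i in range(len(embs)):
        pos = (labels == labels[i]) & (np.arange(len(embs)) != i)
        if pos.any():
            losses.append(-logp[i, pos].mean())
    return float(np.mean(losses))
```

Pulling same-prompt successful rollouts together in embedding space while pushing other rollouts away is one concrete way to encourage the "invariant structure shared across correct reasoning paths" described in the abstract.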
Key Points
- ▸ CLIPO incorporates Contrastive Learning into Policy Optimization to improve RLVR
- ▸ The approach captures the invariant structure of correct reasoning paths
- ▸ CLIPO mitigates step-level reasoning inconsistencies and hallucinatory artifacts
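Read together, these points suggest a training objective in which the contrastive term regularizes the standard RLVR policy loss. The abstract gives no weighting or exact form, so the following is a hedged sketch in which $\lambda$ is an assumed hyperparameter:

```latex
\mathcal{L}_{\text{CLIPO}}(\theta)
  = \mathcal{L}_{\text{RLVR}}(\theta)
  + \lambda \, \mathcal{L}_{\text{contrast}}(\theta),
\qquad
\mathcal{L}_{\text{contrast}}
  = -\frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}}
    \log \frac{\exp\!\big(\mathrm{sim}(z_i, z_{i^+}) / \tau\big)}
              {\sum_{j \neq i} \exp\!\big(\mathrm{sim}(z_i, z_j) / \tau\big)}
```

Here $\mathcal{S}$ denotes the set of successful rollouts, $z_i$ a representation of rollout $i$, $i^{+}$ a positive (another successful rollout for the same prompt), $\mathrm{sim}$ a similarity function, and $\tau$ a temperature; all symbols are assumptions for illustration.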
Merits
Improved Generalization
CLIPO's contrastive learning mechanism enables LLMs to generalize better across diverse reasoning benchmarks
Robustness
The approach effectively suppresses hallucinatory artifacts, leading to more robust policy optimization
Demerits
Computational Complexity
The incorporation of contrastive learning may increase computational requirements
Limited Exploration
CLIPO's focus on successful rollouts might limit exploration of alternative reasoning paths
Expert Commentary
The introduction of CLIPO marks a meaningful advance in RLVR, addressing the long-standing problem of hallucination and answer-copying in LLMs trained on outcome-only rewards. By leveraging contrastive learning across successful trajectories, CLIPO offers a more robust and generalizable approach to policy optimization, and the reported results support its effectiveness across reasoning benchmarks. Further research is needed, however, to characterize its computational overhead and to determine whether restricting the contrastive signal to successful rollouts curtails exploration of alternative reasoning paths.
Recommendations
- ✓ Further investigation into the computational requirements and potential optimizations of CLIPO
- ✓ Exploration of CLIPO's applications in high-stakes decision-making domains, such as healthcare and finance