Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
arXiv:2602.21420v1 Announce Type: cross
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, …