Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards

arXiv:2602.18037v1 Announce Type: cross Abstract: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training to flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter regions with higher reward accuracy. We further improve on this by proposing to use explicit GR with an efficient finite-difference estimate. Empirically, GR performs better than a KL penalty across a diverse set of RL experiments with LMs. GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.
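The KL-penalty baseline the abstract refers to is typically applied as a per-token reward shaping term. A minimal sketch of that shaping is below; the function name, the toy inputs, and the value of `beta` are illustrative assumptions, not the paper's code.

```python
import numpy as np

def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Per-token KL-penalized reward commonly used in RLHF:
    r_t - beta * (log pi(a_t|s_t) - log pi_ref(a_t|s_t)).
    Keeps the policy close to the reference model pi_ref."""
    kl_term = np.asarray(logp_policy) - np.asarray(logp_ref)
    return np.asarray(reward) - beta * kl_term

# Toy example: two tokens, policy drifts from the reference on the first one.
r = kl_shaped_reward([1.0, 0.0], [-0.5, -1.0], [-0.7, -1.0], beta=0.5)
print(r)  # → [0.9 0. ]
```

The shaped reward shrinks whenever the policy assigns higher log-probability than the reference, which is the constraint the paper argues GR can replace.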

Executive Summary

This article proposes gradient regularization (GR) as an alternative to the standard KL penalty for preventing reward hacking in Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR). The authors derive a theoretical connection between reward accuracy and the flatness of an optimum, showing that GR biases training towards flatter regions where the reward model remains accurate. Empirically, GR outperforms a KL penalty across a diverse set of RL experiments with Language Models, achieving a higher GPT-judged win-rate in RLHF and avoiding reward hacking in rule-based and LLM-as-a-Judge math tasks.

Key Points

  • Gradient regularization (GR) mitigates reward hacking in RLHF and RLVR by biasing policy updates towards flatter regions where the reward model is more accurate.
  • The authors derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence, and show that Reference Resets of the KL penalty implicitly exploit this effect.
  • Empirically, GR outperforms a KL penalty: it achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.
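The finite-difference estimate the abstract mentions can be sketched as follows: penalizing the squared gradient norm requires the Hessian-vector product H·g, which a single extra gradient evaluation approximates. The quadratic toy loss, `lam`, and `eps` below are illustrative assumptions; this is a sketch of the general technique, not the paper's implementation.

```python
import numpy as np

def loss_grad(theta, A):
    # Gradient of the toy loss L(theta) = 0.5 * theta^T A theta.
    return A @ theta

def gr_update_direction(theta, A, lam=0.1, eps=1e-4):
    """Gradient of the GR objective L(theta) + lam * ||grad L(theta)||^2.
    The term grad ||g||^2 = 2 H g is estimated with one extra
    gradient evaluation via finite differences."""
    g = loss_grad(theta, A)
    g_norm = np.linalg.norm(g)
    if g_norm == 0.0:
        return g
    u = g / g_norm  # unit vector along the gradient
    hvp = (loss_grad(theta + eps * u, A) - g) / eps * g_norm  # ≈ H g
    return g + 2.0 * lam * hvp

# Sanity check against the analytic gradient A theta + 2 lam A^2 theta.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
theta = np.array([1.0, -2.0])
fd = gr_update_direction(theta, A)
exact = A @ theta + 2 * 0.1 * A @ (A @ theta)
print(np.allclose(fd, exact, atol=1e-3))  # → True
```

The extra cost is one additional gradient evaluation per update, which is what makes the finite-difference form practical compared with exact second-order terms.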

Merits

Strength

The article provides a novel and theoretically grounded approach to preventing reward hacking, which is a significant problem in RLHF and RLVR.

Strength

The empirical results are robust and demonstrate the effectiveness of GR in improving reward accuracy and preventing reward hacking.

Demerits

Limitation

The approach assumes that reward model accuracy correlates with the flatness of the optimum, a relationship that may not hold for every reward model in real-world RL applications.

Limitation

Although the authors propose an efficient finite-difference estimate, they do not provide a detailed analysis of its computational overhead (roughly one extra gradient evaluation per update), which may be a concern in large-scale RL applications.

Expert Commentary

The article makes a significant contribution to the field of RL: it reframes reward hacking as a problem of keeping policy updates in regions where the reward is accurate, rather than merely constraining distance to a reference model, and demonstrates the effectiveness of GR empirically. The main caveats are the assumption that flatter optima coincide with higher reward accuracy, which may not hold in all real-world settings, and the absence of a detailed computational-cost analysis for GR at scale. Despite these limitations, the findings have clear implications for building more robust and reliable RL systems.

Recommendations

  • Future research should focus on developing more robust and efficient gradient regularization techniques that can be applied to large-scale RL applications.
  • RL practitioners should be aware of the limitations of existing approaches to preventing reward hacking and consider using GR in their RL systems.
