Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
arXiv:2602.18037v1 Announce Type: cross Abstract: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are two key steps in the post-training of modern language …
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama