One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
arXiv:2603.03291v1 Announce Type: cross
Abstract: Reward Models (RMs) are crucial for the online alignment of language models (LMs) with human preferences. However, RM-based preference tuning is vulnerable …
Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber