SteerRM: Debiasing Reward Models via Sparse Autoencoders
arXiv:2603.12795v1 Announce Type: new Abstract: Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses …
Mengyuan Sun, Zhuohao Yu, Weizheng Gu, Shikun Zhang, Wei Ye
9 views