Automatically Finding Reward Model Biases
arXiv:2602.15222v1 Announce Type: new Abstract: Reward models are central to large language model (LLM) post-training. However, past work has shown that they can reward spurious …
Atticus Wang, Iván Arcuschin, Arthur Conmy