Atticus Wang, Iván Arcuschin, Arthur Conmy

Articles by Atticus Wang, Iván Arcuschin, Arthur Conmy

Academic · 1 min

Automatically Finding Reward Model Biases

arXiv:2602.15222v1. Abstract: Reward models are central to large language model (LLM) post-training. However, past work has shown that they can reward spurious …

Atticus Wang, Iván Arcuschin, Arthur Conmy