
Academic · 1 min

Automatically Finding Reward Model Biases

arXiv:2602.15222v1 Announce Type: new Abstract: Reward models are central to large language model (LLM) post-training. However, past work has shown that they can reward spurious …

Atticus Wang, Iván Arcuschin, Arthur Conmy
Academic · 1 min

Closing the Distribution Gap in Adversarial Training for LLMs

arXiv:2602.15238v1 Announce Type: new Abstract: Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant …

Chengzhi Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn
Academic · 1 min

Fast and Effective On-policy Distillation from Reasoning Prefixes

arXiv:2602.15260v1 Announce Type: new Abstract: On-policy distillation (OPD), which samples trajectories from the student model and supervises them with a teacher at the token level, …
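The abstract sketches the core OPD loop: sample a trajectory from the student, then supervise each token position with the teacher's distribution. A minimal illustrative sketch of that idea, with toy stand-in models (the function names, vocabulary size, and fixed-logit "models" here are all hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 5  # toy vocabulary size (assumption for illustration)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def sample_trajectory(student_logits_fn, length):
    """On-policy step: sample a token sequence from the student itself."""
    tokens = []
    for _ in range(length):
        probs = softmax(student_logits_fn(tokens))
        tokens.append(int(rng.choice(VOCAB, p=probs)))
    return tokens

def token_level_kl(student_logits_fn, teacher_logits_fn, tokens):
    """Average per-token KL(teacher || student) along the student's trajectory."""
    total = 0.0
    for t in range(len(tokens)):
        prefix = tokens[:t]
        p = softmax(teacher_logits_fn(prefix))  # teacher's target distribution
        q = softmax(student_logits_fn(prefix))  # student's distribution
        total += float(np.sum(p * (np.log(p) - np.log(q))))
    return total / len(tokens)

# Toy stand-in "models": fixed logits, independent of the prefix.
student = lambda prefix: np.zeros(VOCAB)                 # uniform student
teacher = lambda prefix: np.arange(VOCAB, dtype=float)   # peaked teacher

traj = sample_trajectory(student, length=8)
loss = token_level_kl(student, teacher, traj)
```

In a real setup the loss would be differentiated with respect to the student's parameters; here the point is only the structure: trajectories come from the student (on-policy), while supervision is dense, one teacher distribution per token position.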

Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman