Fast and Effective On-policy Distillation from Reasoning Prefixes
arXiv:2602.15260v1 Announce Type: new Abstract: On-policy distillation (OPD), which samples trajectories from the student model and supervises them with a teacher at the token level, …
Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman
17 views