Fast and Effective On-policy Distillation from Reasoning Prefixes
arXiv:2602.15260v1 Announce Type: new Abstract: On-policy distillation (OPD), which samples trajectories from the student model and supervises them with a teacher at the token level, avoids relying solely on verifiable terminal rewards and can yield better generalization than off-policy distillation. However, OPD requires expensive on-the-fly sampling of the student policy during training, which substantially increases training cost, especially for long responses. Our initial analysis shows that, during OPD, training signals are often concentrated in the prefix of each output, and that even a short teacher-generated prefix can significantly help the student produce the correct answer. Motivated by these observations, we propose a simple yet effective modification of OPD: we apply the distillation objective only to prefixes of student-generated outputs and terminate each sampling early during distillation. Experiments on a suite of AI-for-Math and out-of-domain benchmarks show that on-policy prefix distillation matches the performance of full OPD while reducing training FLOP by 2x-47x.
Executive Summary
The article proposes a modification to on-policy distillation (OPD): applying the distillation objective only to prefixes of student-generated outputs and terminating each sampled rollout early during distillation. This approach, on-policy prefix distillation, substantially reduces training cost while maintaining performance. Experiments on AI-for-Math and out-of-domain benchmarks show that the method reduces training FLOPs by 2x-47x without compromising accuracy. The design rests on the observation that, during OPD, training signals are concentrated in the prefix of each output, and that even a short teacher-generated prefix can significantly help the student reach the correct answer. This makes the method a practical option for accelerating model development, particularly in resource-constrained environments.
Key Points
- ▸ On-policy distillation (OPD) is a method that samples trajectories from the student model and supervises them with a teacher at the token level.
- ▸ The proposed modification, on-policy prefix distillation, applies the distillation objective only to prefixes of student-generated outputs and terminates each sampling early during distillation.
- ▸ Experiments demonstrate that on-policy prefix distillation reduces training FLOPs by 2x-47x without compromising accuracy.
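The prefix-restricted objective described above can be sketched as a token-level distillation loss that is simply masked to the first `prefix_len` tokens of a student rollout. The sketch below is illustrative only: the function name, the choice of forward KL, and the NumPy setting are assumptions, and the paper's exact objective and implementation may differ.

```python
import numpy as np

def prefix_distillation_loss(student_logits, teacher_logits, prefix_len):
    """Token-level KL(teacher || student), averaged over only the first
    `prefix_len` positions of a student-sampled trajectory.

    student_logits, teacher_logits: (seq_len, vocab_size) arrays of logits
    scored on the same student-generated token sequence.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    # Restrict supervision to the prefix; later tokens receive no signal.
    s = log_softmax(student_logits[:prefix_len])
    t = log_softmax(teacher_logits[:prefix_len])

    # Per-token KL divergence, summed over the vocabulary,
    # then averaged over the supervised prefix positions.
    kl_per_token = (np.exp(t) * (t - s)).sum(axis=-1)
    return kl_per_token.mean()
```

In a full training loop the student would also stop sampling at the prefix boundary, so the positions beyond `prefix_len` are never generated at all; the mask shown here only illustrates where the loss applies.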
Merits
Efficiency
The proposed modification substantially reduces training cost by distilling only on output prefixes and terminating each sampled rollout early during distillation.
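A back-of-envelope model makes the saving concrete: if decoding cost has a fixed per-token component plus an attention term that grows with the KV cache, truncating rollouts cuts sampling FLOPs superlinearly. The constants and sequence lengths below are purely illustrative and are not taken from the paper; the reported 2x-47x range depends on model size, rollout lengths, and implementation.

```python
def decode_cost(gen_len, per_token=2.0, attn_per_pair=0.001):
    """Rough FLOP model for autoregressively decoding gen_len tokens:
    a fixed cost per token (projections/MLP) plus an attention cost
    proportional to the number of (query, cached-key) pairs."""
    return gen_len * per_token + attn_per_pair * gen_len * (gen_len + 1) / 2

full = decode_cost(8192)   # full-length rollout
short = decode_cost(512)   # prefix-only rollout with early termination
print(f"~{full / short:.0f}x fewer sampling FLOPs")
```

Because of the quadratic attention term, the saving exceeds the simple 8192/512 = 16x length ratio under this toy model; with realistic constants the picture will differ, but the qualitative point stands.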
Effectiveness
On-policy prefix distillation maintains performance comparable to full OPD, as demonstrated by experiments on AI-for-Math and out-of-domain benchmarks.
Demerits
Limited Generalizability
The proposed method may not be directly applicable to scenarios where prefix information is not sufficient to determine the correctness of the output.
Potential Overfitting
Terminating each rollout early during distillation may lead to overfitting on prefix behavior if not properly regularized.
Expert Commentary
The proposed modification to on-policy distillation is a meaningful contribution. By exploiting the observation that training signals concentrate in the prefix of each output, the authors arrive at a simple change that is easy to adopt in existing OPD pipelines and that could accelerate model development in resource-constrained environments. That said, the limitations noted above deserve attention: the method's generalizability beyond settings where prefixes carry most of the training signal, and the risk of overfitting under aggressive early termination, both warrant careful evaluation. Further research is needed to fully explore the implications of this approach and to identify additional applications.
Recommendations
- ✓ Future research should focus on exploring the potential applications of on-policy prefix distillation in various AI domains, including natural language processing, computer vision, and robotics.
- ✓ The proposed method should be further evaluated in scenarios where prefix information is not sufficient to determine the correctness of the output, to assess its generalizability and robustness.