Fast and Effective On-policy Distillation from Reasoning Prefixes
arXiv:2602.15260v1 Announce Type: new Abstract: On-policy distillation (OPD), which samples trajectories from the student model and supervises them with a teacher at the token level, avoids relying solely on verifiable terminal rewards and can yield better generalization than off-policy distillation. However, OPD requires expensive on-the-fly sampling of the student policy during training, which substantially increases training cost, especially for long responses. Our initial analysis shows that, during OPD, training signals are often concentrated in the prefix of each output, and that even a short teacher-generated prefix can significantly help the student produce the correct answer. Motivated by these observations, we propose a simple yet effective modification of OPD: we apply the distillation objective only to prefixes of student-generated outputs and terminate each sampling early during distillation. Experiments on a suite of AI-for-Math and out-of-domain benchmarks show that on-policy prefix distillation matches the performance of full OPD while reducing training FLOP by 2x-47x.
Executive Summary
The article proposes a modification to on-policy distillation (OPD): applying the distillation objective only to prefixes of student-generated outputs and terminating each sampled rollout early during distillation. This approach, on-policy prefix distillation, substantially reduces training cost while maintaining performance. Experiments on AI-for-Math and out-of-domain benchmarks show that the method reduces training FLOPs by 2x-47x without compromising accuracy. The design rests on the observation that, during OPD, training signals are concentrated in the prefix of each output, and that even a short teacher-generated prefix can significantly help the student reach the correct answer. This makes the method a practical option for accelerating model development, particularly in resource-constrained environments.
Key Points
- ▸ On-policy distillation (OPD) is a method that samples trajectories from the student model and supervises them with a teacher at the token level.
- ▸ The proposed modification, on-policy prefix distillation, applies the distillation objective only to prefixes of student-generated outputs and terminates each sampling early during distillation.
- ▸ Experiments demonstrate that on-policy prefix distillation reduces training FLOPs by 2x-47x without compromising accuracy.
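The prefix-restricted objective described above can be sketched as a token-level distillation loss that is simply masked to the first `prefix_len` tokens of a student rollout. The sketch below is illustrative only: the function name, the choice of forward KL, and the NumPy setting are assumptions, and the paper's exact objective and implementation may differ.

```python
import numpy as np

def prefix_distillation_loss(student_logits, teacher_logits, prefix_len):
    """Token-level KL(teacher || student), averaged over only the first
    `prefix_len` positions of a student-sampled trajectory.

    student_logits, teacher_logits: (seq_len, vocab_size) arrays of logits
    scored on the same student-generated token sequence.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    # Restrict supervision to the prefix; later tokens receive no signal.
    s = log_softmax(student_logits[:prefix_len])
    t = log_softmax(teacher_logits[:prefix_len])

    # Per-token KL divergence, summed over the vocabulary,
    # then averaged over the supervised prefix positions.
    kl_per_token = (np.exp(t) * (t - s)).sum(axis=-1)
    return kl_per_token.mean()
```

In a full training loop the student would also stop sampling at the prefix boundary, so the positions beyond `prefix_len` are never generated at all; the mask shown here only illustrates where the loss applies.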
Merits
Efficiency
The proposed modification substantially reduces training cost by distilling only on output prefixes and terminating each sampled rollout early during distillation.
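A back-of-envelope model makes the saving concrete: if decoding cost has a fixed per-token component plus an attention term that grows with the KV cache, truncating rollouts cuts sampling FLOPs superlinearly. The constants and sequence lengths below are purely illustrative and are not taken from the paper; the reported 2x-47x range depends on model size, rollout lengths, and implementation.

```python
def decode_cost(gen_len, per_token=2.0, attn_per_pair=0.001):
    """Rough FLOP model for autoregressively decoding gen_len tokens:
    a fixed cost per token (projections/MLP) plus an attention cost
    proportional to the number of (query, cached-key) pairs."""
    return gen_len * per_token + attn_per_pair * gen_len * (gen_len + 1) / 2

full = decode_cost(8192)   # full-length rollout
short = decode_cost(512)   # prefix-only rollout with early termination
print(f"~{full / short:.0f}x fewer sampling FLOPs")
```

Because of the quadratic attention term, the saving exceeds the simple 8192/512 = 16x length ratio under this toy model; with realistic constants the picture will differ, but the qualitative point stands.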
Effectiveness
On-policy prefix distillation maintains performance comparable to full OPD, as demonstrated by experiments on AI-for-Math and out-of-domain benchmarks.
Demerits
Limited Generalizability
The proposed method may not be directly applicable to scenarios where prefix information is not sufficient to determine the correctness of the output.
Potential Overfitting
Terminating each rollout early during distillation may lead to overfitting on prefix behavior if not properly regularized.
Expert Commentary
The proposed modification to on-policy distillation is a meaningful contribution. By exploiting the observation that training signals concentrate in the prefix of each output, the authors arrive at a simple change that is easy to adopt in existing OPD pipelines and that could accelerate model development in resource-constrained environments. That said, the limitations noted above deserve attention: the method's generalizability beyond settings where prefixes carry most of the training signal, and the risk of overfitting under aggressive early termination, both warrant careful evaluation. Further research is needed to fully explore the implications of this approach and to identify additional applications.
Recommendations
- ✓ Future research should focus on exploring the potential applications of on-policy prefix distillation in various AI domains, including natural language processing, computer vision, and robotics.
- ✓ The proposed method should be further evaluated in scenarios where prefix information is not sufficient to determine the correctness of the output, to assess its generalizability and robustness.