On-Policy Supervised Fine-Tuning for Efficient Reasoning
arXiv:2602.13407v1
Abstract: Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought reasoning, achieving strong performance at high computational cost. Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs. We revisit this objective and challenge the necessity of such complexity. Through principled analysis, we identify fundamental misalignments in this paradigm: KL regularization loses its intended role when correctness and length are directly verifiable, and group-wise normalization becomes ambiguous under multiple reward signals. By removing these two items and simplifying the reward to a truncation-based length penalty, we show that the optimization problem reduces to supervised fine-tuning on self-generated data filtered for both correctness and conciseness. We term this simplified training strategy on-policy SFT. Despite its simplicity, on-policy SFT consistently defines the accuracy-efficiency Pareto frontier. It reduces CoT length by up to 80% while maintaining original accuracy, surpassing more complex RL-based methods across five benchmarks. Furthermore, it significantly enhances training efficiency, reducing GPU memory usage by 50% and accelerating convergence by 70%. Our code is available at https://github.com/EIT-NLP/On-Policy-SFT.
Executive Summary
The article 'On-Policy Supervised Fine-Tuning for Efficient Reasoning' challenges the complexity of the multi-reward reinforcement learning (RL) pipelines used to train large reasoning models (LRMs) for chain-of-thought reasoning. The authors argue that multi-reward objectives, while intended to jointly optimize correctness and brevity, often destabilize training and result in suboptimal trade-offs. By removing KL regularization and group-wise normalization and simplifying the reward to a truncation-based length penalty, they show that the optimization problem reduces to supervised fine-tuning on self-generated data filtered for both correctness and conciseness, a strategy they term on-policy SFT. This approach reduces chain-of-thought (CoT) length by up to 80% while maintaining accuracy, outperforming more complex RL-based methods across five benchmarks. It also improves training efficiency, cutting GPU memory usage by 50% and accelerating convergence by 70%. The study highlights the potential of simpler, more efficient training strategies in the development of advanced reasoning models.
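The recipe described above amounts to a sampling-filtering-SFT loop: sample chains of thought from the current model, keep only those that are both correct and within a length budget (the truncation-based penalty acts as a hard filter), and run standard supervised fine-tuning on the survivors. The following is a minimal sketch of the filtering step only, not the authors' released implementation; the tuple layout, the `is_correct` flags, and the `max_tokens` budget are illustrative assumptions.

```python
def on_policy_sft_filter(samples, max_tokens):
    """Keep only self-generated samples that are both correct and concise.

    Each sample is (prompt, completion, is_correct, n_tokens). A completion
    longer than max_tokens is treated as truncated and rejected, so the
    truncation-based length penalty reduces to a hard filter.
    """
    return [
        (prompt, completion)
        for prompt, completion, is_correct, n_tokens in samples
        if is_correct and n_tokens <= max_tokens
    ]

# Toy on-policy batch: four self-generated chains of thought for one prompt.
batch = [
    ("2+2?", "Think briefly. Answer: 4", True, 8),        # correct, concise -> keep
    ("2+2?", "Very long reasoning... Answer: 4", True, 4096),  # correct, truncated -> drop
    ("2+2?", "Answer: 5", False, 4),                      # concise, wrong -> drop
    ("2+2?", "Answer: 4", True, 3),                       # correct, concise -> keep
]

sft_data = on_policy_sft_filter(batch, max_tokens=512)
print(len(sft_data))  # 2 of 4 samples survive; standard SFT then runs on these
```

Because the filtered data comes from the model's own current distribution, each round of filtering and fine-tuning stays on-policy without any explicit policy-gradient machinery.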
Key Points
- ▸ Simplification of reward objectives in LRMs leads to more stable and efficient training.
- ▸ On-policy SFT achieves significant reductions in CoT length without sacrificing accuracy.
- ▸ The method outperforms complex RL-based approaches in both performance and efficiency.
Merits
Simplification and Efficiency
The article effectively demonstrates that simplifying the training objectives and removing complex components like KL regularization and group-wise normalization can lead to more stable and efficient training processes. This simplification not only reduces computational costs but also accelerates convergence, making the method more practical for real-world applications.
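The reduction that makes this simplification possible can be made concrete with a standard policy-gradient argument. This is a reconstruction from the abstract, not the paper's exact notation: suppose the reward is the binary truncation-based signal $r(y) = 1$ if $y$ is correct and $|y| \le L$, and $0$ otherwise, with the KL term and group-wise normalization removed.

```latex
% Policy gradient with a binary accept/reject reward:
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta}\!\left[ r(y)\, \nabla_\theta \log \pi_\theta(y) \right]
  = p_{\mathrm{acc}} \cdot
    \mathbb{E}_{y \sim \pi_\theta \,\mid\, r(y)=1}\!\left[ \nabla_\theta \log \pi_\theta(y) \right],
\qquad
p_{\mathrm{acc}} = \Pr_{y \sim \pi_\theta}\!\left[\, r(y) = 1 \,\right]
```

Up to the scalar acceptance probability $p_{\mathrm{acc}}$, this is the gradient of maximum-likelihood (SFT) training on the model's own accepted samples, which is consistent with the abstract's claim that the objective reduces to supervised fine-tuning on self-generated data filtered for correctness and conciseness.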
Performance Superiority
The proposed on-policy SFT method achieves superior performance compared to more complex RL-based methods, maintaining high accuracy while significantly reducing the length of chain-of-thought reasoning. This makes it a more efficient and effective approach for developing advanced reasoning models.
Empirical Validation
The study provides empirical validation through extensive benchmarking across five different datasets, demonstrating the robustness and generalizability of the on-policy SFT method. This rigorous testing adds credibility to the findings and highlights the potential for broader application.
Demerits
Limited Scope of Benchmarks
While the study benchmarks across five datasets, the scope of these benchmarks may not cover the full spectrum of potential applications for LRMs. Further testing across a more diverse set of benchmarks would strengthen the generalizability of the findings.
Potential Over-Simplification
The simplification of the reward objectives, while beneficial in terms of efficiency, may potentially overlook nuanced aspects of reasoning that more complex RL-based methods could capture. This could limit the applicability of the method in scenarios requiring highly nuanced reasoning.
Lack of Long-Term Stability Analysis
The study focuses on immediate performance and efficiency gains but does not provide a long-term stability analysis of the models trained using on-policy SFT. Understanding the long-term performance and stability of these models would be crucial for their deployment in real-world applications.
Expert Commentary
The article 'On-Policy Supervised Fine-Tuning for Efficient Reasoning' presents a compelling argument for simplifying the training objectives of large reasoning models. The authors' principled analysis identifies fundamental misalignments in the current paradigm, particularly the destabilizing effects of complex multi-reward objectives. By removing KL regularization and group-wise normalization and simplifying the reward to a truncation-based length penalty, the authors demonstrate significant improvements in both performance and efficiency.

The on-policy SFT method achieves up to 80% reduction in chain-of-thought length while maintaining accuracy, outperforming more complex RL-based methods across five benchmarks. This study not only highlights the potential of simpler training strategies but also underscores the importance of efficiency in the development of advanced AI systems. The empirical validation through extensive benchmarking adds credibility to the findings, making a strong case for the adoption of on-policy SFT in practical applications.

However, the study's limitations, such as the potential over-simplification of reward objectives and the lack of long-term stability analysis, warrant further investigation. Overall, the article provides valuable insights into the future of efficient and effective training methods in AI, contributing significantly to the ongoing discourse in the field.
Recommendations
- ✓ Further research should explore the long-term stability and performance of models trained using on-policy SFT to ensure their suitability for real-world applications.
- ✓ Future studies should benchmark the on-policy SFT method across a more diverse set of datasets to validate its generalizability and robustness in various scenarios.