Reinforcement-aware Knowledge Distillation for LLM Reasoning
arXiv:2602.22495v1 Announce Type: new Abstract: Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
Executive Summary
This paper proposes RL-aware Distillation (RLAD), a knowledge distillation method that addresses the limitations of existing approaches when combined with reinforcement learning. RLAD performs selective imitation during RL, guiding the student model toward the teacher only when doing so improves the current policy update. Its core component, Trust Region Ratio Distillation (TRRD), replaces the traditional teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher-old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts. Across diverse logic reasoning and math benchmarks, the authors show that RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student distillation, which in turn could reduce the inference cost of long chain-of-thought reasoning LLMs by transferring their capability into smaller students.
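The abstract does not give the exact form of the TRRD objective, but one plausible reading of "a PPO/GRPO-style likelihood-ratio objective anchored to a teacher-old-policy mixture" can be sketched as follows. The function name `trrd_loss` and the parameters `alpha` (mixture weight) and `eps` (clip range) are illustrative assumptions, not the paper's actual interface:

```python
import math

def trrd_loss(logp_student, logp_old, logp_teacher, advantage,
              alpha=0.5, eps=0.2):
    """Illustrative per-token TRRD-style loss (a sketch, not the paper's code).

    The PPO-style likelihood ratio is taken against a mixture of the teacher
    and the old policy rather than the old policy alone, so positive-advantage
    tokens are pulled toward the teacher while the clip keeps each update
    inside a trust region.
    """
    # Anchor: token probability under the teacher / old-policy mixture.
    p_mix = alpha * math.exp(logp_teacher) + (1.0 - alpha) * math.exp(logp_old)
    ratio = math.exp(logp_student) / p_mix
    # Standard PPO clip around the anchor.
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Pessimistic surrogate, negated so that minimizing the loss
    # maximizes the advantage-weighted objective.
    return -min(ratio * advantage, clipped * advantage)
```

Under this reading, when the student already matches the anchor the ratio is 1 and the loss reduces to the negated advantage, exactly as in PPO; the teacher only reshapes the anchor, which is how imitation enters the same surrogate that drives reward maximization.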
Key Points
- ▸ RLAD selectively imitates the teacher during reinforcement learning
- ▸ TRRD replaces the traditional KL regularizer with a likelihood-ratio objective
- ▸ RLAD outperforms offline distillation, standard GRPO, and KL-based on-policy distillation across diverse logic reasoning and math benchmarks
Merits
Advantage-aware distillation
TRRD is anchored to a teacher-old-policy mixture, allowing for advantage-aware distillation that balances exploration, exploitation, and imitation
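Anchoring to a mixture means evaluating the log-probability of a convex combination of the teacher and old-policy distributions. A minimal sketch of that computation, done stably in log space (the helper name `mixture_anchor_logprob` and the weight `alpha` are assumptions for illustration):

```python
import math

def mixture_anchor_logprob(logp_teacher, logp_old, alpha=0.5):
    """log(alpha * p_teacher + (1 - alpha) * p_old), computed in log space.

    Uses the log-sum-exp trick so that very negative token log-probs
    (common for long sequences) do not underflow to zero.
    """
    a = math.log(alpha) + logp_teacher        # log of the teacher term
    b = math.log(1.0 - alpha) + logp_old      # log of the old-policy term
    m = max(a, b)                             # shift by the max for stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))
```

With `alpha = 1` this anchor degenerates to pure teacher imitation and with `alpha = 0` to standard PPO/GRPO, which is one way to see how a mixture can trade off imitation against on-policy exploitation.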
Trust-region-bounded distillation
TRRD's clipped likelihood-ratio objective bounds each policy update relative to the teacher-old-policy anchor, keeping every step inside a trust region instead of letting the student drift arbitrarily in a single update
Improved efficiency and accuracy
By distilling a long chain-of-thought teacher into a smaller student during RL, RLAD can reduce inference cost while improving reasoning accuracy
Demerits
Limited evaluation metrics
The evaluation focuses on logic reasoning and math benchmarks; testing RLAD on a broader range of tasks and metrics would strengthen the claims
Dependence on reinforcement learning
RLAD relies on reinforcement learning, which can be computationally expensive and may not be suitable for all applications
Expert Commentary
The paper presents a well-motivated and novel approach to knowledge distillation for large language models. RLAD addresses a real gap: SFT-style distillation objectives transfer poorly to RL post-training because of distribution mismatch and objective interference. TRRD is the most innovative component, replacing the teacher-student KL regularizer with a clipped likelihood-ratio objective anchored to a teacher-old-policy mixture, so imitation enters the same advantage-weighted surrogate that drives reward maximization rather than competing with it through a separately balanced loss term. The evaluation across diverse logic reasoning and math benchmarks supports the claimed gains over offline distillation, standard GRPO, and KL-based on-policy distillation. A more detailed discussion of limitations would strengthen the work, particularly the computational cost of RL training and the narrow benchmark coverage. Nevertheless, the method is a meaningful step toward cheaper and more accurate long chain-of-thought reasoning models.
Recommendations
- ✓ Further evaluation of RLAD on a broader range of tasks and metrics
- ✓ Investigation of RLAD's performance on more complex and real-world applications