Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

arXiv:2602.17686v1 Announce Type: cross Abstract: Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to faithfully reproduce. Existing approaches compress reasoning into a single step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition. First, we establish structural understanding via masked shuffled reconstruction. Second, we apply Group Relative Policy Optimization (GRPO) on masked completion tasks, enabling the model to discover its own balance between accuracy and brevity. Third, we identify persistent failure cases and guide the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO. Experiments on GSM8K demonstrate that our approach enables Qwen2.5-3B-Base to achieve an 11.29 percent accuracy improvement while reducing output length by 27.4 percent, surpassing both instruction-tuned variants and prior distillation methods.

Executive Summary

This article presents a curriculum learning framework for efficient chain-of-thought distillation, addressing the capacity mismatch between large language models and compact student models. The three-stage framework uses structure-aware masking, Group Relative Policy Optimization (GRPO), and targeted rewriting to let the student acquire reasoning skills progressively and internalize teacher knowledge. On GSM8K, the method improves Qwen2.5-3B-Base's accuracy by 11.29% while shortening outputs by 27.4%, outperforming both prior distillation methods and instruction-tuned variants.
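The first stage, masked shuffled reconstruction, can be pictured as a data-construction step: the teacher's reasoning steps are shuffled and partially masked, and the student must recover the original ordered chain. The sketch below is illustrative only; the paper's exact masking scheme, mask rate, and prompt template are not given in the abstract, so every name here is a hypothetical stand-in.

```python
import random

def make_reconstruction_example(rationale_steps, mask_rate=0.3, seed=0):
    """Build one stage-1 training example: shuffle the teacher's
    reasoning steps, mask a fraction of them, and pair the corrupted
    prompt with the original ordered chain as the target.
    (Illustrative sketch; not the paper's exact scheme.)"""
    rng = random.Random(seed)
    shuffled = rationale_steps[:]          # copy so the original order survives
    rng.shuffle(shuffled)
    masked = ["<mask>" if rng.random() < mask_rate else step
              for step in shuffled]
    prompt = "Reorder and fill in the reasoning steps:\n" + "\n".join(masked)
    target = "\n".join(rationale_steps)    # student reconstructs this
    return prompt, target

steps = [
    "Each box holds 12 eggs.",
    "There are 3 boxes, so 3 * 12 = 36 eggs.",
    "The answer is 36.",
]
prompt, target = make_reconstruction_example(steps)
```

Training on pairs like `(prompt, target)` forces the student to model the structure of a rationale (step order and dependencies) before it is ever asked to generate one from scratch.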

Key Points

  • Curriculum learning framework for chain-of-thought distillation
  • Structure-aware masking and GRPO for skill acquisition
  • Targeted rewriting for internalizing teacher knowledge
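The GRPO stages reward each sampled completion against its own group rather than a learned value model: rewards are normalized by the group's mean and standard deviation to form advantages. The sketch below shows that normalization plus a hypothetical reward trading accuracy against brevity; the paper's actual reward shape and coefficient are not specified in the abstract.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """GRPO's group-relative advantage: normalize each completion's
    reward by the mean and std of its sampling group, so no separate
    value network is needed. (Minimal sketch of the core computation.)"""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

def reward(correct, n_tokens, brevity_coef=0.001):
    """Hypothetical reward balancing accuracy and brevity: 1 for a
    correct final answer minus a small per-token length penalty."""
    return (1.0 if correct else 0.0) - brevity_coef * n_tokens

# Three sampled completions for one problem: short correct,
# long correct, short but wrong.
group = [reward(True, 120), reward(True, 300), reward(False, 80)]
adv = grpo_advantages(group)
```

Because the correct-but-verbose sample earns a lower advantage than the correct-and-concise one, the policy gradient pushes the student toward shorter rationales without an explicit length target, which is how the model "discovers its own balance between accuracy and brevity."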

Merits

Improved Accuracy

The proposed framework achieves an 11.29% accuracy improvement, demonstrating its effectiveness in distilling chain-of-thought reasoning.

Output Length Reduction

The framework reduces output length by 27.4%, making it more efficient and practical for real-world applications.

Demerits

Computational Complexity

The proposed framework involves multiple stages and optimization techniques, which may increase computational complexity and require significant resources.

Expert Commentary

The article presents a well-structured and innovative approach to the challenges of chain-of-thought distillation. The use of curriculum learning, structure-aware masking, and GRPO reflects a clear understanding of the limitations of current distillation methods. The results on GSM8K are strong, and the framework's ability to balance accuracy and brevity is a meaningful contribution. However, the evaluation is limited to a single benchmark, and further research is needed to establish how broadly the framework applies.

Recommendations

  • Further experimentation with different datasets and tasks to demonstrate the framework's generalizability
  • Investigation into the potential applications of the framework in real-world scenarios, such as educational settings or decision-support systems
