Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum
arXiv:2603.18325v1 Announce Type: new Abstract: Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.
Executive Summary
This article proposes a novel approach to training language models, dubbed autocurriculum, in which the model's own performance determines which problems training focuses on. The authors demonstrate that autocurriculum can significantly reduce the data and compute costs of training chain-of-thought reasoning models. Specifically, autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive supervised fine-tuning, and in reinforcement learning it decouples computational cost from the quality of the reference model. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples. The findings suggest that the data and compute costs of training reasoning models are not fundamental and can be reduced through better algorithmic design.
Key Points
- ▸ Autocurriculum is a novel approach to training language models that leverages the model's performance to decide which problems to focus on during training.
- ▸ Autocurriculum can significantly reduce the costs associated with training chain-of-thought reasoning models, both in terms of data and compute.
- ▸ For SFT, autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles.
- ▸ For RL fine-tuning, autocurriculum decouples computational cost from the quality of the reference model, reducing it to a burn-in cost that is nearly independent of the target accuracy.
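The adaptive-selection loop behind the SFT result can be caricatured in a few lines. The sketch below is an illustration of the general idea only, not the paper's algorithm: `ToyModel`, `autocurriculum_sft`, and the `teacher` callable are hypothetical names, and the toy model simply memorizes the prompts it is fine-tuned on.

```python
import random

class ToyModel:
    """Toy stand-in for a language model: it 'solves' exactly the
    prompts it has been fine-tuned on (hypothetical interface)."""
    def __init__(self):
        self.known = set()

    def solves(self, prompt):
        return prompt in self.known

    def finetune(self, demos):
        # demos is a list of (prompt, demonstration) pairs.
        self.known.update(p for p, _ in demos)

def autocurriculum_sft(model, prompts, teacher, rounds=100, batch_size=4):
    """Each round: probe the current model on all prompts, then request
    teacher demonstrations only for prompts it still fails. Teacher
    supervision is thus spent exactly where the model struggles."""
    for _ in range(rounds):
        failures = [p for p in prompts if not model.solves(p)]
        if not failures:
            break  # curriculum exhausted: the model solves every prompt
        batch = random.sample(failures, min(batch_size, len(failures)))
        model.finetune([(p, teacher(p)) for p in batch])
    return model

model = autocurriculum_sft(ToyModel(), list(range(20)),
                           teacher=lambda p: f"demo-{p}")
```

The key structural point is that the teacher is queried only on the `failures` set, which shrinks as the model improves; a non-adaptive recipe would instead draw demonstrations from all of `prompts` regardless of what the model already solves.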
Merits
Strengths
The authors provide a rigorous and well-reasoned analysis of the benefits of autocurriculum, drawing on classical techniques from boosting and learning from counterexamples. The guarantees are strong: an exponential reduction in demonstration complexity for SFT, and an RL compute cost in which the reference model's quality enters only as a burn-in term nearly independent of the target accuracy. Notably, the analysis requires no assumptions on the distribution or difficulty of prompts.
Demerits
Limitations
The guarantees are theoretical, and the abstract reports no empirical validation at scale, so whether the savings materialize in practical training pipelines remains open. Additionally, the results concern the specific autocurriculum formulations analyzed, which may not transfer directly to other training setups.
Expert Commentary
The article represents a significant contribution to the field of natural language processing, building on chain-of-thought reasoning and exploring a novel approach to training language models. The theoretical guarantees are notably broad: because the analysis requires no assumptions on the distribution or difficulty of prompts, the benefits of adaptive data selection are established in considerable generality. The open question is whether the exponential demonstration savings and reference-model decoupling carry over to large-scale training pipelines, which the abstract does not address empirically. Nevertheless, the article is a valuable contribution, and its findings have important implications for the development and deployment of reasoning models.
Recommendations
- ✓ Validate the theoretical savings empirically, in both large-scale SFT and RL fine-tuning pipelines.
- ✓ Investigate the robustness of autocurriculum across domains and training setups beyond the formulations analyzed, and explore its potential applications.