Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training
arXiv:2602.20532v1 Abstract: Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
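The abstract does not give the curator's loss in closed form, but the online stochastic mirror descent update it references has a standard shape on the probability simplex. A minimal sketch, assuming a negative-entropy mirror map and an importance-weighted improvement estimate (both assumptions on our part, not details from the paper): the curator maintains a sampling distribution $p_t$ over the $N$ problems in the bank and updates

$$p_{t+1} = \arg\min_{p \in \Delta_N} \Big[ \eta_t \langle \hat g_t, p \rangle + D_\Phi(p \,\|\, p_t) \Big], \qquad \Phi(p) = \sum_{i=1}^{N} p_i \log p_i,$$

which, with the entropic regularizer, reduces to the multiplicative-weights form $p_{t+1,i} \propto p_{t,i} \exp(-\eta_t \hat g_{t,i})$, where $\hat g_{t,i} = -\hat\Delta_i / p_{t,i}$ for sampled problems and $0$ otherwise, and $\hat\Delta_i$ is the observed policy improvement on problem $i$. Zeroing the unsampled coordinates is the usual unbiased importance-weighted estimator under partial feedback.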
Executive Summary
This article proposes ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). The framework learns a neural curator that dynamically selects training problems from large problem banks by optimizing for expected policy performance improvement. Empirically, it improves training stability and efficiency, outperforming uniform sampling and strong curriculum baselines with relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline, and up to an 80% training speedup. While ACTOR-CURATOR is a promising approach for LLM post-training, its applicability and scalability in real-world deployments remain open questions.
Key Points
- ▸ ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks
- ▸ The framework casts problem selection as a non-stationary stochastic bandit and optimizes for expected policy improvement via online stochastic mirror descent, with regret guarantees under partial feedback (a minimal sketch follows this list)
- ▸ It achieves improved training stability and efficiency, outperforming uniform sampling and strong curriculum baselines
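To make the selection loop concrete, here is a self-contained Python sketch of a bandit-style curator: it keeps a distribution over a problem bank, samples a batch, observes a per-problem improvement signal, and applies an exponentiated-gradient (mirror descent) step under partial feedback. This is an illustration of the general technique, not the paper's implementation; the batch size, learning rate, and the placeholder improvement signal are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_batch(probs, batch_size):
    """Sample problem indices from the curator's current distribution."""
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)

def osmd_update(probs, batch, improvements, lr=0.1):
    """Exponentiated-gradient step on the probability simplex.

    `improvements[j]` is the observed policy-improvement signal for problem
    batch[j]; unsampled problems get a zero estimate, which keeps the
    importance-weighted estimator unbiased under partial feedback.
    """
    g = np.zeros_like(probs)
    g[batch] = improvements / np.maximum(probs[batch], 1e-8)  # IW estimate
    new = probs * np.exp(lr * g)   # multiplicative-weights update
    return new / new.sum()         # re-normalize onto the simplex

# Toy usage: 1000-problem bank, uniform start, 100 curator rounds.
n, batch_size = 1000, 8
probs = np.full(n, 1.0 / n)
for step in range(100):
    batch = select_batch(probs, batch_size)
    # Placeholder signal; in practice this would come from the RL trainer
    # (e.g., the actor's reward delta after training on each problem).
    improvements = rng.normal(loc=0.01, scale=0.05, size=batch_size)
    probs = osmd_update(probs, batch, improvements)
```

In a real post-training loop, the improvement signal would come from the RL trainer (for instance, the change in the actor's pass rate on the sampled problems), and the learning rate would likely need a schedule, since the feedback distribution drifts as the policy improves, which is exactly the non-stationarity the paper's bandit formulation addresses.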
Merits
Strength in Scalability
ACTOR-CURATOR's ability to select training problems efficiently from large problem banks makes it a scalable solution for LLM post-training.
Improved Training Stability
By optimizing selection directly for expected policy improvement, the curator delivers more stable and more efficient training than uniform sampling.
Demerits
Limited Generalizability
The reported gains may not generalize beyond the evaluated problem domains and LLM architectures.
Dependency on Large Problem Banks
The effectiveness of ACTOR-CURATOR relies on large problem banks, which may not be available in all settings.
Expert Commentary
ACTOR-CURATOR represents a meaningful advance in LLM post-training: it treats curriculum design as an explicit optimization problem, selecting data for expected policy improvement rather than by hand-tuned heuristics. Open questions remain, however. The dependence on large problem banks may limit applicability where curated problem sets are scarce, and the reported gains still need validation across domains and model families. Even so, the empirical results are promising, and the approach is a credible candidate for broad adoption in post-training pipelines.
Recommendations
- ✓ Explore the framework's adaptability to different problem domains and LLM architectures
- ✓ Investigate ACTOR-CURATOR in real-world deployments, such as large-scale AI systems and intelligent decision-making applications