
Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training

arXiv:2602.20532v1 Announce Type: cross Abstract: Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.

Executive Summary

This article proposes ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). The framework dynamically selects training problems from large problem banks by optimizing for expected policy performance improvement. Empirical results demonstrate improved training stability and efficiency: the method outperforms uniform sampling and strong curriculum baselines, with relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline, and up to an 80% speedup. While ACTOR-CURATOR offers a promising approach for LLM post-training, its practical applicability and scalability in real-world scenarios remain to be validated, and further research is needed to fully realize its potential.

Key Points

  • ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks
  • The framework optimizes for expected policy performance improvement using online stochastic mirror descent
  • It achieves improved training stability and efficiency, outperforming uniform sampling and strong curriculum baselines
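To make the bandit formulation in the key points concrete, here is a minimal Python sketch of a mirror-descent curator in the spirit of the paper's setup. This is an illustrative exponential-weights (EXP3-style) update, not the authors' implementation: mirror descent with an entropy regularizer over a toy problem bank, where the "reward" stands in for the measured policy improvement, and where the bank, rewards, and hyperparameters are all invented for illustration.

```python
import math
import random

def exp3_update(weights, probs, chosen, reward, eta):
    """One exponential-weights (mirror descent with entropy regularizer)
    step under partial/bandit feedback: only the chosen problem's reward
    is observed, so we use the importance-weighted estimate
    reward / probs[chosen] as an unbiased stand-in for the full gradient."""
    new = list(weights)
    new[chosen] *= math.exp(eta * reward / probs[chosen])
    z = sum(new)  # renormalize back onto the probability simplex
    return [w / z for w in new]

def sampling_probs(weights, gamma):
    """Mix the weight distribution with uniform exploration."""
    n = len(weights)
    total = sum(weights)
    return [(1 - gamma) * w / total + gamma / n for w in weights]

# Toy problem bank: each entry's "improvement" stands in for the (unknown)
# expected policy improvement from training on that problem.
random.seed(0)
improvement = [0.1, 0.6, 0.3]  # hypothetical values, for illustration only
weights = [1.0, 1.0, 1.0]
gamma, eta = 0.1, 0.1 / 3
for _ in range(2000):
    probs = sampling_probs(weights, gamma)
    chosen = random.choices(range(3), weights=probs)[0]
    reward = improvement[chosen]  # in ACTOR-CURATOR this would be measured
    weights = exp3_update(weights, probs, chosen, reward, eta)

best = max(range(3), key=lambda i: weights[i])
# The curator's sampling distribution concentrates on the
# highest-improvement problem as training proceeds.
```

The importance weighting is what handles the partial-feedback aspect mentioned in the abstract: the curator never sees the improvement a non-selected problem would have yielded, so it corrects the observed reward by the probability of having selected it.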

Merits

Strength in Scalability

ACTOR-CURATOR's ability to efficiently select training problems from large problem banks makes it a scalable solution for LLM post-training

Improved Training Stability

The framework's optimization for expected policy performance improvement leads to improved training stability and efficiency

Demerits

Limited Generalizability

The framework's performance may be limited to specific problem domains or LLM architectures

Dependency on Large Problem Banks

The effectiveness of ACTOR-CURATOR relies on the availability of large problem banks, which may not be feasible in all scenarios

Expert Commentary

ACTOR-CURATOR represents a notable advance in LLM post-training: it casts curriculum construction as a bandit problem and directly optimizes for expected policy improvement rather than relying on hand-crafted difficulty heuristics. However, further research is needed to understand its limitations. For instance, the framework's dependence on large problem banks may limit its applicability in data-scarce domains. Nevertheless, the reported empirical results are promising, and ACTOR-CURATOR has the potential to become a widely adopted approach in the field.

Recommendations

  • Future research should explore the framework's adaptability to different problem domains and LLM architectures
  • Researchers should investigate ACTOR-CURATOR's performance in real-world deployments, such as large-scale AI systems and intelligent decision-making applications
