Actor-Critic Pretraining for Proximal Policy Optimization

arXiv:2602.23804v1 Abstract: Reinforcement learning (RL) actor-critic algorithms enable autonomous learning but often require a large number of environment interactions, which limits their applicability in robotics. Leveraging expert data can reduce the number of required environment interactions. A common approach is actor pretraining, where the actor network is initialized via behavioral cloning on expert demonstrations and subsequently fine-tuned with RL. In contrast, the initialization of the critic network has received little attention, despite its central role in policy optimization. This paper proposes a pretraining approach for actor-critic algorithms like Proximal Policy Optimization (PPO) that uses expert demonstrations to initialize both networks. The actor is pretrained via behavioral cloning, while the critic is pretrained using returns obtained from rollouts of the pretrained policy. The approach is evaluated on 15 simulated robotic manipulation and locomotion tasks. Experimental results show that actor-critic pretraining improves sample efficiency by 86.1% on average compared to no pretraining and by 30.9% compared to actor-only pretraining.

Executive Summary

This article proposes a pretraining approach for actor-critic algorithms, specifically Proximal Policy Optimization (PPO), using expert demonstrations to initialize both actor and critic networks. The actor is pretrained via behavioral cloning, while the critic is pretrained using returns obtained from rollouts of the pretrained policy. Experimental results show significant improvements in sample efficiency, with an average improvement of 86.1% compared to no pretraining and 30.9% compared to actor-only pretraining.
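Per the abstract, the critic's pretraining target is the return observed in rollouts of the behavior-cloned policy. In standard notation (the discount factor $\gamma$ and squared-error loss are assumptions; the paper's exact choices are not given here), the critic $V_\phi$ is fit by regression onto the discounted return:

$$
G_t = \sum_{k=0}^{T-t} \gamma^k \, r_{t+k}, \qquad
\min_\phi \; \mathbb{E}_t \left[ \big( V_\phi(s_t) - G_t \big)^2 \right]
$$

This gives PPO a value estimate that is already calibrated to the pretrained policy's behavior, rather than starting advantage estimation from a randomly initialized critic.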

Key Points

  • Actor-critic pretraining approach for PPO using expert demonstrations
  • Initialization of both actor and critic networks
  • Sample-efficiency gains of 86.1% on average over no pretraining and 30.9% over actor-only pretraining, across 15 simulated tasks
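The two pretraining steps can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: linear least-squares regression stands in for supervised training of the actor and critic networks, and the demonstration data, reward function, and discount factor are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert demonstrations: (state, action) pairs.
states = rng.normal(size=(200, 4))
expert_actions = states @ np.array([[0.5], [-0.2], [0.1], [0.3]])

# Step 1: actor pretraining via behavioral cloning.
# Least squares stands in for gradient-based supervised training.
actor_w, *_ = np.linalg.lstsq(states, expert_actions, rcond=None)

# Step 2: roll out the pretrained actor, compute discounted returns,
# and fit the critic to them (again, regression stands in for value-net training).
gamma = 0.99  # assumed discount factor
T = 50
traj_states = rng.normal(size=(T, 4))
rewards = -np.abs(traj_states @ actor_w).ravel()  # toy reward: small actions preferred
returns = np.zeros(T)
g = 0.0
for t in reversed(range(T)):
    g = rewards[t] + gamma * g  # G_t = r_t + gamma * G_{t+1}
    returns[t] = g

critic_w, *_ = np.linalg.lstsq(traj_states, returns, rcond=None)

# actor_w and critic_w would then initialize PPO's actor and critic
# before RL fine-tuning begins.
bc_error = np.mean((states @ actor_w - expert_actions) ** 2)
```

The key design point mirrored here is that the critic's regression targets come from rollouts of the *pretrained* policy, so its value estimates match the policy PPO actually starts from.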

Merits

Improved Sample Efficiency

The proposed approach substantially reduces the number of environment interactions required for training: on average, 86.1% fewer than training without pretraining and 30.9% fewer than actor-only pretraining.

Demerits

Limited Evaluation

The approach is evaluated only on 15 simulated manipulation and locomotion tasks; performance on real hardware or in other domains is not demonstrated, so the results may not generalize to all scenarios.

Expert Commentary

The proposed actor-critic pretraining approach for PPO demonstrates significant potential for improving sample efficiency in reinforcement learning tasks. By initializing both actor and critic networks using expert demonstrations, the approach can reduce the number of environment interactions required for training, making it more practical for real-world applications. However, further evaluation is needed to fully understand the limitations and potential applications of this approach.

Recommendations

  • Further evaluation of the approach on a wider range of tasks and scenarios
  • Investigation of the potential applications of the approach in real-world robotics and autonomous systems
