Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
arXiv:2603.19470v1 Announce Type: new Abstract: Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. As inference efficiency is prioritized, the distribution gap between the inference policy and the updated policy grows, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates; the perturbed policy serves as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy and enlarges the policy family to cover the inference policy family with mismatch noises. Hence, the flattened distribution naturally tightens the gap between the updated and inference policies and reduces the tail of importance ratios, maintaining training stability. This is further validated empirically. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance but also avoids blow-ups of the importance-ratio tail and KL spikes during iterative training, along with boosted exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.
Executive Summary
This article proposes Adaptive Layerwise Perturbation (ALP), a novel method for addressing off-policy problems in Large Language Model (LLM) Reinforcement Learning (RL). ALP injects small learnable perturbations into the input hidden states of each layer during updates, preventing the updated policy from deviating too sharply from the inference policy. This approach maintains training stability, reduces the tail of importance ratios, and boosts exploration. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks demonstrate the method's effectiveness. Representation-level perturbations applied across all layers prove most beneficial, outperforming partial-layer and logits-only variants. The proposed solution points toward more robust and efficient LLM RL training.
Key Points
- ▸ ALP addresses off-policy problems in LLM RL, such as policy staleness and training-inference mismatch.
- ▸ ALP injects small learnable perturbations into input hidden states of each layer during updates.
- ▸ ALP maintains training stability, reduces the tail of importance ratios, and boosts exploration.
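The layerwise injection described above can be sketched as follows. This is a minimal illustrative mock, not the authors' implementation: it uses a toy stack of linear layers in place of a transformer, and the per-layer perturbation parameters, their initialization scale (`1e-3`), and the `perturb` flag are all assumptions made for the example.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer: each layer's *input* hidden state gets a
# small learnable perturbation during the update-time forward pass, while the
# inference-time pass (perturb=False) leaves the hidden states unchanged.
class PerturbedStack(nn.Module):
    def __init__(self, hidden_dim: int, num_layers: int, init_scale: float = 1e-3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)
        )
        # One learnable perturbation vector per layer, trained jointly with
        # the policy weights during the RL update (scale is illustrative).
        self.deltas = nn.ParameterList(
            nn.Parameter(init_scale * torch.randn(hidden_dim))
            for _ in range(num_layers)
        )

    def forward(self, h: torch.Tensor, perturb: bool = True) -> torch.Tensor:
        for layer, delta in zip(self.layers, self.deltas):
            if perturb:  # update-time pass: perturbed (updated) policy
                h = h + delta
            h = torch.tanh(layer(h))
        return h

torch.manual_seed(0)
model = PerturbedStack(hidden_dim=8, num_layers=4)
x = torch.randn(2, 8)
out_clean = model(x, perturb=False)  # matches the unchanged inference policy
out_pert = model(x, perturb=True)    # perturbed policy used in the numerator
```

Because the perturbations are small and learnable, the perturbed forward pass stays close to the unperturbed one, which is the mechanism the paper credits for keeping the updated policy near the inference policy.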
Merits
Strength in Addressing Off-Policy Problems
ALP offers a novel and effective solution to off-policy problems in LLM RL, which is a major bottleneck for training stability and exploration.
Improved Training Stability
ALP prevents the updated policy from deviating too sharply from the inference policy, maintaining training stability and reducing the tail of importance ratios.
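To make the importance-ratio role concrete, the sketch below shows a standard PPO-style clipped objective in which the numerator log-probabilities come from the perturbed policy and the denominator from the frozen inference policy, as the abstract describes. The clip range of 0.2 and the sample tensors are illustrative assumptions, not values from the paper.

```python
import torch

# PPO-style clipped surrogate: ratio = pi_perturbed / pi_inference per token.
# Clipping bounds the ratio, so heavy tails translate into bounded updates.
def clipped_objective(logp_perturbed, logp_inference, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_perturbed - logp_inference)  # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()

# Illustrative per-token log-probs and advantages.
logp_new = torch.tensor([-1.0, -2.0, -0.5])  # perturbed (updated) policy
logp_old = torch.tensor([-1.1, -1.9, -0.6])  # unchanged inference policy
adv = torch.tensor([1.0, -0.5, 2.0])
objective = clipped_objective(logp_new, logp_old, adv)
```

A small gap between the two log-probability vectors keeps every ratio near 1, which is exactly the regime ALP aims to maintain; a heavy-tailed ratio would instead be truncated by the clamp.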
Boosted Exploration
ALP boosts exploration by enlarging the policy family to cover the inference policy family with mismatch noises, leading to more efficient learning.
Demerits
Limitation in Computational Resources
ALP may add nontrivial overhead, since injecting and training perturbation parameters for the hidden states of every layer increases memory and compute per update, which could be a challenge for large-scale LLM RL systems.
Potential Overfitting
ALP may lead to overfitting if the perturbations are not properly regularized, which could result in poor generalization performance.
Expert Commentary
The proposed Adaptive Layerwise Perturbation (ALP) method offers a novel and effective solution to off-policy problems in Large Language Model Reinforcement Learning (LLM RL). By injecting small learnable perturbations into the input hidden states of each layer during updates, ALP prevents the updated policy from deviating too sharply from the inference policy, maintaining training stability and reducing the tail of importance ratios. The empirical experiments demonstrate the effectiveness of ALP on single-turn math and multi-turn tool-integrated reasoning tasks, highlighting its potential to improve the stability and efficiency of LLM RL systems. However, the method may require significant computational resources and may lead to overfitting if not properly regularized. Overall, ALP is a promising approach for stabilizing off-policy RL training of LLMs as inference and training pipelines continue to diverge.
Recommendations
- ✓ Future research should investigate the use of ALP in more complex LLM RL tasks, such as long-term planning and decision-making.
- ✓ Developing regularizers to prevent overfitting and improve generalization performance of ALP is essential for its widespread adoption.
Sources
Original: arXiv - cs.LG