STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
arXiv:2602.15620v1 Announce Type: new Abstract: Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01\%, which we term \emph{spurious tokens}. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refining, which selectively masks such updates and renormalizes the loss over valid tokens. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13\% over GRPO, 20-Entropy and JustRL.
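The claimed negative correlation between gradient magnitude and token probability is consistent with a standard policy-gradient identity. The following is a generic REINFORCE-style sketch of that intuition, not the paper's exact derivation:

$$\nabla_\theta \log \pi_\theta(y_t \mid y_{<t}) \;=\; \frac{\nabla_\theta \pi_\theta(y_t \mid y_{<t})}{\pi_\theta(y_t \mid y_{<t})}$$

The per-token update thus carries a $1/\pi_\theta(y_t \mid y_{<t})$ factor, so a rare (low-probability) token inside a rewarded sequence receives a disproportionately large update — the mechanism the paper attributes to spurious tokens.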
Executive Summary
The article 'STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens' addresses instability in reinforcement learning (RL) fine-tuning for large language models (LLMs). The authors identify a minuscule fraction of tokens (roughly 0.01%), termed 'spurious tokens,' as the source of training instability: appearing in correct responses, they contribute little to the reasoning outcome yet inherit the full sequence-level reward, producing abnormally amplified gradient updates. The proposed Spurious-Token-Aware Policy Optimization (STAPO) masks these updates and renormalizes the loss over the remaining valid tokens. Across six mathematical reasoning benchmarks and three Qwen base model sizes, STAPO demonstrates superior entropy stability and an average 7.13% performance improvement over GRPO, 20-Entropy and JustRL.
Key Points
- Identification of spurious tokens as the root cause of training instability in RL fine-tuning for LLMs.
- Proposal of STAPO, a method that selectively masks spurious tokens and renormalizes the loss over valid tokens.
- Demonstration of superior performance and stability of STAPO over existing methods across multiple benchmarks and model sizes.
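The masking-and-renormalization step described above can be sketched in a few lines. The selection rule below (a token whose probability falls under a fixed threshold inside a positively advantaged sequence) and the threshold value are illustrative assumptions for exposition; the paper's exact criterion may differ.

```python
import numpy as np

def stapo_masked_loss(token_logprobs, advantages, prob_threshold=1e-4):
    """Sketch of STAPO-style spurious-token masking with loss renormalization.

    Per the abstract, spurious tokens are rare tokens inside positively
    rewarded sequences that inherit the full sequence-level reward and
    trigger abnormally large gradient updates.  The detection rule and
    threshold here are hypothetical, not the paper's exact criterion.
    """
    token_logprobs = np.asarray(token_logprobs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    probs = np.exp(token_logprobs)

    # Mask spurious tokens: very low probability but positive advantage.
    spurious = (probs < prob_threshold) & (advantages > 0)
    valid = ~spurious

    # REINFORCE-style surrogate loss over valid tokens only, renormalized
    # by the number of valid tokens rather than the full sequence length.
    per_token_loss = -(advantages * token_logprobs)
    n_valid = max(int(valid.sum()), 1)
    return float(per_token_loss[valid].sum() / n_valid)
```

With the threshold set to zero no token is masked, which makes the effect easy to see: a single rare token in a rewarded sequence dominates the unmasked loss, while the masked loss depends only on the remaining tokens.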
Merits
Innovative Approach
The identification of spurious tokens and the development of STAPO represent a novel approach to addressing training instability in RL for LLMs.
Empirical Validation
The study provides robust empirical evidence supporting the effectiveness of STAPO through extensive benchmarks and comparisons with existing methods.
Practical Relevance
The findings have immediate practical relevance for researchers and practitioners working on RL fine-tuning for LLMs, offering a solution to a critical challenge in the field.
Demerits
Limited Generalizability
The evaluation is confined to mathematical reasoning benchmarks on Qwen base models; whether the findings transfer to other model families or task domains remains to be explored.
Computational Resources
Large-scale RL fine-tuning of the kind STAPO targets requires substantial computational resources, which could be a barrier for smaller research teams or organizations.
Potential Overfitting
There is a potential risk of overfitting to the specific benchmarks used in the study, which could limit the broader applicability of the method.
Expert Commentary
The article presents a rigorous, well-reasoned analysis of the challenges of RL fine-tuning for LLMs. Identifying spurious tokens as the root cause of training instability is a significant contribution, and the proposed STAPO method addresses it with a simple, well-motivated mechanism backed by solid empirical evidence. The findings matter to both researchers and practitioners, offering a practical remedy for late-stage performance collapse. That said, the limitations regarding generalizability and computational cost deserve attention: future work could test STAPO on other model families and task domains, and explore optimizations that reduce its compute requirements. Overall, the article makes a valuable contribution to the ongoing effort to stabilize RL fine-tuning for LLMs.
Recommendations
- Further research should explore the generalizability of STAPO to other types of models and tasks beyond mathematical reasoning benchmarks.
- Efforts should be made to reduce the computational requirements of STAPO to make it more accessible to smaller research teams and organizations.