Not All Tokens Are Needed (NAT): Token-Efficient Reinforcement Learning
arXiv:2603.06619v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a key driver of progress in large language models, but scaling RL to long chain-of-thought (CoT) trajectories is increasingly constrained by backpropagation over every generated token. Even with optimized rollout engines, full-token updates can consume a large fraction of total training cost, turning token length into a hidden tax on RL. We introduce Not All Tokens Are Needed (NAT), a unified framework that makes the token budget a first-class optimization primitive. NAT updates the policy using only a selected subset of generated tokens while preserving the learning signal of full-sequence RL. The core idea is an unbiased partial-token policy-gradient estimator via Horvitz-Thompson reweighting, which ensures statistically correct gradients despite subsampling. We instantiate NAT with two simple, plug-and-play token selection schemes: Uniform Random Sampling (URS) and Random Prefix Cutting (RPC), both of which reduce forward and backward compute and memory without modifying the reward computation or rollout pipeline. Across mathematical reasoning benchmarks, NAT matches full-token GRPO performance while using as few as 50% of tokens, providing an efficient and orthogonal pathway to scaling RL beyond the limits imposed by long trajectories. In our experiments, RPC saves 18% peak GPU memory and 29% forward and backward RL training time for Qwen3-8B.
Executive Summary
Not All Tokens Are Needed (NAT) is a framework that makes the token budget a first-class primitive in reinforcement learning (RL) for large language models, reducing the cost of backpropagating through long chain-of-thought trajectories. NAT updates the policy using only a selected subset of generated tokens, and Horvitz-Thompson reweighting keeps the resulting partial-token policy gradient unbiased. The framework is instantiated with two plug-and-play token selection schemes: Uniform Random Sampling (URS) and Random Prefix Cutting (RPC). On mathematical reasoning benchmarks, NAT matches full-token GRPO performance while using as few as 50% of the tokens; for Qwen3-8B, RPC saves 18% of peak GPU memory and 29% of forward and backward RL training time. The technique thus reduces training time, memory usage, and compute for RL on large language models without modifying the reward computation or rollout pipeline.
Key Points
- ▸ NAT is a unified framework for token-efficient reinforcement learning
- ▸ Horvitz-Thompson reweighting ensures unbiased policy gradients
- ▸ URS and RPC are simple, plug-and-play token selection schemes
- ▸ NAT matches full-token GRPO performance with 50% fewer tokens
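To illustrate the core estimator, here is a minimal sketch (not the paper's implementation) of Horvitz-Thompson reweighting under Uniform Random Sampling: each token's per-token gradient contribution is kept independently with probability `p` and, if kept, reweighted by `1/p`, so the partial-token sum is unbiased for the full-token sum. The function name `ht_partial_sum` and the toy contribution values are illustrative assumptions.

```python
import random

def ht_partial_sum(contribs, include_prob, rng):
    """Unbiased Horvitz-Thompson estimate of sum(contribs) from a URS subset."""
    # Keep each token independently with probability include_prob (URS),
    # then reweight every kept term by 1/include_prob. Because each term's
    # expected contribution is include_prob * c / include_prob = c, the
    # estimator's expectation equals the full-token sum.
    return sum(c / include_prob for c in contribs if rng.random() < include_prob)

# Monte Carlo check of unbiasedness on toy per-token gradient contributions.
rng = random.Random(0)
contribs = [0.5, -1.2, 3.0, 0.7, -0.4, 2.1]
full_sum = sum(contribs)                      # full-token "gradient"
trials = 20000
avg = sum(ht_partial_sum(contribs, 0.5, rng) for _ in range(trials)) / trials
# avg is close to full_sum even though each update touches ~50% of tokens.
```

In a real RL update the same reweighting would be applied to per-token surrogate-loss terms before backpropagation; the unbiasedness argument is identical.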
Merits
Computational efficiency
NAT's token selection schemes cut forward and backward compute and memory, making RL training on long trajectories substantially cheaper.
Scalability
NAT offers an orthogonal pathway to scaling RL beyond the limits imposed by long trajectories, so its savings can be combined with other efficiency techniques such as optimized rollout engines.
Flexibility
The framework can be instantiated with various token selection schemes, allowing researchers to adapt NAT to different RL applications.
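As one example of such a scheme, Random Prefix Cutting can be sketched as below. The uniform cut-point distribution and the helper name `rpc_select` are illustrative assumptions, since the summary does not specify the paper's exact distribution; under a uniform cut, token t's inclusion probability is (T - t)/T, which fixes its Horvitz-Thompson weight.

```python
import random

def rpc_select(seq_len, rng):
    """Random Prefix Cutting: keep tokens [0, cut) for a uniform cut in 1..seq_len."""
    # Token t survives whenever cut > t, so its inclusion probability is
    # (seq_len - t) / seq_len; the Horvitz-Thompson weight is the reciprocal.
    cut = rng.randint(1, seq_len)
    selected = list(range(cut))
    weights = [seq_len / (seq_len - t) for t in selected]
    return selected, weights

# Monte Carlo check: the HT-weighted partial sum matches the full sum on average.
rng = random.Random(1)
contribs = [1.0, 2.0, -1.0, 0.5, 3.0]
full_sum = sum(contribs)
trials = 40000
total = 0.0
for _ in range(trials):
    sel, w = rpc_select(len(contribs), rng)
    total += sum(w[i] * contribs[t] for i, t in enumerate(sel))
avg = total / trials
```

A prefix-only scheme is attractive in practice because the kept tokens are contiguous, which keeps attention masking and activation caching simple compared with scattered token subsets.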
Demerits
Limitation in generalizability
The performance of NAT may be specific to certain RL tasks or environments, and its generalizability to other domains is unclear.
Potential bias in token selection
Although Horvitz-Thompson reweighting keeps the gradient unbiased in expectation, subsampling increases estimator variance, and any selection scheme whose inclusion probabilities are mis-specified would bias the policy updates.
Expert Commentary
NAT offers a promising answer to the computational cost of RL on long chain-of-thought trajectories: by updating the policy on only a subset of generated tokens, it turns token count into a controllable budget rather than a hidden tax. Horvitz-Thompson reweighting is the key ingredient, keeping the partial-token gradients statistically correct and thereby preserving the reliability of the learned policy. While the reported results cover mathematical reasoning benchmarks, NAT's flexibility and orthogonality to existing RL methods make it a valuable addition to the field. Further work should examine its generalizability to other domains, the variance introduced by subsampling, and the design of stronger selection schemes.
Recommendations
- ✓ Researchers should explore the generalizability of NAT to various RL tasks and environments.
- ✓ Developing token selection schemes that reduce gradient variance while preserving the Horvitz-Thompson unbiasedness guarantee would strengthen NAT's reliability.