Not All Tokens Are Needed (NAT): Token-Efficient Reinforcement Learning
arXiv:2603.06619v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a key driver of progress in large language models, but scaling RL to long chain-of-thought (CoT) trajectories is increasingly constrained by backpropagation over every generated token. Even with optimized rollout engines, full-token updates can consume a large fraction of total training cost, turning token length into a hidden tax on RL. We introduce Not All Tokens Are Needed (NAT), a unified framework that makes the token budget a first-class optimization primitive. NAT updates the policy using only a selected subset of generated tokens while preserving the learning signal of full-sequence RL. The core idea is an unbiased partial-token policy-gradient estimator via Horvitz-Thompson reweighting, which ensures statistically correct gradients despite subsampling. We instantiate NAT with two simple, plug-and-play token selection schemes: Uniform Random Sampling (URS) and Random Prefix Cutting (RPC), both of which reduce forward and backward compute and memory without modifying the reward computation or rollout pipeline. Across mathematical reasoning benchmarks, NAT matches full-token GRPO performance while using as few as 50% of tokens, providing an efficient and orthogonal pathway to scaling RL beyond the limits imposed by long trajectories. In our experiments, RPC saves 18% peak GPU memory and 29% forward and backward RL training time for Qwen3-8B.
Executive Summary
Not All Tokens Are Needed (NAT) is a framework that makes the token budget a first-class primitive in reinforcement learning (RL) for large language models, reducing the cost of backpropagating through long chain-of-thought trajectories. NAT updates the policy using only a selected subset of generated tokens, and Horvitz-Thompson reweighting keeps the resulting partial-token policy gradient unbiased. The framework is instantiated with two plug-and-play token selection schemes: Uniform Random Sampling (URS) and Random Prefix Cutting (RPC). On mathematical reasoning benchmarks, NAT matches full-token GRPO performance while using as few as 50% of the tokens; for Qwen3-8B, RPC saves 18% of peak GPU memory and 29% of forward and backward RL training time. The technique thus reduces training time, memory usage, and compute for RL on large language models without modifying the reward computation or rollout pipeline.
Key Points
- ▸ NAT is a unified framework for token-efficient reinforcement learning
- ▸ Horvitz-Thompson reweighting ensures unbiased policy gradients
- ▸ URS and RPC are simple, plug-and-play token selection schemes
- ▸ NAT matches full-token GRPO performance with 50% fewer tokens
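To illustrate the core estimator, here is a minimal sketch (not the paper's implementation) of Horvitz-Thompson reweighting under Uniform Random Sampling: each token's per-token gradient contribution is kept independently with probability `p` and, if kept, reweighted by `1/p`, so the partial-token sum is unbiased for the full-token sum. The function name `ht_partial_sum` and the toy contribution values are illustrative assumptions.

```python
import random

def ht_partial_sum(contribs, include_prob, rng):
    """Unbiased Horvitz-Thompson estimate of sum(contribs) from a URS subset."""
    # Keep each token independently with probability include_prob (URS),
    # then reweight every kept term by 1/include_prob. Because each term's
    # expected contribution is include_prob * c / include_prob = c, the
    # estimator's expectation equals the full-token sum.
    return sum(c / include_prob for c in contribs if rng.random() < include_prob)

# Monte Carlo check of unbiasedness on toy per-token gradient contributions.
rng = random.Random(0)
contribs = [0.5, -1.2, 3.0, 0.7, -0.4, 2.1]
full_sum = sum(contribs)                      # full-token "gradient"
trials = 20000
avg = sum(ht_partial_sum(contribs, 0.5, rng) for _ in range(trials)) / trials
# avg is close to full_sum even though each update touches ~50% of tokens.
```

In a real RL update the same reweighting would be applied to per-token surrogate-loss terms before backpropagation; the unbiasedness argument is identical.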
Merits
Computational efficiency
NAT's token selection schemes cut forward and backward compute and memory, making RL training on long trajectories substantially cheaper.
Scalability
NAT offers an orthogonal pathway to scaling RL beyond the limits imposed by long trajectories, so its savings can be combined with other efficiency techniques such as optimized rollout engines.
Flexibility
The framework can be instantiated with various token selection schemes, allowing researchers to adapt NAT to different RL applications.
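As one example of such a scheme, Random Prefix Cutting can be sketched as below. The uniform cut-point distribution and the helper name `rpc_select` are illustrative assumptions, since the summary does not specify the paper's exact distribution; under a uniform cut, token t's inclusion probability is (T - t)/T, which fixes its Horvitz-Thompson weight.

```python
import random

def rpc_select(seq_len, rng):
    """Random Prefix Cutting: keep tokens [0, cut) for a uniform cut in 1..seq_len."""
    # Token t survives whenever cut > t, so its inclusion probability is
    # (seq_len - t) / seq_len; the Horvitz-Thompson weight is the reciprocal.
    cut = rng.randint(1, seq_len)
    selected = list(range(cut))
    weights = [seq_len / (seq_len - t) for t in selected]
    return selected, weights

# Monte Carlo check: the HT-weighted partial sum matches the full sum on average.
rng = random.Random(1)
contribs = [1.0, 2.0, -1.0, 0.5, 3.0]
full_sum = sum(contribs)
trials = 40000
total = 0.0
for _ in range(trials):
    sel, w = rpc_select(len(contribs), rng)
    total += sum(w[i] * contribs[t] for i, t in enumerate(sel))
avg = total / trials
```

A prefix-only scheme is attractive in practice because the kept tokens are contiguous, which keeps attention masking and activation caching simple compared with scattered token subsets.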
Demerits
Limitation in generalizability
The performance of NAT may be specific to certain RL tasks or environments, and its generalizability to other domains is unclear.
Potential bias in token selection
Although Horvitz-Thompson reweighting keeps the gradient unbiased in expectation, subsampling increases estimator variance, and any selection scheme whose inclusion probabilities are mis-specified would bias the policy updates.
Expert Commentary
NAT offers a promising answer to the computational cost of RL on long chain-of-thought trajectories: by updating the policy on only a subset of generated tokens, it turns token count into a controllable budget rather than a hidden tax. Horvitz-Thompson reweighting is the key ingredient, keeping the partial-token gradients statistically correct and thereby preserving the reliability of the learned policy. While the reported results cover mathematical reasoning benchmarks, NAT's flexibility and orthogonality to existing RL methods make it a valuable addition to the field. Further work should examine its generalizability to other domains, the variance introduced by subsampling, and the design of stronger selection schemes.
Recommendations
- ✓ Researchers should explore the generalizability of NAT to various RL tasks and environments.
- ✓ Developing token selection schemes that reduce gradient variance while preserving the Horvitz-Thompson unbiasedness guarantee would strengthen NAT's reliability.