Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning

arXiv:2603.00296v1 | Announce Type: new

Abstract: Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model's on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
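The abstract mentions optimizing with a "unified outcome-process advantage within group-relative policy optimization" but does not give a formula. The sketch below shows one plausible reading: subtract each trajectory's summed step penalties from its outcome reward, then normalize within the rollout group, GRPO-style. The function name and the exact combination rule are illustrative assumptions, not the paper's stated method.

```python
def grpo_advantages(outcome_rewards, step_penalties_per_traj):
    """Illustrative unified outcome-process advantage under GRPO-style
    group normalization. For a group of rollouts answering the same
    prompt, combine each trajectory's outcome reward with its summed
    step-level length penalties, then standardize across the group."""
    unified = [r - sum(p)
               for r, p in zip(outcome_rewards, step_penalties_per_traj)]
    mean = sum(unified) / len(unified)
    var = sum((u - mean) ** 2 for u in unified) / len(unified)
    std = var ** 0.5 or 1.0  # avoid division by zero for uniform groups
    return [(u - mean) / std for u in unified]

# Two rollouts: one correct (reward 1.0) with a small length penalty,
# one incorrect (reward 0.0) with a larger penalty.
print(grpo_advantages([1.0, 0.0], [[0.1], [0.3]]))  # → [1.0, -1.0]
```

Standardizing the combined signal within each group keeps the length penalty on the same scale as the outcome reward, so neither term dominates the policy gradient.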

Executive Summary

The article proposes Step-wise Adaptive Penalization (SWAP), a reinforcement learning framework for length-efficient chain-of-thought reasoning in large models. SWAP allocates length reduction across reasoning steps according to each step's intrinsic contribution, estimated from the model's on-policy log-probability improvement toward the correct answer. Experiments report an average 64.3% reduction in reasoning length together with a 5.7% accuracy improvement over the base model, addressing the overthinking problem in which longer chains raise cost without improving accuracy.
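The step-importance signal described in the abstract — the on-policy log-probability improvement toward the correct answer — can be sketched as a simple difference of log-probabilities before and after each step. The exact estimator is not given in the abstract; this minimal Python version assumes we can score the correct answer after each reasoning prefix (the toy log-prob values are invented for illustration).

```python
def step_importance(logp_answer_after):
    """Given logp(correct answer | prefix up to step i) for i = 0..T,
    where index 0 is the prompt alone, return each step's importance as
    its log-probability improvement toward the correct answer."""
    return [logp_answer_after[i + 1] - logp_answer_after[i]
            for i in range(len(logp_answer_after) - 1)]

# Toy trace: the answer becomes more likely mainly after step 3.
logps = [-5.0, -4.2, -4.1, -2.0, -1.9]
print([round(x, 3) for x in step_importance(logps)])  # → [0.8, 0.1, 2.1, 0.1]
```

A large positive value marks a step that genuinely advanced the reasoning; values near zero flag steps that are candidates for compression.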

Key Points

  • Introduction of Step-wise Adaptive Penalization (SWAP) framework
  • Allocation of length reduction across steps based on intrinsic contribution
  • Estimation of step importance from the model's on-policy log-probability improvement toward the correct answer
  • Treatment of excess length as a penalty mass redistributed to penalize low-importance steps more heavily
  • Optimization with a unified outcome-process advantage within group-relative policy optimization
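The abstract's idea of treating excess length as a "penalty mass" redistributed toward low-importance steps can be illustrated with a small sketch. The softmax weighting, temperature, and length budget below are assumptions for illustration; the paper's actual allocation rule is not specified in the abstract.

```python
import math

def redistribute_penalty(step_lengths, importances, budget, temperature=1.0):
    """Illustrative penalty-mass redistribution: the total token count
    beyond `budget` becomes a penalty mass, split across steps with
    softmax weights on *negative* importance, so low-importance steps
    absorb more of the penalty while high-importance steps are spared."""
    excess = max(0, sum(step_lengths) - budget)  # total penalty mass
    if excess == 0:
        return [0.0] * len(step_lengths)
    weights = [math.exp(-imp / temperature) for imp in importances]
    z = sum(weights)
    return [excess * w / z for w in weights]

lengths = [40, 120, 30, 60]   # tokens per reasoning step
imps = [0.8, 0.1, 2.1, 0.1]   # step importances (e.g. log-prob gains)
penalties = redistribute_penalty(lengths, imps, budget=150)
print([round(p, 1) for p in penalties])  # → [18.9, 38.0, 5.1, 38.0]
```

Note that the penalties sum to the total excess (100 tokens here), and the most important step (importance 2.1) receives the smallest share, matching the abstract's goal of compressing redundant steps while preserving essential reasoning.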

Merits

Efficient Reasoning

SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7%, lowering test-time compute cost without sacrificing answer quality.

Demerits

Complexity

SWAP introduces additional machinery, including on-policy step-importance estimation and step-level penalty allocation, which increases training complexity and computational cost relative to a single trajectory-level length penalty and may demand more expertise to implement and tune.

Expert Commentary

The SWAP framework is a meaningful step toward length-efficient chain-of-thought reasoning. By allocating length reduction across steps according to each step's intrinsic contribution, it compresses redundant reasoning while preserving essential steps, directly targeting the overthinking problem in large models. The reported results (64.3% shorter reasoning alongside 5.7% higher accuracy) support the approach, though further work is needed to characterize its training overhead and its generalization across domains and model families.

Recommendations

  • Further experimentation to explore the applicability of SWAP to various domains and models
  • Investigation into the potential integration of SWAP with other optimization techniques to enhance its effectiveness
