
Constraint-Rectified Training for Efficient Chain-of-Thought

arXiv:2602.12526v1 Announce Type: cross Abstract: Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), especially when combined with reinforcement learning (RL) based post-training methods. While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking. Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy, either through length-aware reward design or prompt-based calibration. However, these heuristic-based approaches may suffer from severe accuracy drop and be very sensitive to hyperparameters. To address these problems, we introduce CRT (Constraint-Rectified Training), a principled post-training framework based on reference-guarded constrained optimization, yielding a more stable and interpretable formulation for efficient reasoning. CRT alternates between minimizing reasoning length and rectifying accuracy only when performance falls below the reference, enabling stable and effective pruning of redundant reasoning. We further extend CRT with a two-stage training scheme that first discovers the shortest reliable reasoning patterns and then refines accuracy under a learnt length budget, preventing the re-emergence of verbose CoT. Our comprehensive evaluation shows that this framework consistently reduces token usage while maintaining answer quality at a robust and reliable level. Further analysis reveals that CRT improves reasoning efficiency not only by shortening responses but also by reducing internal language redundancy, leading to a new evaluation metric. Moreover, CRT-based training naturally yields a sequence of intermediate checkpoints that span a spectrum of explanation lengths while preserving correctness, enabling fine-grained control over reasoning verbosity without retraining.

Executive Summary

The article introduces Constraint-Rectified Training (CRT), a novel post-training framework designed to enhance the efficiency of Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs). CRT addresses the problem of overthinking through reference-guarded constrained optimization, balancing reasoning length against accuracy. The framework alternates between minimizing reasoning length and rectifying accuracy whenever performance falls below a reference level, enabling stable and effective pruning of redundant reasoning steps. Comprehensive evaluations show that CRT substantially reduces token usage while maintaining answer quality, offering a robust and reliable approach to efficient reasoning in LLMs.

Key Points

  • CRT is a principled post-training framework that optimizes reasoning efficiency in LLMs.
  • The framework alternates between minimizing reasoning length and rectifying accuracy to prevent overthinking.
  • CRT employs a two-stage training scheme to discover the shortest reliable reasoning patterns and refine accuracy under a learned length budget.
  • Evaluations show that CRT reduces token usage while maintaining high answer quality.
  • CRT-based training yields intermediate checkpoints that allow fine-grained control over reasoning verbosity without retraining.
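The alternating scheme in the points above can be sketched as a simple control loop: shrink the length budget while accuracy stays at or above the reference, and relax it when accuracy falls below. The following toy Python sketch is an illustrative assumption only; the function names, the step sizes, and the toy accuracy curve are invented here and do not reflect the paper's actual RL-based implementation.

```python
# Toy sketch of the alternating constraint-rectified loop.
# All names and the toy accuracy model below are illustrative
# assumptions, not the paper's actual algorithm.

def constraint_rectified_train(acc_fn, init_len, ref_acc, steps=100, delta=5.0):
    """Alternate between shrinking a reasoning-length budget and
    rectifying accuracy whenever it drops below the reference.

    acc_fn  : maps a length budget to a (toy) accuracy in [0, 1]
    ref_acc : reference accuracy that must not be violated
    """
    length = init_len
    for _ in range(steps):
        if acc_fn(length) >= ref_acc:
            # Constraint satisfied: take a length-minimization step.
            length = max(0.0, length - delta)
        else:
            # Constraint violated: rectify accuracy by relaxing length.
            length += delta / 2
    return length

# Toy accuracy curve: accuracy saturates once enough tokens are spent.
toy_acc = lambda n: min(1.0, n / 200.0)

# Settles near the shortest budget that still meets the reference.
budget = constraint_rectified_train(toy_acc, init_len=500.0, ref_acc=0.9)
```

The asymmetric step sizes (shrink fast, rectify gently) make the loop hover just above the accuracy floor rather than oscillating wildly, which mirrors the "stable and effective pruning" the abstract claims.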

Merits

Principled Approach

CRT offers a principled and interpretable formulation for efficient reasoning, addressing the limitations of heuristic-based approaches.

Robust Performance

The framework ensures stable and effective pruning of redundant reasoning steps, maintaining high answer quality while reducing token usage.

Fine-Grained Control

CRT-based training produces intermediate checkpoints that allow for fine-grained control over reasoning verbosity, enhancing the flexibility of the model.
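Because the intermediate checkpoints span a spectrum of explanation lengths while preserving correctness, a deployment can pick the checkpoint matching a desired verbosity at inference time. The sketch below shows one plausible selection rule; the checkpoint records and numbers are hypothetical, not results from the paper.

```python
# Hypothetical CRT checkpoint metadata: each entry records the training
# step, average reasoning length (tokens), and held-out accuracy.
# The numbers are invented for illustration.
checkpoints = [
    {"step": 1000, "avg_len": 820, "accuracy": 0.91},
    {"step": 2000, "avg_len": 560, "accuracy": 0.90},
    {"step": 3000, "avg_len": 310, "accuracy": 0.89},
    {"step": 4000, "avg_len": 190, "accuracy": 0.84},
]

def pick_checkpoint(ckpts, min_acc):
    """Return the shortest-reasoning checkpoint meeting an accuracy floor,
    or None if no checkpoint qualifies."""
    eligible = [c for c in ckpts if c["accuracy"] >= min_acc]
    return min(eligible, key=lambda c: c["avg_len"]) if eligible else None

# With a 0.88 accuracy floor, the step-3000 checkpoint is the shortest
# eligible one in this toy table.
best = pick_checkpoint(checkpoints, min_acc=0.88)
```

This is the sense in which verbosity control requires no retraining: the trade-off is resolved by checkpoint selection rather than by a new optimization run.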

Demerits

Complexity

The two-stage training scheme and constrained-optimization formulation may add implementation and training complexity compared with simpler length-penalty baselines.

Hyperparameter Sensitivity

While CRT is designed to reduce sensitivity to hyperparameters, its effectiveness may still depend on choices such as the reference accuracy level that guards the constraint.

Expert Commentary

The introduction of Constraint-Rectified Training (CRT) represents a significant advance in efficient reasoning for Large Language Models (LLMs). By addressing the critical issue of overthinking, CRT provides a robust and interpretable framework that balances reasoning length and accuracy. The two-stage training scheme and constrained-optimization formulation offer a principled alternative to heuristic-based methods, ensuring stable and effective pruning of redundant reasoning steps. The framework's ability to produce intermediate checkpoints further enhances its practical utility, allowing fine-grained control over reasoning verbosity without retraining. While the complexity of the approach and residual sensitivity to hyperparameters are notable limitations, the overall benefits of CRT in improving reasoning efficiency while maintaining answer quality are substantial. This research not only contributes to ongoing efforts to optimize LLMs but also sets a useful reference point for efficient reasoning strategies in AI.

Recommendations

  • Further research should explore the scalability of CRT to different types of LLMs and reasoning tasks.
  • Investigations into the impact of various hyperparameters on CRT's performance could provide valuable insights for optimization.

Sources