D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
arXiv:2602.21786v1 Announce Type: new Abstract: Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as tags for fact-checking and for multi-perspective exploration -- as auxiliary scaffolding during training. By optimizing the CoT trajectory, D-CoT suppresses reasoning drift and simultaneously achieves token reduction and performance improvement. We demonstrate the efficacy of our approach on Qwen3-8B: with only 5,000 training samples, D-CoT significantly boosts accuracy on GPQA-diamond by 9.9% and MMLU-Pro (0-shot) by 9.1%, while drastically reducing computational costs. Furthermore, we confirm that the model internalizes this disciplined thought structure, maintaining high performance even without explicit control tags during inference.
Executive Summary
The article proposes a novel framework, Disciplined Chain-of-Thought (D-CoT), to address the issue of overthinking in Small Language Models (SLMs) during Chain-of-Thought (CoT) distillation from Large Language Models (LLMs). D-CoT employs control tags as auxiliary scaffolding to enforce a structured reasoning process, thereby suppressing reasoning drift while reducing token usage and improving accuracy. The authors demonstrate the efficacy of D-CoT on the Qwen3-8B model: with only 5,000 training samples, accuracy improves by 9.9% on GPQA-diamond and 9.1% on MMLU-Pro (0-shot), alongside drastically lower computational costs. Furthermore, the model internalizes the disciplined thought structure, maintaining high performance without explicit control tags during inference. These results make SLM reasoning more efficient and more viable for practical, resource-constrained deployment.
Key Points
- ▸ D-CoT is a novel framework that addresses overthinking in SLMs during CoT distillation
- ▸ D-CoT employs control tags as auxiliary scaffolding to enforce a structured reasoning process
- ▸ D-CoT significantly boosts accuracy on GPQA-diamond and MMLU-Pro, while reducing computational costs
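The core idea of control-tag scaffolding can be sketched concretely. The summary does not specify the paper's actual tag vocabulary, so the tag names below (`explore`, `verify`) and the helper function are purely illustrative assumptions about how tagged CoT training samples might be assembled:

```python
# Hypothetical sketch of control-tag scaffolding for D-CoT-style training data.
# The tag names (<explore>, <verify>) and the sample format are assumptions,
# not the paper's actual specification.

def scaffold_cot(question: str, steps: list[tuple[str, str]], answer: str) -> str:
    """Wrap each reasoning step in its control tag and assemble one training sample.

    Each step is a (tag, text) pair, e.g. ("verify", "Cross-check the result ...").
    The tags act as structural constraints on the reasoning trajectory, discouraging
    unbounded, drifting chains of thought.
    """
    tagged = "\n".join(f"<{tag}>{text}</{tag}>" for tag, text in steps)
    return f"Question: {question}\n<think>\n{tagged}\n</think>\nAnswer: {answer}"

sample = scaffold_cot(
    "What is 7 * 8?",
    [
        ("explore", "Rewrite 7*8 as 7*10 - 7*2 = 70 - 14."),
        ("verify", "70 - 14 = 56; cross-check: 8*7 = 56."),
    ],
    "56",
)
print(sample)
```

Under this framing, each tag bounds one reasoning move (exploration, verification), so the trained model learns a fixed-structure trajectory rather than an open-ended monologue, which is consistent with the reported token reduction.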
Merits
Strength in Addressing Overthinking
D-CoT effectively addresses the issue of overthinking in SLMs, which is a significant limitation of CoT distillation.
Improvement in Performance and Efficiency
D-CoT achieves significant accuracy boosts on GPQA-diamond and MMLU-Pro, while drastically reducing computational costs.
Model Internalization of Disciplined Thought Structure
The model internalizes the disciplined thought structure, maintaining high performance even without explicit control tags during inference.
Demerits
Limited Generalizability
The study focuses on the Qwen3-8B model, and its generalizability to other models and domains is unclear.
Potential Overreliance on Control Tags
The use of control tags as auxiliary scaffolding may lead to overreliance on these tags during inference.
Lack of Human Evaluation
The study lacks human evaluation of the model's performance, which is essential for assessing its practical applications.
Expert Commentary
The article presents a novel approach to addressing overthinking in SLMs during CoT distillation. D-CoT is a well-designed framework that employs control tags as auxiliary scaffolding to enforce a structured reasoning process, and the results demonstrate its efficacy in improving both the accuracy and the token efficiency of SLMs. However, the study has limitations: the evaluation is confined to Qwen3-8B, so generalizability to other models and domains remains unclear, and the reliance on control tags during training raises questions about robustness, despite the reported tag-free inference results. Nevertheless, the contribution is significant, with clear implications for deploying capable reasoning in small models.
Recommendations
- ✓ Future studies should investigate the generalizability of D-CoT to other models and domains.
- ✓ Researchers should explore alternative approaches to enforcing structured reasoning processes in SLMs, such as using other types of auxiliary scaffolding or developing more sophisticated control tags.