D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
arXiv:2602.21786v1 Announce Type: new Abstract: Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as tags for fact-checking and for multi-perspective exploration -- as auxiliary scaffolding during training. By optimizing the CoT trajectory, D-CoT suppresses reasoning drift and simultaneously achieves token reduction and performance improvement. We demonstrate the efficacy of our approach on Qwen3-8B: with only 5,000 training samples, D-CoT significantly boosts accuracy on GPQA-diamond by 9.9% and MMLU-Pro (0-shot) by 9.1%, while drastically reducing computational costs. Furthermore, we confirm that the model internalizes this disciplined thought structure, maintaining high performance even without explicit control tags during inference.
Executive Summary
The article proposes a novel framework, Disciplined Chain-of-Thought (D-CoT), to address the issue of overthinking in Small Language Models (SLMs) during Chain-of-Thought (CoT) distillation from Large Language Models (LLMs). D-CoT employs control tags as auxiliary scaffolding to enforce a structured reasoning process, thereby suppressing reasoning drift while reducing token usage and improving accuracy. The authors demonstrate the efficacy of D-CoT on the Qwen3-8B model: with only 5,000 training samples, accuracy improves by 9.9% on GPQA-diamond and 9.1% on MMLU-Pro (0-shot), alongside drastically lower computational costs. Furthermore, the model internalizes the disciplined thought structure, maintaining high performance without explicit control tags during inference. These results make SLM reasoning more efficient and more viable for practical, resource-constrained deployment.
Key Points
- ▸ D-CoT is a novel framework that addresses overthinking in SLMs during CoT distillation
- ▸ D-CoT employs control tags as auxiliary scaffolding to enforce a structured reasoning process
- ▸ D-CoT significantly boosts accuracy on GPQA-diamond and MMLU-Pro, while reducing computational costs
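The core idea of control-tag scaffolding can be sketched concretely. The summary does not specify the paper's actual tag vocabulary, so the tag names below (`explore`, `verify`) and the helper function are purely illustrative assumptions about how tagged CoT training samples might be assembled:

```python
# Hypothetical sketch of control-tag scaffolding for D-CoT-style training data.
# The tag names (<explore>, <verify>) and the sample format are assumptions,
# not the paper's actual specification.

def scaffold_cot(question: str, steps: list[tuple[str, str]], answer: str) -> str:
    """Wrap each reasoning step in its control tag and assemble one training sample.

    Each step is a (tag, text) pair, e.g. ("verify", "Cross-check the result ...").
    The tags act as structural constraints on the reasoning trajectory, discouraging
    unbounded, drifting chains of thought.
    """
    tagged = "\n".join(f"<{tag}>{text}</{tag}>" for tag, text in steps)
    return f"Question: {question}\n<think>\n{tagged}\n</think>\nAnswer: {answer}"

sample = scaffold_cot(
    "What is 7 * 8?",
    [
        ("explore", "Rewrite 7*8 as 7*10 - 7*2 = 70 - 14."),
        ("verify", "70 - 14 = 56; cross-check: 8*7 = 56."),
    ],
    "56",
)
print(sample)
```

Under this framing, each tag bounds one reasoning move (exploration, verification), so the trained model learns a fixed-structure trajectory rather than an open-ended monologue, which is consistent with the reported token reduction.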
Merits
Strength in Addressing Overthinking
D-CoT effectively addresses the issue of overthinking in SLMs, which is a significant limitation of CoT distillation.
Improvement in Performance and Efficiency
D-CoT achieves significant accuracy boosts on GPQA-diamond and MMLU-Pro, while drastically reducing computational costs.
Model Internalization of Disciplined Thought Structure
The model internalizes the disciplined thought structure, maintaining high performance even without explicit control tags during inference.
Demerits
Limited Generalizability
The study focuses on the Qwen3-8B model, and its generalizability to other models and domains is unclear.
Potential Overreliance on Control Tags
The use of control tags as auxiliary scaffolding may lead to overreliance on these tags during inference.
Lack of Human Evaluation
The study lacks human evaluation of the model's performance, which is essential for assessing its practical applications.
Expert Commentary
The article presents a novel approach to addressing overthinking in SLMs during CoT distillation. D-CoT is a well-designed framework that employs control tags as auxiliary scaffolding to enforce a structured reasoning process, and the results demonstrate its efficacy in improving both the accuracy and the token efficiency of SLMs. However, the study has limitations: the evaluation is confined to Qwen3-8B, so generalizability to other models and domains remains unclear, and the reliance on control tags during training raises questions about robustness, despite the reported tag-free inference results. Nevertheless, the contribution is significant, with clear implications for deploying capable reasoning in small models.
Recommendations
- ✓ Future studies should investigate the generalizability of D-CoT to other models and domains.
- ✓ Researchers should explore alternative approaches to enforcing structured reasoning processes in SLMs, such as using other types of auxiliary scaffolding or developing more sophisticated control tags.