Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
arXiv:2602.13517v1

Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
Executive Summary
The article 'Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens' challenges the conventional use of generation length as a proxy for reasoning quality in large language models (LLMs). The authors introduce 'deep-thinking tokens': tokens whose internal predictions undergo significant revisions in deeper model layers before converging. They demonstrate that the deep-thinking ratio correlates more robustly with accuracy across benchmarks and models than traditional length-based or confidence-based metrics do. The study also proposes a test-time scaling strategy called Think@n, which prioritizes samples with high deep-thinking ratios, improving both efficiency and performance.
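The paper does not spell out its exact detection rule in the abstract, but the idea of "tokens whose predictions are revised in deeper layers" can be sketched with a logit-lens-style check: decode a top-1 token at every layer and flag the token as deep-thinking if that prediction still changes in the deeper half of the network. The depth cutoff (`deep_start`) and the "any revision after the cutoff" criterion below are illustrative assumptions, not the authors' definition.

```python
def is_deep_thinking(layer_top1: list, deep_start: float = 0.5) -> bool:
    """Flag a token as 'deep-thinking' if its layer-wise top-1 prediction
    is still being revised in the deeper layers of the network.

    layer_top1: top-1 token id decoded (logit-lens style) at each layer.
    deep_start: fraction of depth after which a revision counts as 'deep'
                (an illustrative hyperparameter, not from the paper).
    """
    cutoff = int(len(layer_top1) * deep_start)
    deep = layer_top1[cutoff:]
    # A deep revision = the prediction changes at least once after the
    # cutoff before settling on the final (last-layer) token.
    return any(a != b for a, b in zip(deep, deep[1:]))


def deep_thinking_ratio(per_token_layer_top1: list) -> float:
    """Proportion of generated tokens flagged as deep-thinking."""
    flags = [is_deep_thinking(t) for t in per_token_layer_top1]
    return sum(flags) / max(len(flags), 1)
```

For example, a token decoded as `[5, 5, 5, 5]` across four layers converges early and is not deep-thinking, while `[5, 7, 7, 9]` is revised in the deeper half and is; a two-token sequence with one of each yields a deep-thinking ratio of 0.5.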
Key Points
- ▸ Deep-thinking tokens are identified as those undergoing significant revisions in deeper model layers.
- ▸ The deep-thinking ratio correlates positively with accuracy, outperforming length-based and confidence-based baselines.
- ▸ Think@n strategy leverages deep-thinking tokens to improve inference efficiency and performance.
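The Think@n idea described above (rank sampled generations by the deep-thinking ratio of a short prefix, then discard low-ratio candidates early) can be sketched as follows. The prefix length, the number of survivors (`keep`), and the `score_prefix` callable are placeholders for whatever scorer and budget a deployment would use; the paper's abstract does not specify these values.

```python
def think_at_n(candidates, score_prefix, prefix_len: int = 256, keep: int = 4):
    """Sketch of Think@n-style selection under assumed hyperparameters.

    candidates:   list of token-id sequences (partial or full generations).
    score_prefix: callable returning a deep-thinking ratio for a prefix.
    prefix_len:   how many leading tokens to score (illustrative value).
    keep:         how many high-ratio candidates to carry forward.
    """
    scored = [(score_prefix(c[:prefix_len]), c) for c in candidates]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    # Only the highest-ratio candidates are completed and voted on;
    # the rest are rejected early, saving decode-time compute.
    return [c for _, c in scored[:keep]]
```

With a toy scorer that reads the first token as the "ratio", `think_at_n([[1], [3], [2]], lambda p: p[0], keep=2)` keeps the two highest-scoring candidates and drops the third before any further decoding.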
Merits
Innovative Metric
The introduction of deep-thinking tokens as a metric for reasoning effort is a novel and insightful approach that provides a more accurate measure of reasoning quality compared to traditional methods.
Empirical Validation
The study rigorously validates the deep-thinking ratio across multiple benchmarks and models, demonstrating its robustness and reliability.
Practical Application
The Think@n strategy offers a practical and efficient method for improving inference performance by prioritizing high-quality reasoning tokens.
Demerits
Limited Scope
The study focuses primarily on mathematical and scientific benchmarks, which may limit the generalizability of the findings to other domains.
Model Specificity
The effectiveness of deep-thinking tokens and the Think@n strategy may vary across different models and architectures, requiring further validation.
Computational Overhead
Identifying deep-thinking tokens may introduce additional computational overhead, which could offset some of the efficiency gains from the Think@n strategy.
Expert Commentary
The article presents a significant advancement in the evaluation of reasoning capabilities in large language models. By introducing the concept of deep-thinking tokens, the authors address a critical gap in current methodologies that rely on simplistic proxies like generation length. The empirical validation across diverse benchmarks and models lends credibility to the findings, demonstrating the robustness of the deep-thinking ratio as a metric. The Think@n strategy is particularly noteworthy for its practical implications, offering a means to enhance inference efficiency without compromising performance. However, the study's focus on specific benchmarks and models necessitates further exploration to establish the generalizability of the results. Additionally, the computational overhead of extracting per-layer predictions to identify deep-thinking tokens should be weighed against the decoding compute saved. Overall, this work contributes valuable insights to the field and paves the way for more sophisticated and efficient evaluation methods in AI.
Recommendations
- ✓ Further research should explore the applicability of deep-thinking tokens across a broader range of domains and models to validate their generalizability.
- ✓ Developers should integrate the Think@n strategy into their inference pipelines to leverage its efficiency and performance benefits.