Quantifying the Necessity of Chain of Thought through Opaque Serial Depth
arXiv:2603.09786v1 Announce Type: new Abstract: Large language models (LLMs) tend to externalize their reasoning in their chain of thought, making the chain of thought a good target for monitoring. This is partially an inherent feature of the Transformer architecture: sufficiently long serial cognition must pass through the chain of thought (Korbak et al., 2025). We formalize this argument through the notion of opaque serial depth, given by the length of the longest computation that can be done without the use of interpretable intermediate steps like chain of thought. Given this formalization, we compute numeric upper bounds on the opaque serial depth of Gemma 3 models, as well as asymptotic results for additional architectures beyond standard LLMs. We also open-source an automated method that can calculate upper bounds on the opaque serial depth of arbitrary neural networks, and use it to demonstrate that Mixture-of-Experts models likely have lower depth than dense models. Overall, our results suggest that opaque serial depth is a useful tool for understanding the potential for models to do significant reasoning that is not externalized.
Executive Summary
This article formalizes the notion of opaque serial depth in large language models: the length of the longest computation a model can perform without interpretable intermediate steps such as chain of thought. The authors compute numeric upper bounds on the opaque serial depth of Gemma 3 models and argue that the metric helps assess how much significant reasoning a model can perform without externalizing it. They also open-source an automated method for calculating upper bounds on the opaque serial depth of arbitrary neural networks, and provide evidence that Mixture-of-Experts models may have lower depth than dense models. This research has significant implications for the development and evaluation of large language models, particularly for understanding their capacity to perform complex reasoning internally.
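To make the metric concrete, here is a minimal sketch of the kind of architectural upper bound the paper describes. It assumes, for illustration only, that each Transformer layer contributes a fixed number of sequential steps and that a token generated without chain of thought passes through the network once; the function name, parameter names, and layer counts below are hypothetical, not taken from the paper or from Gemma 3.

```python
# Hedged sketch: a naive architectural upper bound on "opaque serial
# depth". Assumption (not from the paper): each layer contributes a
# fixed number of serial ops, and computation without chain of thought
# is limited to a fixed number of forward passes.

def serial_depth_upper_bound(num_layers: int,
                             serial_ops_per_layer: int = 2,
                             forward_passes: int = 1) -> int:
    """Upper bound = layers x serial ops per layer x forward passes."""
    return num_layers * serial_ops_per_layer * forward_passes

# Hypothetical comparison: a dense model vs. an MoE model whose routing
# yields fewer effective sequential layers per token.
dense_bound = serial_depth_upper_bound(num_layers=62)
moe_bound = serial_depth_upper_bound(num_layers=48)
print(dense_bound, moe_bound)  # 124 96
```

Under this toy model, the MoE variant's bound is lower simply because fewer layers lie on the serial path, which mirrors the paper's qualitative claim that Mixture-of-Experts models likely have lower opaque serial depth than dense models.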
Key Points
- ▸ The article formalizes the notion of opaque serial depth in large language models.
- ▸ The authors compute numeric upper bounds on the opaque serial depth of Gemma 3 models.
- ▸ The study provides evidence that Mixture-of-Experts models may have lower depth than dense models.
Merits
Strength
The study provides a novel formalization of opaque serial depth in large language models.
Strength
The authors provide concrete numeric upper bounds on the opaque serial depth of Gemma 3 models.
Strength
The study open-sources an automated method to calculate upper bounds on the opaque serial depth of arbitrary neural networks.
Demerits
Limitation
The study primarily focuses on large language models, and it is unclear how the results can be generalized to other types of neural networks.
Limitation
The authors rely on a specific metric to measure opaque serial depth, which may not capture all aspects of complex reasoning tasks.
Expert Commentary
This study represents a significant step forward in understanding opaque serial depth in large language models. The formalization of this concept and the concrete numeric bounds for Gemma 3 models are particularly noteworthy. However, the study's reliance on a single metric and its focus on large language models may limit its generalizability. Nevertheless, its contributions to the broader effort to develop explainable AI systems are substantial. Future research should build on this work by applying opaque serial depth to other types of neural networks and by developing more comprehensive metrics that capture complex reasoning tasks.
Recommendations
- ✓ Recommendation 1: Future research should explore the application of opaque serial depth to other types of neural networks and develop more comprehensive metrics to capture complex reasoning tasks.
- ✓ Recommendation 2: Researchers should investigate the potential benefits of Mixture-of-Experts models and explore their suitability for high-stakes applications where explainability is critical.