Quantifying the Necessity of Chain of Thought through Opaque Serial Depth

Jonah Brown-Cohen, David Lindner, Rohin Shah

arXiv:2603.09786v1 Announce Type: new Abstract: Large language models (LLMs) tend to externalize their reasoning in their chain of thought, making the chain of thought a good target for monitoring. This is partially an inherent feature of the Transformer architecture: sufficiently long serial cognition must pass through the chain of thought (Korbak et al., 2025). We formalize this argument through the notion of opaque serial depth, given by the length of the longest computation that can be done without the use of interpretable intermediate steps like chain of thought. Given this formalization, we compute numeric upper bounds on the opaque serial depth of Gemma 3 models, as well as asymptotic results for additional architectures beyond standard LLMs. We also open-source an automated method that can calculate upper bounds on the opaque serial depth of arbitrary neural networks, and use it to demonstrate that Mixture-of-Experts models likely have lower depth than dense models. Overall, our results suggest that opaque serial depth is a useful tool for understanding the potential for models to do significant reasoning that is not externalized.

Executive Summary

This article presents a formalization of opaque serial depth in large language models, a measure of the length of the longest computation a model can perform without interpretable intermediate steps such as chain of thought. The authors compute numeric upper bounds on the opaque serial depth of Gemma 3 models and argue that the metric helps gauge how much reasoning a model could perform without externalizing it. They also open-source an automated method for calculating upper bounds on the opaque serial depth of arbitrary neural networks, and provide evidence that Mixture-of-Experts models may have lower opaque serial depth than dense models. This research has implications for the development and evaluation of large language models, particularly for understanding and monitoring their capacity for non-externalized reasoning.
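The core quantity is easy to state in graph terms: treat a model's computation as a DAG, mark the nodes whose outputs are externalized (e.g., emitted chain-of-thought tokens) as interpretable, and take the longest chain of consecutive non-interpretable operations. The following is a minimal illustrative sketch of that idea, not the paper's open-sourced method; the `opaque_serial_depth` helper and the node names are hypothetical:

```python
from functools import lru_cache

def opaque_serial_depth(edges, interpretable):
    """Toy upper bound on opaque serial depth: the longest chain of
    consecutive non-interpretable ops in a computation DAG.

    edges: dict mapping each node to a list of successor nodes
    interpretable: set of nodes whose outputs are externalized
    """
    # Collect every node that appears as a source or a successor.
    nodes = set(edges) | {v for succs in edges.values() for v in succs}

    @lru_cache(maxsize=None)
    def depth_from(node):
        # A chain of opaque computation is broken the moment it
        # passes through an interpretable node.
        if node in interpretable:
            return 0
        return 1 + max((depth_from(s) for s in edges.get(node, [])), default=0)

    return max(depth_from(n) for n in nodes)

# Example: a 4-op chain a -> b -> c -> d where op "c" emits an
# interpretable token. The longest fully opaque run is a -> b.
graph = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(opaque_serial_depth(graph, interpretable={"c"}))  # → 2
```

Externalizing an intermediate step can only lower this bound: with no interpretable nodes at all, the same chain has opaque serial depth 4.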

Key Points

  • The article formalizes the notion of opaque serial depth in large language models.
  • The authors compute numeric upper bounds on the opaque serial depth of Gemma 3 models.
  • The study provides evidence that Mixture-of-Experts models may have lower depth than dense models.

Merits

Strength

The study provides a novel and formalized approach to understanding opaque serial depth in large language models.

Strength

The authors provide concrete evidence and numeric upper bounds on the opaque serial depth of Gemma 3 models.

Strength

The study open-sources an automated method for calculating upper bounds on the opaque serial depth of arbitrary neural networks.

Demerits

Limitation

The study primarily focuses on large language models, and it is unclear how the results can be generalized to other types of neural networks.

Limitation

The authors rely on a specific metric to measure opaque serial depth, which may not capture all aspects of complex reasoning tasks.

Expert Commentary

This study represents a significant step forward in understanding opaque serial depth in large language models. The formalization of this concept and the provision of concrete evidence on Gemma 3 models are particularly noteworthy. However, the study's reliance on a specific metric and its focus on large language models may limit its generalizability. Nevertheless, the study's contributions to the broader effort to develop explainable AI systems are substantial. Future research should build on this work by exploring the application of opaque serial depth to other types of neural networks and developing more comprehensive metrics to capture complex reasoning tasks.

Recommendations

  • Recommendation 1: Future research should explore the application of opaque serial depth to other types of neural networks and develop more comprehensive metrics to capture complex reasoning tasks.
  • Recommendation 2: Researchers should investigate the potential benefits of Mixture-of-Experts models and explore their suitability for high-stakes applications where explainability is critical.
