CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles
arXiv:2603.00523v1
Abstract: Mechanistic circuit discovery is notoriously sensitive to arbitrary analyst choices, especially pruning thresholds and feature dictionaries, often yielding brittle "one-shot" explanations with no principled notion of uncertainty. We reframe circuit discovery as an uncertainty-quantification problem over these analytic degrees of freedom. Our method, CIRCUS, constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns each edge a stability score (the fraction of configurations that retain it), and extracts a strict-consensus circuit consisting only of edges that appear in all views. This produces a threshold-robust "core" circuit while explicitly surfacing contingent alternatives and enabling rejection of low-agreement structure. CIRCUS requires no retraining and adds negligible overhead, since it aggregates structure across already-computed pruned graphs. On Gemma-2-2B and Llama-3.2-1B, strict consensus circuits are ~40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power, and they outperform a same-edge-budget baseline (union pruned to match the consensus size). We further validate causal relevance with activation patching, where consensus-identified nodes consistently beat matched non-consensus controls (p=0.0004). Overall, CIRCUS provides a practical, uncertainty-aware framework for reporting trustworthy, auditable mechanistic circuits with an explicit core/contingent/noise decomposition.
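The stability score and strict-consensus step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the edge representation, the `prune` rule (keep edges whose absolute weight meets a threshold), the function names, and the toy weights are all assumptions made for illustration. Only the two definitions come from the abstract: an edge's stability score is the fraction of configurations that retain it, and the consensus circuit is the set of edges present in every view.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (source node, target node); hypothetical representation


def prune(raw: Dict[Edge, float], threshold: float) -> Set[Edge]:
    """Assumed pruning rule: keep edges whose |weight| meets the threshold."""
    return {edge for edge, weight in raw.items() if abs(weight) >= threshold}


def stability_scores(views: List[Set[Edge]]) -> Dict[Edge, float]:
    """Fraction of pruned views that retain each edge."""
    if not views:
        raise ValueError("need at least one pruned view")
    counts = Counter(edge for view in views for edge in view)
    return {edge: n / len(views) for edge, n in counts.items()}


def strict_consensus(views: List[Set[Edge]]) -> Set[Edge]:
    """Edges that appear in every pruned view."""
    if not views:
        raise ValueError("need at least one pruned view")
    return set.intersection(*views)


# Toy raw attribution run (made-up weights), pruned under three thresholds.
raw = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.05}
views = [prune(raw, t) for t in (0.01, 0.1, 0.6)]

scores = stability_scores(views)
core = strict_consensus(views)
union = set.union(*views)
```

In this toy case the union holds all three edges, while the consensus keeps only the edge that survives every threshold. Because both steps operate on pruned graphs that already exist, the aggregation reduces to set operations, which fits the abstract's claim of negligible overhead.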
Executive Summary
The article introduces CIRCUS, a circuit-discovery method that treats analyst choices such as pruning thresholds and feature dictionaries as a source of uncertainty in mechanistic explanations. CIRCUS prunes a single raw attribution run under multiple configurations to build an ensemble of attribution graphs, scores each edge by the fraction of configurations that retain it, and extracts a strict-consensus circuit of the edges present in every configuration. The result is a threshold-robust 'core' circuit, with contingent alternatives surfaced explicitly and low-agreement structure open to rejection. On Gemma-2-2B and Llama-3.2-1B, the strict-consensus circuits are roughly 40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power.
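The core/contingent/noise decomposition mentioned in the abstract could be sketched as a three-way split on how many configurations retain each edge. The abstract does not say where the boundary between contingent and noise lies, so the `min_votes` cutoff below is a hypothetical parameter, as are the function name and the toy vote counts.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (source node, target node); hypothetical representation


def decompose(
    votes: Dict[Edge, int], n_views: int, min_votes: int
) -> Tuple[Set[Edge], Set[Edge], Set[Edge]]:
    """Split edges by the number of pruned views that retain them.

    core:       retained by all n_views configurations (strict consensus)
    contingent: retained by at least min_votes, but not all, configurations
    noise:      retained by fewer than min_votes configurations (rejected)

    min_votes is an assumed cutoff, not a value taken from the paper.
    """
    if not 1 <= min_votes <= n_views:
        raise ValueError("min_votes must lie between 1 and n_views")
    core: Set[Edge] = set()
    contingent: Set[Edge] = set()
    noise: Set[Edge] = set()
    for edge, count in votes.items():
        if count == n_views:
            core.add(edge)
        elif count >= min_votes:
            contingent.add(edge)
        else:
            noise.add(edge)
    return core, contingent, noise


# Toy vote counts over three configurations (made-up numbers).
votes = {("a", "b"): 3, ("b", "c"): 2, ("a", "c"): 1}
core_edges, contingent_edges, noise_edges = decompose(votes, n_views=3, min_votes=2)
```

Using integer vote counts rather than fractional scores avoids floating-point comparisons when testing for unanimous agreement.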
Key Points
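- CIRCUS treats analyst choices (pruning thresholds, feature dictionaries) as a source of uncertainty in mechanistic explanations
- The method constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations
- A strict-consensus circuit, made up of the edges present in every configuration, provides a threshold-robust 'core' circuit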
Merits
Robustness to Uncertainty
Because the core circuit contains only edges retained under every pruning configuration, it does not depend on any single threshold choice, and contingent or low-agreement structure is reported separately.
Efficient Computation
The method requires no retraining and adds negligible overhead, because it only aggregates structure across already-computed pruned graphs.
Demerits
Limited Applicability
The reported results cover only Gemma-2-2B and Llama-3.2-1B, so applicability to other models or datasets requires further validation.
Interpretability Challenges
A strict-consensus circuit is not guaranteed to be easy to interpret and may still require additional analysis.
Expert Commentary
CIRCUS is a meaningful step toward handling uncertainty in mechanistic explanations. By offering a robust and inexpensive framework for circuit discovery, it could improve the transparency and trustworthiness of AI models. Further research is needed to establish how widely the method applies and how interpretable its consensus circuits are. Its use across domains, and any influence on regulatory standards, remain open areas for study.
Recommendations
- Further validation of CIRCUS on diverse models and datasets
- Investigation of the method's applicability to other explainability techniques, such as feature-importance methods