
ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces


Ramchand Kumaresan

arXiv:2602.21231v1 Announce Type: cross Abstract: We present ACAR (Adaptive Complexity and Attribution Routing), a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (sigma) computed from N=3 probe samples to route tasks across single-model, two-model, and three-model execution modes. The system is implemented on top of TEAMLLM, a deterministic execution substrate with immutable artifacts and complete decision traces. We evaluate ACAR on 1,510 tasks spanning four benchmarks: MathArena, Reasoning Gym, LiveCodeBench, and SuperGPQA, using Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash, producing more than 7,550 auditable runs. Results show that sigma-based routing achieves 55.6 percent accuracy, exceeding the two-model baseline of 54.4 percent while avoiding full ensembling on 54.2 percent of tasks. The routing mechanism is model-agnostic and requires no learned components. We also document negative results. First, retrieval augmentation reduced accuracy by 3.4 percentage points, as median retrieval similarity was only 0.167, demonstrating that experience injection without semantic alignment introduces noise rather than grounding. Second, when models agree on incorrect answers (sigma equals zero), no downstream ensemble can recover; this agreement-but-wrong failure mode is intrinsic to self-consistency and bounds achievable accuracy at approximately eight percentage points below full ensembling. Third, attribution estimates based on proxy signals such as response similarity and entropy showed weak correlation with ground-truth leave-one-out values, indicating that practical attribution requires explicit counterfactual computation. This work documents which assumptions fail in practice and provides falsifiable baselines for future research on routing, retrieval, and multi-model attribution.

Executive Summary

This article presents ACAR, a novel framework for adaptive complexity routing in multi-model ensembles. ACAR leverages self-consistency variance (sigma), computed from three probe samples per task, to route tasks across single-, two-, and three-model execution modes, achieving 55.6% accuracy while avoiding full ensembling on 54.2% of tasks. The framework is model-agnostic and requires no learned components. The authors also document negative results: the limitations of retrieval augmentation, the 'agreement-but-wrong' failure mode, and the weakness of attribution estimates based on proxy signals. These findings provide falsifiable baselines for future research on routing, retrieval, and multi-model attribution. A comprehensive evaluation on 1,510 tasks across four benchmarks and three models strengthens these findings, and ACAR's auditable decision traces and deterministic execution substrate make it a valuable tool for studying multi-model orchestration under auditable conditions.

Key Points

  • ACAR achieves 55.6% accuracy in adaptive complexity routing for multi-model ensembles
  • The framework is model-agnostic and requires no learned components
  • The authors document negative results, including limitations of retrieval augmentation and attribution estimates

Merits

Strength in Framework Design

ACAR's routing mechanism, driven by self-consistency variance, is a key strength: it routes tasks accurately without relying on learned components.
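The routing idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the abstract states that sigma is computed from N=3 probe samples, but the exact variance formula and the mode thresholds are not published here, so the disagreement measure and cutoffs below are hypothetical stand-ins.

```python
from collections import Counter

def probe_sigma(probe_answers):
    """Disagreement score over N probe samples: the fraction of
    answers that deviate from the majority answer (0.0 = unanimous).
    A stand-in for the paper's self-consistency variance."""
    majority_count = Counter(probe_answers).most_common(1)[0][1]
    return 1.0 - majority_count / len(probe_answers)

def route(probe_answers, low=0.0, high=0.5):
    """Map sigma to an execution mode. The thresholds `low` and
    `high` are illustrative; the paper does not publish its cutoffs."""
    sigma = probe_sigma(probe_answers)
    if sigma <= low:
        return "single-model"   # unanimous probes: cheap path
    if sigma <= high:
        return "two-model"      # mild disagreement
    return "three-model"        # high disagreement: full ensemble
```

Note that when all three probes agree (`sigma == 0.0`), the router takes the cheapest path regardless of whether the consensus is correct, which is exactly the agreement-but-wrong failure mode the paper documents.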

Comprehensive Evaluation

The study's evaluation on 1,510 tasks across four benchmarks and three models provides a robust assessment of ACAR's performance.

Auditable Decision Traces

Built on TEAMLLM, a deterministic execution substrate with immutable artifacts, ACAR records complete decision traces, making every routing decision reproducible and auditable.

Demerits

Limitation in Attribution Estimates

The study highlights the weakness of attribution estimates based on proxy signals such as response similarity and entropy, emphasizing the need for explicit counterfactual computation.
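Explicit counterfactual attribution of the kind the paper calls for can be sketched as leave-one-out: each model's contribution is the accuracy drop when it is removed from the ensemble. The `toy_accuracy` function below is entirely hypothetical, invented only to make the sketch runnable; in practice `ensemble_accuracy` would re-evaluate each model subset on the benchmark.

```python
def leave_one_out_attribution(models, ensemble_accuracy):
    """Ground-truth LOO attribution: each model's contribution is
    the full-ensemble accuracy minus the accuracy without it.
    `ensemble_accuracy` must accept any subset of models."""
    full = ensemble_accuracy(models)
    return {
        m: full - ensemble_accuracy([x for x in models if x != m])
        for m in models
    }

def toy_accuracy(subset):
    """Hypothetical accuracy model for illustration only: the best
    individual model plus a small bonus per extra ensemble member."""
    solo = {"model_a": 0.50, "model_b": 0.45, "model_c": 0.40}
    return max(solo[m] for m in subset) + 0.05 * (len(subset) - 1)
```

The cost is the point: LOO requires one extra evaluation per model, which is why proxy signals are tempting, and why the paper's finding that those proxies correlate weakly with LOO values matters.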

Agreement-but-Wrong Failure Mode

The 'agreement-but-wrong' failure mode, where all models agree on an incorrect answer, limits the achievable accuracy of ACAR and of any self-consistency-based framework, since zero variance gives the router no signal to escalate.

Retrieval Augmentation Limitations

The study shows that retrieval augmentation can reduce accuracy, highlighting the importance of semantic alignment in experience injection.
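The paper attributes the 3.4-point accuracy drop to a median retrieval similarity of only 0.167, i.e. injected experience was mostly unrelated to the query. One natural mitigation, sketched here under assumptions not taken from the paper, is to gate injection by an embedding-similarity threshold; the 0.5 cutoff and the in-memory store are both illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_grounded(query_emb, memory, threshold=0.5):
    """Inject only retrieved experience whose similarity to the
    query clears `threshold`; below it, retrieval adds noise rather
    than grounding. `memory` is a list of (embedding, text) pairs.
    The threshold value is illustrative, not from the paper."""
    return [
        text for emb, text in memory
        if cosine_similarity(query_emb, emb) >= threshold
    ]
```

With such a gate, a task whose best retrieval hit sits near the paper's reported median of 0.167 would simply receive no injected experience, falling back to the ungrounded prompt.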

Expert Commentary

This study makes a significant contribution to the field of multi-model ensemble research, providing a novel framework for adaptive complexity routing and documenting the limitations of existing approaches. The authors' emphasis on auditable decision traces and deterministic execution substrates highlights the importance of trustworthy AI systems. However, the study's findings also underscore the need for more advanced methods in AI research, particularly in the areas of attribution estimation and counterfactual computation. As such, this study provides a valuable foundation for future research in these areas.

Recommendations

  • Future research should focus on developing more advanced methods for attribution estimation and counterfactual computation
  • The development of regulations and standards for trustworthy AI systems should prioritize auditable decision traces and deterministic execution substrates
