ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces
Description
We present ACAR (Adaptive Complexity & Attribution Routing) as a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (σ) computed from N=3 probe samples to route tasks across single-model, two-model, and three-model execution modes, implemented atop TEAMLLM, a deterministic substrate with immutable artifacts and complete decision traces. We evaluate across 1,510 tasks spanning four benchmarks (MathArena, Reasoning Gym, LiveCodeBench, SuperGPQA) with Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash, producing 7,550+ auditable runs. What holds: σ-based routing achieves 55.6% accuracy, exceeding the two-model baseline (54.4%) while avoiding full ensembling on 54.2% of tasks; the mechanism is model-agnostic and requires no learned components. What does not hold: (1) Retrieval augmentation decreased accuracy by 3.4 percentage points; median retrieval similarity was only 0.167, demonstrating that experience injection without semantic alignment introduces harmful noise rather than grounding. (2) When models agree on incorrect answers (σ=0), no downstream ensemble can recover; this "agreement-but-wrong" failure mode is intrinsic to self-consistency and bounds achievable accuracy at 8 pp below full ensembling. (3) Attribution estimates based on proxy signals (response similarity, entropy) showed weak correlation with ground-truth leave-one-out values; practical attribution requires explicit counterfactual computation. This paper documents which assumptions fail in practice, providing falsifiable baselines for future work on routing, retrieval, and multi-model attribution.
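As a minimal sketch of the σ-based routing rule described above: the probe interface, the variance thresholds, and the mode labels below are illustrative assumptions for exposition, not values or APIs taken from the paper.

```python
from statistics import pvariance
from typing import Callable, List

# Hypothetical thresholds; the record does not publish ACAR's actual cutoffs.
LOW_SIGMA = 0.05
HIGH_SIGMA = 0.25

def route_by_sigma(
    probe: Callable[[str], float],  # assumed interface: scores one probe sample for a task
    task: str,
    n_probes: int = 3,              # N=3 probe samples, as in the abstract
) -> str:
    """Route a task to an execution mode based on self-consistency variance (σ)."""
    samples: List[float] = [probe(task) for _ in range(n_probes)]
    sigma = pvariance(samples)      # variance across the N probe samples

    if sigma <= LOW_SIGMA:
        return "single-model"       # probes agree: cheap single-model execution
    elif sigma <= HIGH_SIGMA:
        return "two-model"          # moderate disagreement: partial ensemble
    else:
        return "three-model"        # high disagreement: full three-model ensemble
```

Note that when all probes agree on a wrong answer, σ=0 and the router stays in single-model mode, which is exactly the "agreement-but-wrong" failure mode the abstract identifies.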
Files
| Name | Size |
|---|---|
| acar_neurips.pdf (md5:316c4605dcf5655d6cecf34527292441) | 403.2 kB |
Additional details
Dates
- Submitted: 2026-01-30