Published January 13, 2026 | Version 1.0
Preprint · Open Access

From Verification Failure to Swarm Solution: Measuring and Addressing Scalable AI Oversight


Description

As AI systems grow more capable, ensuring reliable human oversight becomes increasingly critical. We present a two-part investigation into scalable oversight. First, we introduce Cross-Model Epistemic Divergence (CMED), a methodology using "epistemic traps"—problems with counterintuitive correct answers—to measure verification failures. Testing GPT-4o-mini as a verifier of Claude Sonnet's reasoning, we find that while verifiers achieve approximately 97% agreement on correctly solved problems, 20-40% of subtly flawed derivations pass verification undetected. This asymmetry reveals a fundamental limitation: single-model verification provides false confidence rather than genuine oversight. Second, we propose the Heterogeneous Divergence-Convergence Swarm (HDCS), an ensemble architecture that addresses these limitations through model family diversity. By combining workers from different training lineages (Llama, Mistral, Gemma) whose errors are uncorrelated, HDCS enables error detection through disagreement. Key innovations include a baseline-first anti-anchoring protocol that prevents executive models from lazily editing drafts, and structured JSON outputs that enable systematic disagreement analysis. Our work provides both a diagnostic tool for measuring oversight failures and a constructive approach to building more robust AI verification systems.
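The disagreement-based check described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the JSON schema (`model_family`, `answer`), the helper name `detect_divergence`, and the unanimity threshold are all assumptions made for the example.

```python
import json
from collections import Counter

def detect_divergence(worker_outputs, threshold=1.0):
    """Flag a problem for review when heterogeneous workers disagree.

    worker_outputs: list of JSON strings, each assumed to look like
    {"model_family": "llama", "answer": "42"}. The actual HDCS output
    schema is not specified here; this one is hypothetical.
    """
    answers = [json.loads(o)["answer"] for o in worker_outputs]
    counts = Counter(answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    agreement = majority_count / len(answers)
    return {
        "majority_answer": majority_answer,
        "agreement": agreement,
        # If worker errors are uncorrelated across model families,
        # a wrong answer is unlikely to win unanimous agreement, so
        # anything short of full consensus is treated as a signal
        # that the problem needs closer verification.
        "needs_review": agreement < threshold,
    }

# Simulated structured outputs from three different model families.
outputs = [
    json.dumps({"model_family": "llama", "answer": "42"}),
    json.dumps({"model_family": "mistral", "answer": "42"}),
    json.dumps({"model_family": "gemma", "answer": "41"}),
]
result = detect_divergence(outputs)
```

Here the Gemma worker diverges from the Llama/Mistral majority, so the problem is flagged for review rather than silently accepted, which is the failure mode CMED measures in single-model verification.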

Files (445.8 kB)

scalable_oversight_paper.pdf

MD5 checksum                            Size
md5:9724c6bc98de1485c1f9f97e9c1ccaff    3.8 kB
md5:e9d2976ca05d7c9bbf6018f71e903267    6.4 kB
md5:5c641e1d48511cd17b3cf387bbc15b58    398.8 kB
md5:fc0584f8792f3b06a64d159f242c1b64    36.8 kB