From Verification Failure to Swarm Solution: Measuring and Addressing Scalable AI Oversight
Description
As AI systems grow more capable, ensuring reliable human oversight becomes increasingly critical. We present a two-part investigation into scalable oversight. First, we introduce Cross-Model Epistemic Divergence (CMED), a methodology using "epistemic traps"—problems with counterintuitive correct answers—to measure verification failures. Testing GPT-4o-mini as a verifier of Claude Sonnet's reasoning, we find that while verifiers achieve approximately 97% agreement on correctly solved problems, 20–40% of subtly flawed derivations pass verification undetected. This asymmetry reveals a fundamental limitation: single-model verification provides false confidence rather than genuine oversight. Second, we propose the Heterogeneous Divergence-Convergence Swarm (HDCS), an ensemble architecture that addresses these limitations through model-family diversity. By combining workers from different training lineages (Llama, Mistral, Gemma), whose errors are largely uncorrelated, HDCS enables error detection through disagreement. Key innovations include a baseline-first anti-anchoring protocol that prevents executive models from lazily editing worker drafts, and structured JSON outputs that enable systematic disagreement analysis. Our work provides both a diagnostic tool for measuring oversight failures and a constructive approach to building more robust AI verification systems.
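The disagreement analysis over structured JSON outputs can be sketched minimally as follows. This is an illustrative implementation under assumed conventions, not the paper's actual code: the field name `final_answer`, the helper `detect_disagreement`, and the quorum threshold are all hypothetical, and the worker outputs are hard-coded stand-ins for responses from different model families.

```python
import json
from collections import Counter

def detect_disagreement(worker_outputs, quorum=2):
    """Parse structured JSON outputs from heterogeneous workers and
    flag problems where no single answer reaches quorum agreement.

    Hypothetical sketch: field names and thresholds are assumptions,
    not taken from the HDCS paper.
    """
    answers = []
    for raw in worker_outputs:
        try:
            answers.append(json.loads(raw)["final_answer"])
        except (json.JSONDecodeError, KeyError):
            # A malformed output cannot support consensus, so it is
            # treated as an implicit disagreement.
            answers.append(None)
    counts = Counter(a for a in answers if a is not None)
    best, best_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "consensus": best if best_count >= quorum else None,
        "flagged": best_count < quorum,  # route to human review
        "answers": answers,
    }

# Stand-in outputs from three workers of different lineages
# (e.g., Llama, Mistral, Gemma); two agree, one diverges.
outputs = [
    '{"final_answer": "7", "reasoning": "..."}',
    '{"final_answer": "7", "reasoning": "..."}',
    '{"final_answer": "12", "reasoning": "..."}',
]
result = detect_disagreement(outputs)
```

Because the workers come from different training lineages, a shared wrong answer is less likely than under a homogeneous ensemble, so unanimous or quorum agreement carries more evidential weight, while any failure to reach quorum is surfaced for review rather than silently resolved.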