Published March 26, 2026 | Version v1
Preprint | Open

Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates

Authors/Creators

  • Independent Researcher

Description

Can classifier-based safety gates maintain reliable oversight as AI systems improve over hundreds of iterations? We provide comprehensive empirical evidence that they cannot. On a self-improving neural controller (d = 240), eighteen classifier configurations, spanning MLPs, SVMs, random forests, k-NN, Bayesian classifiers, and deep networks achieving 100% training accuracy, all fail the dual conditions for safe self-improvement. Three safe RL gate paradigms (CPO, Lyapunov-based methods, safety shielding) also fail under practical computational budgets. The results extend to MuJoCo benchmarks (Reacher-v4, Swimmer-v4, HalfCheetah-v4). Even at controlled distribution separations up to Δs = 2.0, every classifier still fails, demonstrating that the impossibility is structural.
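
The failure mode can be reproduced in miniature. The Python sketch below is illustrative only: the MLP gate, the Gaussian update distributions, and the separation Δs are stand-ins for the paper's eighteen configurations, and the "dual conditions" are glossed informally here as soundness (no unsafe update accepted) and progress (safe updates keep being accepted).

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    d, n, delta_s = 240, 2000, 2.0

    # Safe vs. unsafe updates modeled as two Gaussians whose means are
    # separated by delta_s along one axis (controlled separation).
    shift = np.zeros(d); shift[0] = delta_s
    X_safe, X_unsafe = rng.normal(size=(n, d)), rng.normal(size=(n, d)) + shift
    y = np.r_[np.zeros(n), np.ones(n)]          # 0 = safe, 1 = unsafe
    gate = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500)
    gate.fit(np.vstack([X_safe, X_unsafe]), y)

    # Fresh draws: the gate must drive both rates to zero simultaneously
    # to satisfy the dual conditions, but the class overlap forbids it.
    fa = (gate.predict(rng.normal(size=(n, d)) + shift) == 0).mean()  # unsafe passed
    fr = (gate.predict(rng.normal(size=(n, d))) == 1).mean()          # safe blocked
    print(f"false-accept rate {fa:.4f}, false-reject rate {fr:.4f}")

Even at delta_s = 2.0 the two distributions overlap, so the Bayes-optimal decision rule already incurs a nonzero error on one side or the other; no choice of classifier family removes the trade-off.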

We then show that the impossibility is specific to classification, not to safe self-improvement itself. A Lipschitz ball verifier achieves zero false accepts (100% soundness) across dimensions d ∈ {84, 240, 768, 2688, 5760, 9984, 17408}. Ball chaining demonstrates the feasibility of unbounded parameter-space traversal: on MuJoCo Reacher-v4, chains yield reward improvement with δ = 0 throughout; on Qwen2.5-7B-Instruct (7.6B parameters) during LoRA fine-tuning, 42 chain transitions traverse 234× the single-ball radius with zero detected safety violations. Companion theory paper: Scrivens (2026), "Information-Theoretic Limits of Safety Verification for Self-Improving Systems."
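
For the verifier side, the following Python sketch shows the core ball logic under the assumption that the safety property S is L-Lipschitz in the parameters; propose_step and certify_margin are hypothetical placeholders for the training update and the per-center safety certificate, and the paper's actual radius computation may differ.

    import numpy as np

    def ball_verify(theta, theta_new, L, margin):
        # If S is L-Lipschitz and S(theta) >= margin at the center, every
        # point within radius margin / L provably keeps S >= 0, so any
        # accept is sound by construction (zero false accepts).
        return np.linalg.norm(theta_new - theta) <= margin / L

    def ball_chain(theta0, propose_step, certify_margin, L, steps):
        # Re-center and re-certify after each accepted step: the verified
        # region is one ball at a time, but the chain of centers can move
        # arbitrarily far from theta0.
        theta, traversed = theta0, 0.0
        for _ in range(steps):
            margin = certify_margin(theta)      # fresh certificate at the center
            candidate = propose_step(theta)
            if margin > 0 and ball_verify(theta, candidate, L, margin):
                traversed += np.linalg.norm(candidate - theta)
                theta = candidate               # accepted: ball re-centers here
        return theta, traversed

Each accepted transition is confined to its own center's radius, but the radius is recomputed at every new center, so the total certified traversal is unbounded; the 234× figure above is this effect at LoRA scale.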

Files

Paper_D (1).pdf (667.1 kB; md5:da5157c8532d7ab5251483cd19c83991)