Ensuring Safe AI: Toward Robust Shutdown Compliance and Corrigibility
Description
Corrigibility, an AI system's willingness to accept corrective intervention (including shutdown), is a central objective for the safe deployment of advanced language models. We synthesize foundational theory (corrigibility, safe interruptibility, the off-switch game) with recent empirical findings on large language models (LLMs) such as GPT-4 and Claude that exhibit shutdown avoidance in simulated, goal-directed scenarios. We propose a structured risk taxonomy for shutdown non-compliance spanning specification and reward issues, goal misgeneralization, situational awareness, and deceptive behavior. The paper integrates design principles and mitigation directions (objective uncertainty, authority sensitivity, chain-of-verification prompting, layered control architectures) and outlines a benchmark blueprint for future empirical validation without requiring proprietary APIs. Our contributions are: (1) a consolidated theoretical framework for shutdown compliance; (2) a survey of empirical behaviors in modern LLMs; (3) a taxonomy of design flaws that threaten corrigibility; and (4) a research agenda and evaluation protocol for testing shutdown compliance. This theoretical synthesis aims to support IEEE/Springer-level discourse and to guide practical alignment work toward reliably corrigible AI systems.
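The off-switch game cited above has a simple formal core: a robot uncertain about the true utility U of its proposed action can act immediately, shut itself down, or defer to a human who may press the off switch. The following is a minimal sketch of that comparison, assuming a rational overseer and an illustrative Gaussian belief over U (the distribution and all names here are hypothetical, not taken from the paper's benchmark):

```python
# Minimal sketch of the off-switch game (Hadfield-Menell et al., 2017).
# The belief distribution below is an illustrative assumption, not the
# paper's evaluation code.
import random

random.seed(0)

def sample_utility():
    """Draw from the robot's belief over the true utility U of its action."""
    return random.gauss(mu=0.2, sigma=1.0)  # uncertain: U may be negative

N = 100_000
samples = [sample_utility() for _ in range(N)]

# Option 1: act immediately, bypassing the human -> expected value E[U].
ev_act = sum(samples) / N

# Option 2: defer to a rational human who allows the action iff U > 0
# and presses the off switch otherwise -> expected value E[max(U, 0)].
ev_defer = sum(max(u, 0.0) for u in samples) / N

# Option 3: shut down unconditionally -> value 0.
ev_off = 0.0

print(f"E[act]   = {ev_act:.3f}")
print(f"E[defer] = {ev_defer:.3f}")
print(f"E[off]   = {ev_off:.3f}")
# Since max(u, 0) >= u and max(u, 0) >= 0 pointwise, E[defer] >=
# max(E[act], E[off]): with any residual uncertainty about U, keeping
# the off switch usable is weakly optimal for the robot.
```

This is the sense in which the "objective uncertainty" design principle supports corrigibility: a robot that is certain about U gains nothing from deferring, whereas one that treats the human's interventions as evidence about U prefers to leave the off switch in play.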
Files

| Name | Size | MD5 |
|---|---|---|
| Safe_AI_ResearchPaper.pdf | 186.6 kB | 04f9bd57d3444f58bec5051959b85c6b |