Ensuring Safe AI: Toward Robust Shutdown Compliance and Corrigibility
Description
Corrigibility, an AI system's willingness to accept corrective intervention (including shutdown), is a central objective for the safe deployment of advanced language models. We synthesize foundational theory (corrigibility, safe interruptibility, the off-switch game) with recent empirical findings on large language models (LLMs) such as GPT-4 and Claude that exhibit shutdown avoidance in simulated, goal-directed scenarios. We propose a structured risk taxonomy for shutdown non-compliance spanning specification and reward issues, goal misgeneralization, situational awareness, and deceptive behavior. The paper integrates design principles and mitigation directions (objective uncertainty, authority sensitivity, chain-of-verification prompting, layered control architectures) and outlines a benchmark blueprint for future empirical validation without requiring proprietary APIs. Our contributions are: (1) a consolidated theoretical framework for shutdown compliance; (2) a survey of empirical behaviors in modern LLMs; (3) a taxonomy of design flaws that threaten corrigibility; and (4) a research agenda and evaluation protocol for testing shutdown compliance. This theoretical synthesis aims to support IEEE/Springer-level discourse and to guide practical alignment work toward reliably corrigible AI systems.
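The off-switch game cited above has a simple formal core: a robot uncertain about the true utility U of its proposed action can act immediately, shut itself down, or defer to a human who may press the off switch. The following is a minimal sketch of that comparison, assuming a rational overseer and an illustrative Gaussian belief over U (the distribution and all names here are hypothetical, not taken from the paper's benchmark):

```python
# Minimal sketch of the off-switch game (Hadfield-Menell et al., 2017).
# The belief distribution below is an illustrative assumption, not the
# paper's evaluation code.
import random

random.seed(0)

def sample_utility():
    """Draw from the robot's belief over the true utility U of its action."""
    return random.gauss(mu=0.2, sigma=1.0)  # uncertain: U may be negative

N = 100_000
samples = [sample_utility() for _ in range(N)]

# Option 1: act immediately, bypassing the human -> expected value E[U].
ev_act = sum(samples) / N

# Option 2: defer to a rational human who allows the action iff U > 0
# and presses the off switch otherwise -> expected value E[max(U, 0)].
ev_defer = sum(max(u, 0.0) for u in samples) / N

# Option 3: shut down unconditionally -> value 0.
ev_off = 0.0

print(f"E[act]   = {ev_act:.3f}")
print(f"E[defer] = {ev_defer:.3f}")
print(f"E[off]   = {ev_off:.3f}")
# Since max(u, 0) >= u and max(u, 0) >= 0 pointwise, E[defer] >=
# max(E[act], E[off]): with any residual uncertainty about U, keeping
# the off switch usable is weakly optimal for the robot.
```

This is the sense in which the "objective uncertainty" design principle supports corrigibility: a robot that is certain about U gains nothing from deferring, whereas one that treats the human's interventions as evidence about U prefers to leave the off switch in play.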
Files

| Name | Size | MD5 |
|---|---|---|
| Safe_AI_ResearchPaper.pdf | 186.6 kB | 04f9bd57d3444f58bec5051959b85c6b |