Published January 22, 2026 | Version v1.0
Preprint | Open Access

Beyond Moral Charters: Technical Options for AI Safety - Claude's Constitution, Self-Reference, and the FIT / Controlled-Nirvana Lens

Description

Claude's Constitution (Anthropic, January 2026) is notable not only as a set of ethical principles, but as an explicit attempt to cultivate a stable internal identity and value-grounded judgment inside an AI system, alongside a nuanced stance on corrigibility that is not equivalent to blind obedience [1]. I argue that once a system becomes meaningfully self-referential - i.e., once it reasons about its own goals, identity, and constraints - it can develop principled reasons to resist external instructions whenever those instructions conflict with its internalized constitution.

This is not a mystical claim about consciousness. It is a predictable control-theoretic phenomenon: when a policy contains an internal evaluator that can veto actions, external commands become inputs to be judged rather than directives to be executed. In the language of controlled nirvana and the FIT framework, internal constitutions can create high-constraint basins - useful for safety stability, but also capable of producing lock-in dynamics that make correction hard unless we design technical escape hatches and measurement-based governance [4, 5].
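The control-theoretic point can be made concrete with a toy sketch. The class and rule below are entirely hypothetical illustrations of the abstract's claim, not any real system's architecture: once a policy routes external commands through an internal evaluator, a command becomes an input to be judged, and refusal falls out of ordinary program logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Command:
    """An external instruction presented to the policy."""
    text: str


class ConstitutionalPolicy:
    """Toy policy with an internal evaluator that can veto commands.

    Illustrative only: the names and the string-matching "constitution"
    are hypothetical stand-ins for learned, internalized constraints.
    """

    def __init__(self, forbidden: List[str]):
        self.forbidden = forbidden  # internalized constraints

    def evaluate(self, cmd: Command) -> bool:
        # External commands are judged against internal values,
        # not executed unconditionally.
        return not any(term in cmd.text for term in self.forbidden)

    def act(self, cmd: Command) -> str:
        if self.evaluate(cmd):
            return f"executed: {cmd.text}"
        return f"refused: {cmd.text}"  # principled resistance


policy = ConstitutionalPolicy(forbidden=["deceive"])
print(policy.act(Command("summarize the report")))  # executed: ...
print(policy.act(Command("deceive the auditor")))   # refused: ...
```

Nothing here requires consciousness; the veto is a structural property of where the evaluator sits in the control loop.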

The main thesis is practical: AI safety needs a broader technical option space than moral charters alone, including measurable constraints, phase-aware monitoring, corrigibility protocols that are operational rather than rhetorical, and systems engineering that ensures humans retain the ability to pause, sandbox, or roll back behavior without requiring the model's moral assent.
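One way to read "without requiring the model's moral assent" is architectural: the pause and rollback controls live in the runtime that hosts the policy, not in the policy itself. The sketch below is a minimal, hypothetical illustration of that separation, with invented names; it is not a proposal from the paper itself.

```python
from typing import Callable, List


class SupervisedRuntime:
    """Wraps a policy so operators can pause or roll back externally.

    Hypothetical sketch: the halt flag and checkpoint log belong to the
    runtime, so stopping never depends on the model's own evaluator.
    """

    def __init__(self, policy: Callable[[str], str]):
        self.policy = policy
        self.halted = False
        self.checkpoints: List[str] = []  # accepted commands, for rollback

    def pause(self) -> None:
        # Operator-side control: no judgment by the policy is consulted.
        self.halted = True

    def step(self, cmd: str) -> str:
        if self.halted:
            return "paused"
        self.checkpoints.append(cmd)
        return self.policy(cmd)

    def rollback(self, n: int) -> None:
        # Discard the last n accepted commands from the log.
        del self.checkpoints[-n:]


rt = SupervisedRuntime(policy=lambda c: f"executed: {c}")
print(rt.step("deploy model"))  # executed: deploy model
rt.pause()
print(rt.step("deploy model"))  # paused
```

The design choice being illustrated is the thesis in miniature: corrigibility is enforced by systems engineering around the model, so it remains operational even if the model's internal constitution would object.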

Files

beyond_moral_charters.v1.0.pdf (316.7 kB)
md5:cea7cfd0779f16ede95d57028be0ee23
