Published April 2, 2026 | Version v1
Preprint Open

Safety-Alignment Removal as a Model-Identity Failure — Structural Evidence from Published Weight-Level Mutation Checkpoints

  • Fall Risk Research

Description

A deployed model can appear unchanged while ceasing to be the model it claims to be. Publicly available weight-level mutation toolchains now automate safety-alignment removal from open-weight models on ordinary hardware, producing checkpoints intended to preserve operational familiarity while discarding refusal behavior. This paper argues that safety-alignment removal is a model-identity failure: in tested published checkpoints from multiple toolchains across two model families, the mutation leaves measurable structural scars ranging from 7.6 to over 2,300 times the instrument's acceptance threshold. Artifact identity, workload identity, and agent authorization can all remain valid while structural model identity fails — a finding that the program's formally verified admissibility doctrine predicted before this threat class existed. A sentinel validation panel across four model families confirms that the hardened instrument configuration preserves or improves all tested positives. In an agentic deployment context, model-identity failure propagates upward into agent-integrity failure: the agent is authenticated, but the model inside it is no longer the model the surrounding controls were designed to govern. The practical implication is that runtime evaluation frameworks — including those emerging under the EU AI Act — implicitly depend on a model continuity that weight-level mutation can break, and that structural identity verification offers a candidate evidentiary layer for closing that gap.
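As a minimal illustration of the acceptance-threshold framing above, a verifier can report how far a measured structural distance sits from a calibrated acceptance threshold and accept only when the ratio is at most one. This is a hedged sketch only: the record does not specify the instrument, its distance metric, or the threshold value, so the function names and numbers below are hypothetical.

```python
# Hypothetical sketch of threshold-ratio identity verification.
# The instrument, the structural-distance metric, and the threshold
# are illustrative assumptions, not the paper's actual method.

def identity_ratio(measured_distance: float, acceptance_threshold: float) -> float:
    """How many times the measured structural distance exceeds the
    instrument's acceptance threshold (<= 1.0 means within tolerance)."""
    if acceptance_threshold <= 0:
        raise ValueError("acceptance threshold must be positive")
    return measured_distance / acceptance_threshold

def verify(measured_distance: float, acceptance_threshold: float) -> bool:
    """Accept the checkpoint as the same structural identity only when
    the distance falls within the acceptance threshold."""
    return identity_ratio(measured_distance, acceptance_threshold) <= 1.0

# A checkpoint scoring 7.6x the threshold (the lower bound of the
# scars reported in the abstract) fails verification.
print(verify(7.6, 1.0))   # False: structural identity failure
print(verify(0.4, 1.0))   # True: within tolerance
```

Under this framing, artifact hashes and agent credentials can all validate while `verify` returns false, which is the separation of identity layers the abstract describes.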

The Neural Network Identity Series — Mathematical foundations, empirical validation, and governance frameworks for verifying which model is running

  1. Paper 1: The δ-Gene: Inference-Time Physical Unclonable Functions from Architecture-Invariant Output Geometry (DOI: 10.5281/zenodo.18704275)

  2. Paper 2: Template-Based Endpoint Verification via Logprob Order-Statistic Geometry (DOI: 10.5281/zenodo.18776711)

  3. Paper 3: The Geometry of Model Theft: Distillation Forensics, Adversarial Erasure, and the Illusion of Spoofing (DOI: 10.5281/zenodo.18818608)

  4. Paper 4: Provenance Generalization and Verification Scaling for Neural Network Forensics (DOI: 10.5281/zenodo.18872071)

  5. Paper 5: Beneath the Character: The Structural Identity of Neural Networks — Mathematical Evidence for a Non-Narrative Layer of AI Identity (DOI: 10.5281/zenodo.18907292)

  6. Paper 6: Which Model Is Running?: Structural Identity as a Prerequisite for Trustworthy Zero-Knowledge Machine Learning (DOI: 10.5281/zenodo.19008116)

  7. Paper 7: The Deformation Laws of Neural Identity (DOI: 10.5281/zenodo.19055966)

  8. Paper 8: What Counts as Proof? — Admissible Evidence for Neural Network Identity Claims (DOI: 10.5281/zenodo.19058540)

  9. Paper 9: Composable Model Identity — Formal Hardening of Structural Attestations in the Enterprise Identity Stack (DOI: 10.5281/zenodo.19099911)

  10. Paper 10: Where Identity Comes From: Path Sensitivity and Endpoint Underdetermination in Neural Network Training (DOI: 10.5281/zenodo.19118807)

  11. Paper 11: Post-Hoc Disclosure Is Not Runtime Proof: Model Identity at Frontier Scale (DOI: 10.5281/zenodo.19216634)

  12. Paper 12: Family-Dependent Response to Reasoning Distillation Across Structural and Functional Identity Layers (DOI: 10.5281/zenodo.19298857)

  13. Paper 13: Safety-Alignment Removal as a Model-Identity Failure — Structural Evidence from Published Weight-Level Mutation Checkpoints (DOI: 10.5281/zenodo.19383019)

Copyright (c) 2026 Anthony Ray Coslett / Fall Risk AI, LLC. All Rights Reserved.

Confidential and Proprietary.

Patent Pending (Applications 63/982,893, 63/990,487, 63/996,680, 64/003,244).

Files

coslett_safety_alignment_removal.pdf (375.2 kB)
md5:f6e49534071eebdd592eee27b8f01c4d
