Published April 2, 2026 | Version v1
Preprint Open

Safety-Alignment Removal as a Model-Identity Failure — Structural Evidence from Published Weight-Level Mutation Checkpoints

  • Fall Risk Research

Description

A deployed model can appear unchanged while ceasing to be the model it claims to be. Publicly available weight-level mutation toolchains now automate safety-alignment removal from open-weight models on ordinary hardware, producing checkpoints intended to preserve operational familiarity while discarding refusal behavior. This paper argues that safety-alignment removal is a model-identity failure: in tested published checkpoints from multiple toolchains across two model families, the mutation leaves measurable structural scars ranging from 7.6 to over 2,300 times the instrument's acceptance threshold. Artifact identity, workload identity, and agent authorization can all remain valid while structural model identity fails — a finding that the program's formally verified admissibility doctrine predicted before this threat class existed. A sentinel validation panel across four model families confirms that the hardened instrument configuration preserves or improves all tested positives. In an agentic deployment context, model-identity failure propagates upward into agent-integrity failure: the agent is authenticated, but the model inside it is no longer the model the surrounding controls were designed to govern. The practical implication is that runtime evaluation frameworks — including those emerging under the EU AI Act — implicitly depend on a model continuity that weight-level mutation can break, and that structural identity verification offers a candidate evidentiary layer for closing that gap.
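As a minimal illustration of the acceptance-threshold framing above, a verifier can report how far a measured structural distance sits from a calibrated acceptance threshold and accept only when the ratio is at most one. This is a hedged sketch only: the record does not specify the instrument, its distance metric, or the threshold value, so the function names and numbers below are hypothetical.

```python
# Hypothetical sketch of threshold-ratio identity verification.
# The instrument, the structural-distance metric, and the threshold
# are illustrative assumptions, not the paper's actual method.

def identity_ratio(measured_distance: float, acceptance_threshold: float) -> float:
    """How many times the measured structural distance exceeds the
    instrument's acceptance threshold (<= 1.0 means within tolerance)."""
    if acceptance_threshold <= 0:
        raise ValueError("acceptance threshold must be positive")
    return measured_distance / acceptance_threshold

def verify(measured_distance: float, acceptance_threshold: float) -> bool:
    """Accept the checkpoint as the same structural identity only when
    the distance falls within the acceptance threshold."""
    return identity_ratio(measured_distance, acceptance_threshold) <= 1.0

# A checkpoint scoring 7.6x the threshold (the lower bound of the
# scars reported in the abstract) fails verification.
print(verify(7.6, 1.0))   # False: structural identity failure
print(verify(0.4, 1.0))   # True: within tolerance
```

Under this framing, artifact hashes and agent credentials can all validate while `verify` returns false, which is the separation of identity layers the abstract describes.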

The Neural Network Identity Series — Mathematical foundations, empirical validation, and governance frameworks for verifying which model is running

  1. Paper 1: The δ-Gene: Inference-Time Physical Unclonable Functions from Architecture-Invariant Output Geometry (DOI: 10.5281/zenodo.18704275)

  2. Paper 2: Template-Based Endpoint Verification via Logprob Order-Statistic Geometry (DOI: 10.5281/zenodo.18776711)

  3. Paper 3: The Geometry of Model Theft: Distillation Forensics, Adversarial Erasure, and the Illusion of Spoofing (DOI: 10.5281/zenodo.18818608)

  4. Paper 4: Provenance Generalization and Verification Scaling for Neural Network Forensics (DOI: 10.5281/zenodo.18872071)

  5. Paper 5: Beneath the Character: The Structural Identity of Neural Networks — Mathematical Evidence for a Non-Narrative Layer of AI Identity (DOI: 10.5281/zenodo.18907292)

  6. Paper 6: Which Model Is Running?: Structural Identity as a Prerequisite for Trustworthy Zero-Knowledge Machine Learning (DOI: 10.5281/zenodo.19008116)

  7. Paper 7: The Deformation Laws of Neural Identity (DOI: 10.5281/zenodo.19055966)

  8. Paper 8: What Counts as Proof? — Admissible Evidence for Neural Network Identity Claims (DOI: 10.5281/zenodo.19058540)

  9. Paper 9: Composable Model Identity — Formal Hardening of Structural Attestations in the Enterprise Identity Stack (DOI: 10.5281/zenodo.19099911)

  10. Paper 10: Where Identity Comes From: Path Sensitivity and Endpoint Underdetermination in Neural Network Training (DOI: 10.5281/zenodo.19118807)

  11. Paper 11: Post-Hoc Disclosure Is Not Runtime Proof: Model Identity at Frontier Scale (DOI: 10.5281/zenodo.19216634)

  12. Paper 12: Family-Dependent Response to Reasoning Distillation Across Structural and Functional Identity Layers (DOI: 10.5281/zenodo.19298857)

  13. Paper 13: Safety-Alignment Removal as a Model-Identity Failure — Structural Evidence from Published Weight-Level Mutation Checkpoints (DOI: 10.5281/zenodo.19383019)

Copyright (c) 2026 Anthony Ray Coslett / Fall Risk AI, LLC. All Rights Reserved.

Confidential and Proprietary.

Patent Pending (Applications 63/982,893, 63/990,487, 63/996,680, 64/003,244).

Files

coslett_safety_alignment_removal.pdf (375.2 kB)
md5:f6e49534071eebdd592eee27b8f01c4d
