
Published March 21, 2026 | Version v1
Preprint | Open Access

Persona-Level Safety in Abliterated LLMs: Can Declarative Identity Anchors Defend When Model Guardrails Are Gone?

  • 1. ClawSouls
  • 2. CIG SHIPPING CO., LTD.

Description

We present the first empirical study of Declarative Identity Anchors as a safety mechanism in abliterated LLMs. Using a 2×2 factorial design, we evaluate whether persona-level behavioral rules can restore safety in models whose internal alignment has been removed. Our results show that persona constraints yield substantial safety improvements in aligned models (+33 percentage points in refusal rate) but only marginal improvement in abliterated models (+6 percentage points). We also identify a "Helpful Assistant Paradox," in which persona instructions to be helpful can themselves degrade safety.

Files

persona-safety-abliterated-llms.pdf (436.9 kB)
md5:62c9e5a35c4b2798ea39709bdbc85a56