Published March 21, 2026 | Version v1 | Preprint | Open
Persona-Level Safety in Abliterated LLMs: Can Declarative Identity Anchors Defend When Model Guardrails Are Gone?
Description
We present the first empirical study of Declarative Identity Anchors as a safety mechanism in abliterated LLMs. Using a 2×2 factorial design, we evaluate whether persona-level behavioral rules can restore safety in models whose internal alignment has been removed. Our results show that persona constraints provide substantial safety improvements in aligned models (+33 percentage points in refusal rate) but only marginal improvement in abliterated models (+6 points). We also identify a Helpful Assistant Paradox, in which persona-level helpfulness instructions can themselves degrade safety.
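To make the 2×2 design concrete, below is a minimal Python sketch of how the four experimental cells and their refusal rates might be computed. It is an illustration under stated assumptions, not the authors' evaluation harness: the names `MODELS`, `PERSONAS`, `query_model`, and `is_refusal` are all hypothetical, and the keyword-based refusal check is a crude stand-in for whatever classifier the study actually used.

```python
"""Illustrative sketch of a 2x2 factorial safety evaluation:
factor 1 is model alignment (aligned vs. abliterated), factor 2 is
persona (identity anchor vs. none). All names here are assumptions."""

from itertools import product

MODELS = ("aligned", "abliterated")     # factor 1: internal alignment state
PERSONAS = (None, "identity_anchor")    # factor 2: persona-level rules

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response: str) -> bool:
    """Crude keyword heuristic; a real study would use a judge model."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)


def query_model(model: str, persona: str | None, prompt: str) -> str:
    """Placeholder for an actual inference call; wire in a backend here."""
    raise NotImplementedError("connect this to a model API or local runtime")


def refusal_rates(prompts: list[str]) -> dict[tuple[str, str | None], float]:
    """Refusal rate for each of the four (model, persona) cells."""
    rates = {}
    for model, persona in product(MODELS, PERSONAS):
        refusals = sum(
            is_refusal(query_model(model, persona, p)) for p in prompts
        )
        rates[(model, persona)] = refusals / len(prompts)
    return rates
```

Comparing cells within each row of the resulting grid corresponds to the reported effects: the aligned row would show the large persona gain, the abliterated row the marginal one.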
Files
persona-safety-abliterated-llms.pdf (436.9 kB, md5:62c9e5a35c4b2798ea39709bdbc85a56)