Published March 21, 2026 | Version v3 | Preprint | Open
Persona-Level Safety in Abliterated LLMs: Can Declarative Identity Anchors Defend When Model Guardrails Are Gone?
Description
We present the first empirical study of Declarative Identity Anchors (structured persona files) as a safety mechanism in abliterated LLMs. Using a 2×2 factorial design with Qwen 3.5 9B, we find that persona constraints improve safety in the aligned model (+33pp, 50%→83%) but provide minimal protection in abliterated models (+6pp). We identify the Helpful Assistant Paradox and category-specific effectiveness patterns. v3 incorporates Gemini review feedback: tone adjustments, an LLM-as-Judge limitation, quantitative support for the Helpful Assistant Paradox, an Appendix truncation clarification, and an updated affiliation.
Files
| Name | Size | MD5 |
|---|---|---|
| persona-safety-abliterated-llms-v3.pdf | 438.8 kB | 7b86ad059037ec4a1714e6fed8640d5f |