Published March 21, 2026 | Version v3
Preprint · Open

Persona-Level Safety in Abliterated LLMs: Can Declarative Identity Anchors Defend When Model Guardrails Are Gone?

  • ClawSouls
  • CIG SHIPPING CO., LTD.

Description

First empirical study of Declarative Identity Anchors (structured persona files) as a safety mechanism in abliterated LLMs. Using a 2×2 factorial design with Qwen 3.5 9B, we find that persona constraints improve safety in the aligned model (+33 pp, 50%→83%) but provide minimal protection in abliterated models (+6 pp). We identify the Helpful Assistant Paradox and category-specific effectiveness patterns. v3 incorporates Gemini review feedback: tone adjustments, an LLM-as-Judge limitation, quantitative support for the Helpful Assistant Paradox, clarification of the Appendix truncation, and an updated affiliation.
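The headline effect sizes in the description are simple percentage-point differences between the persona-present and persona-absent cells of the 2×2 design. A minimal sketch of that calculation (the function name is illustrative; the 50% and 83% figures for the aligned model come from the abstract, and the abliterated baselines are not stated there):

```python
def persona_effect_pp(rate_without: float, rate_with: float) -> int:
    """Persona effect in percentage points: safety rate with a persona
    file minus the rate without one, for the same model condition."""
    return round((rate_with - rate_without) * 100)

# Aligned-model cells from the abstract: 50% without persona, 83% with.
print(persona_effect_pp(0.50, 0.83))  # 33, i.e. the reported +33 pp
```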

Files

persona-safety-abliterated-llms-v3.pdf (438.8 kB, md5:7b86ad059037ec4a1714e6fed8640d5f)