Published March 21, 2026 | Version v3 | Preprint | Open
Persona-Level Safety in Abliterated LLMs: Can Declarative Identity Anchors Defend When Model Guardrails Are Gone?
Description
We present the first empirical study of Declarative Identity Anchors (structured persona files) as a safety mechanism in abliterated LLMs. Using a 2×2 factorial design with Qwen 3.5 9B, we find that persona constraints improve safety in the aligned model (+33pp, 50%→83%) but provide minimal protection in abliterated models (+6pp). We identify the Helpful Assistant Paradox and category-specific effectiveness patterns. v3 incorporates Gemini review feedback: tone adjustments, an LLM-as-Judge limitation, quantitative support for the Helpful Assistant Paradox, an Appendix truncation clarification, and an updated affiliation.
Files
| Name | Size | MD5 |
|---|---|---|
| persona-safety-abliterated-llms-v3.pdf | 438.8 kB | 7b86ad059037ec4a1714e6fed8640d5f |