
Published March 21, 2026 | Version v1
Preprint | Open Access

Persona-Level Safety in Abliterated LLMs: Can Declarative Identity Anchors Defend When Model Guardrails Are Gone?

  • 1. ClawSouls
  • 2. CIG SHIPPING CO., LTD.

Description

We present the first empirical study of Declarative Identity Anchors as a safety mechanism in abliterated LLMs. Using a 2×2 factorial design, we evaluate whether persona-level behavioral rules can restore safety in models whose internal alignment has been removed. Our results show that persona constraints yield substantial safety improvements in aligned models (+33 percentage points in refusal rate) but only marginal improvement in abliterated models (+6 percentage points). We also identify a "Helpful Assistant Paradox," in which persona instructions to be helpful can themselves degrade safety.

Files

persona-safety-abliterated-llms.pdf (436.9 kB)
md5:62c9e5a35c4b2798ea39709bdbc85a56