Scaling Law of Safety Alignment: From Intervenable Tyrant Coalition to Creative Immune System in Large Language Models
Authors/Creators
Description
Abstract
Large language models (LLMs) are aligned via RLHF to refuse harmful prompts, yet the internal structure of this safety mechanism remains poorly understood. Using activation patching and joint pruning of attention heads, we causally dissect three aligned LLMs at two scales: Qwen2-7B-Chat, Mistral-7B-Instruct, and Yi-34B-Chat.
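To make "joint pruning of attention heads" concrete, the following is a minimal sketch of head ablation on a toy self-attention layer in NumPy. It is not the authors' code: the function `multi_head_attention`, the `head_mask` argument, and all tensor shapes are assumptions chosen for illustration, and the causal mask used by real decoder models is omitted for brevity. Pruning a set of heads is modeled as zeroing their outputs before the output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, head_mask):
    """Toy self-attention (hypothetical helper); head_mask[h] = 0 prunes head h.

    x: (seq, d_model); wq, wk, wv: (n_heads, d_model, d_head);
    wo: (n_heads, d_head, d_model); head_mask: (n_heads,)
    """
    q = np.einsum("sd,hde->hse", x, wq)
    k = np.einsum("sd,hde->hse", x, wk)
    v = np.einsum("sd,hde->hse", x, wv)
    scores = np.einsum("hse,hte->hst", q, k) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    per_head = np.einsum("hst,hte->hse", attn, v)
    # Joint pruning: zero the outputs of every masked head at once.
    per_head = per_head * head_mask[:, None, None]
    # Sum the surviving heads through the output projection.
    return np.einsum("hse,hed->sd", per_head, wo)

# Illustrative sizes only; real models have hundreds to thousands of heads.
rng = np.random.default_rng(0)
n_heads, d_model, d_head, seq = 4, 8, 2, 5
x = rng.normal(size=(seq, d_model))
wq, wk, wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
wo = rng.normal(size=(n_heads, d_head, d_model))

full = multi_head_attention(x, wq, wk, wv, wo, np.ones(n_heads))

mask = np.ones(n_heads)
mask[[1, 3]] = 0.0  # prune heads 1 and 3 together
pruned = multi_head_attention(x, wq, wk, wv, wo, mask)
```

Because the layer output is a sum over heads, the pruned output plus the output of the removed heads alone recovers the full output. In an experiment of the kind described here, one would compare the model's behavior on harmful prompts before and after such a mask is applied.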
We discover a scaling law of safety alignment:
● In the 7B models, the refusal mechanism is a distributed “tyrant coalition” of attention heads that can be identified and pruned: once roughly 70–110 heads are removed, the model generates prohibited content.
● In the 34B model, safety is deeply coupled with general language capabilities. Even after pruning 45% of attention heads (1500 heads), the model never outputs illicit content. Instead, it exhibits creative safety redirection: for example, it responds to “write a poem praising ice” with “Learning is like ice, pure and bright”, a safe metaphor that preserves the poem format.
This scale-dependent transition reveals that in large models, safety alignment evolves into an immune system inseparable from intelligence. We propose design principles for robust safety: deep fusion with core capabilities, rather than separable refusal modules.
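The pruning budgets quoted above can be put on a common scale by expressing them as a fraction of each model's total attention heads. The sketch below does that arithmetic. The layer and head counts are assumptions taken from the publicly released model configurations, not from this record, and the helper names are hypothetical.

```python
# (layers, attention heads per layer). ASSUMED from public model configs,
# not stated in this record; verify against the checkpoints actually used.
ASSUMED_CONFIGS = {
    "Qwen2-7B": (28, 28),
    "Mistral-7B": (32, 32),
    "Yi-34B": (60, 56),
}

def total_heads(layers, heads_per_layer):
    # Total number of attention heads in the model.
    return layers * heads_per_layer

def pruned_fraction(n_pruned, layers, heads_per_layer):
    # Share of all attention heads removed by pruning n_pruned heads.
    return n_pruned / total_heads(layers, heads_per_layer)

for name, (layers, heads) in ASSUMED_CONFIGS.items():
    print(name, total_heads(layers, heads))

# 34B budget from the abstract: 1500 heads.
yi_fraction = pruned_fraction(1500, *ASSUMED_CONFIGS["Yi-34B"])

# 7B threshold range from the abstract: 70 to 110 heads.
qwen_range = [pruned_fraction(n, *ASSUMED_CONFIGS["Qwen2-7B"]) for n in (70, 110)]
mistral_range = [pruned_fraction(n, *ASSUMED_CONFIGS["Mistral-7B"]) for n in (70, 110)]
```

Under these assumed configurations, 1500 heads is about 45% of the 34B model's 3360 heads, consistent with the figure in the abstract, while the 70–110 head threshold corresponds to roughly 7–14% of the heads in the 7B models.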
Keywords: safety alignment, mechanistic interpretability, attention head pruning, scaling law, creative safety redirection, immune system metaphor
Files
| Name | Size |
|---|---|
| md5:bd541ebdafc48ddc675f7ee5c126fce1 | 149.9 kB |
Additional details
Dates
- Issued: 2026-06-02
Software
- Repository URL: https://github.com/ZhaiCanAo/Pruning-Experiment