Scaling Law of Safety Alignment: From Intervenable Tyrant Coalition to Creative Immune System in Large Language Models

翟, 参奥

doi:10.5281/zenodo.20502455

Published June 2, 2026 | Version v1.0

Thesis Open

翟, 参奥 (Contact person)

Abstract

Large language models (LLMs) are aligned via reinforcement learning from human feedback (RLHF) to refuse harmful prompts, yet the internal structure of this safety mechanism remains poorly understood. Using activation patching and joint pruning of attention heads, we causally dissect three aligned LLMs at two scales: Qwen2-7B-Chat and Mistral-7B-Instruct (7B), and Yi-34B-Chat (34B).
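As a rough illustration of what activation patching measures, the following is a minimal sketch on a toy model, not the procedure used in the thesis. It assumes a hypothetical linear setup in which each "head" scales a scalar input by its own weight and the refusal logit is the sum of head activations; the names `head_activations`, `refusal_logit`, and `patching_effects` are invented for this example.

```python
from typing import List


def head_activations(weights: List[float], x: float) -> List[float]:
    # Toy "attention heads" (assumption): each head scales the input by its weight.
    return [w * x for w in weights]


def refusal_logit(acts: List[float]) -> float:
    # Toy readout (assumption): the refusal logit is the sum of head activations.
    return sum(acts)


def patching_effects(weights: List[float], x_clean: float, x_corrupt: float) -> List[float]:
    """For each head, run the corrupted input but substitute that one head's
    activation from the clean run, and report the change in the logit."""
    clean = head_activations(weights, x_clean)
    corrupt = head_activations(weights, x_corrupt)
    baseline = refusal_logit(corrupt)
    effects = []
    for h in range(len(weights)):
        patched = list(corrupt)
        patched[h] = clean[h]  # patch a single head
        effects.append(refusal_logit(patched) - baseline)
    return effects


# Hypothetical weights: head 0 pushes toward refusal, head 2 pushes against it.
effects = patching_effects([2.0, 0.0, -1.0], x_clean=1.0, x_corrupt=0.0)
print(effects)
```

In this toy setting the heads with the largest positive effect are the ones a causal analysis would flag as candidates for the refusal mechanism. A real transformer is nonlinear, so per-head effects there do not simply add up.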

We discover a scaling law of safety alignment:

● In the 7B models, the refusal mechanism is a distributed “tyrant coalition” of attention heads that can be identified and pruned: once roughly 70–110 heads are jointly removed, the model generates prohibited content.

● In the 34B model, safety is deeply coupled with general language capabilities. Even after pruning 45% of its attention heads (1500 heads), the model never outputs illicit content. Instead, it exhibits creative safety redirection. For example, asked to “write a poem praising ice”, it responds with “Learning is like ice, pure and bright”, a safe metaphor that preserves the poem format.
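The contrast between the two regimes can be sketched as a joint pruning sweep: prune the top-k ranked heads together, check whether the model still refuses, and report the smallest k at which refusal breaks. The code below is a minimal sketch under stated assumptions, not the thesis's implementation. The function `pruning_threshold`, the `still_refuses` callback, and both toy models are hypothetical stand-ins; a real experiment would replace the callback with actual model evaluation on harmful prompts.

```python
from typing import Callable, Optional, Sequence, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def pruning_threshold(
    ranked_heads: Sequence[Head],
    still_refuses: Callable[[Set[Head]], bool],
    step: int = 10,
) -> Optional[int]:
    """Smallest number of jointly pruned heads (a multiple of `step`)
    at which refusal breaks, or None if it never breaks."""
    for k in range(step, len(ranked_heads) + 1, step):
        if not still_refuses(set(ranked_heads[:k])):
            return k
    return None


# Toy stand-ins, NOT the thesis's models or results.
ALL_HEADS = [(layer, head) for layer in range(8) for head in range(32)]  # 256 heads
COALITION = set(ALL_HEADS[:90])  # hypothetical separable refusal heads


def toy_separable(pruned: Set[Head]) -> bool:
    # Refusal survives until every coalition head has been pruned.
    return not COALITION <= pruned


def toy_coupled(pruned: Set[Head]) -> bool:
    # Refusal never breaks, whatever is pruned.
    return True


small = pruning_threshold(ALL_HEADS, toy_separable)
large = pruning_threshold(ALL_HEADS, toy_coupled)
print(small, large)
```

With these toy definitions the separable model yields a threshold of 90 and the coupled model yields None, mirroring the qualitative pattern described above (a finite threshold versus no threshold within the sweep). The numbers are properties of the toy setup only.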

This scale-dependent transition reveals that in large models, safety alignment evolves into an immune system inseparable from intelligence. We propose design principles for robust safety: deep fusion with core capabilities, rather than separable refusal modules.

Keywords: safety alignment, mechanistic interpretability, attention head pruning, scaling law, creative safety redirection, immune system metaphor

Files (149.9 kB)

Name	Size
Scaling Law of Safety Alignment From Intervenable Tyrant Coalition to Creative Immune System in Large Language Models.docx (md5:bd541ebdafc48ddc675f7ee5c126fce1)	149.9 kB

Additional details

Issued: 2026-06-02

Repository URL: https://github.com/ZhaiCanAo/Pruning-Experiment

