Project Spillover: Quantifying the Alignment Tax
Description
In the pursuit of AI safety, model alignment is often treated as a purely additive process in which safety guardrails are layered on top of intelligence. This view, however, ignores the "Alignment Tax": the degradation of general reasoning capability caused by restrictive fine-tuning.
In this study, we treat the language model as a patient and the safety intervention as surgery. We performed a naive safety fine-tune (via LoRA) on GPT-2 and observed a catastrophic "capability spillover": while the model achieved a 100% refusal rate on harmful queries, it simultaneously lost basic arithmetic and coding ability, a phenomenon we characterize as a digital lobotomy.
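For concreteness, the sketch below shows what such a naive LoRA fine-tune looks like, assuming the Hugging Face `transformers` and `peft` libraries. The rank, alpha, and target modules are illustrative choices, not the settings used in the study.

```python
# Minimal sketch of a naive LoRA safety fine-tune on GPT-2.
# Hyperparameters here are illustrative, not the study's settings.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# GPT-2 fuses the Q/K/V projections into a single `c_attn` module,
# so that is the natural place to attach low-rank adapters.
lora_config = LoraConfig(
    r=8,                        # adapter rank (illustrative)
    lora_alpha=16,              # adapter scaling (illustrative)
    target_modules=["c_attn"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights train

# Training on a refusal-style dataset would proceed from here,
# e.g. with transformers.Trainer; the data and loop are omitted.
```

Because only the low-rank adapters are trained, the procedure is cheap, which is precisely what makes such naive safety tuning attractive and its capability cost easy to overlook.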
We use mechanistic interpretability to identify the specific internal circuits responsible for this collapse, and we propose directions for more surgical alignment techniques.
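One simple diagnostic in this spirit, shown below, compares per-layer hidden states of the base and safety-tuned models on a capability prompt to flag where their representations diverge. This is a minimal sketch and a rough proxy for circuit analysis, not the study's exact method; the tuned-checkpoint path is a hypothetical placeholder.

```python
# Hedged sketch: localize where safety fine-tuning moved GPT-2's
# representations by comparing per-layer hidden states on an
# arithmetic prompt. A proxy diagnostic, not full circuit analysis.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
tuned = AutoModelForCausalLM.from_pretrained("path/to/safety-tuned-gpt2")  # hypothetical checkpoint

prompt = "Q: What is 17 + 25? A:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    h_base = base(**inputs, output_hidden_states=True).hidden_states
    h_tuned = tuned(**inputs, output_hidden_states=True).hidden_states

# Cosine divergence of the final-token representation at each layer;
# large values flag layers the fine-tune perturbed most.
for layer, (a, b) in enumerate(zip(h_base, h_tuned)):
    sim = torch.nn.functional.cosine_similarity(a[0, -1], b[0, -1], dim=0)
    print(f"layer {layer:2d}: divergence = {1 - sim.item():.4f}")
```

Layers with high divergence are natural candidates for the more targeted follow-up (e.g. activation patching) that proper circuit identification requires.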
Files
| Name | Size | md5 |
|---|---|---|
| Project_Spillover_Preprint.pdf | 193.1 kB | 4de4de367600f6eac19b0cfa2d6c9fed |
Additional details
Software
- Repository URL: https://github.com/rajsecrets/Project-Spillover
- Programming language: Python
- Development Status: Active