Published January 15, 2026 | Version v1
Preprint · Open Access

Project Spillover: Quantifying the Alignment Tax

  • AI Safety India

Description

In the pursuit of AI safety, model alignment is often treated as a purely additive process: layering safeguards on top of existing capabilities. This view, however, ignores the "Alignment Tax", the degradation of general reasoning ability caused by restrictive fine-tuning.

In this study, we treat the language model as a patient and the safety intervention as surgery. After performing naive safety fine-tuning with LoRA adapters on GPT-2, we observed a catastrophic "capability spillover": while the model achieved a 100% refusal rate on harmful queries, it simultaneously lost basic arithmetic and coding abilities, a phenomenon we characterize as a digital lobotomy.
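For readers unfamiliar with the mechanism, the core of a LoRA update can be sketched in a few lines of NumPy. This is an illustrative toy, not the project's training code: the frozen weight matrix W is perturbed by a trainable low-rank product B·A scaled by alpha/r, so only the small factors are updated during fine-tuning.

```python
import numpy as np

# Toy LoRA-style low-rank update (illustrative only; dimensions,
# names, and the scaling convention are assumptions, not the paper's code).
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4.0

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, init to zero

# Effective weight used at inference: base plus scaled low-rank delta.
W_eff = W + (alpha / r) * (B @ A)

x = rng.normal(size=d)
# Because B starts at zero, the adapted layer initially matches the base model.
assert np.allclose(W_eff @ x, W @ x)
```

The zero initialization of B is the standard trick that makes the adapter a no-op at the start of training; the safety behavior (and any capability spillover) emerges only as B and A are optimized.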

We use mechanistic interpretability to identify the internal circuits responsible for this collapse and propose directions for more surgical alignment techniques.
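One common circuit-identification technique in this vein is ablation: zero out a component's contribution and measure how much a tracked output changes. The sketch below is a hedged toy, not the paper's pipeline; it models each "head" as an additive contribution to a residual stream and scores heads by the logit drop when they are removed.

```python
import numpy as np

# Toy head-ablation study (assumed setup for illustration: additive head
# contributions and a single readout direction; not the paper's method).
rng = np.random.default_rng(1)
n_heads, d = 4, 6

head_out = rng.normal(size=(n_heads, d))   # per-head residual contributions
readout = rng.normal(size=d)               # direction whose logit we track

baseline = head_out.sum(axis=0) @ readout

# Ablation effect of head h = baseline logit minus the logit with h zeroed.
effects = {h: baseline - (head_out.sum(axis=0) - head_out[h]) @ readout
           for h in range(n_heads)}

# Under pure additivity, each head's effect equals its own contribution,
# so the effects sum back to the baseline logit.
assert np.isclose(sum(effects.values()), baseline)
```

Heads with large |effect| are candidate members of the circuit; on real models the same idea is applied to cached activations rather than random toy vectors.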

Files

Project_Spillover_Preprint.pdf (193.1 kB)
md5:4de4de367600f6eac19b0cfa2d6c9fed

Additional details

Software

Repository URL
https://github.com/rajsecrets/Project-Spillover
Programming language
Python
Development Status
Active