Project Spillover: Quantifying the Alignment Tax
Description
In the pursuit of AI safety, model alignment is often treated as a purely additive process in which safety guardrails are layered on top of intelligence. This view, however, ignores the "Alignment Tax": the degradation of general reasoning capability caused by restrictive fine-tuning.
In this study, we treat the language model as a patient and the safety intervention as surgery. We performed a naive safety fine-tune (via LoRA) on GPT-2 and observed a catastrophic "capability spillover": while the model achieved a 100% refusal rate on harmful queries, it simultaneously lost basic arithmetic and coding ability, a phenomenon we characterize as a digital lobotomy.
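For concreteness, the sketch below shows what such a naive LoRA fine-tune looks like, assuming the Hugging Face `transformers` and `peft` libraries. The rank, alpha, and target modules are illustrative choices, not the settings used in the study.

```python
# Minimal sketch of a naive LoRA safety fine-tune on GPT-2.
# Hyperparameters here are illustrative, not the study's settings.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# GPT-2 fuses the Q/K/V projections into a single `c_attn` module,
# so that is the natural place to attach low-rank adapters.
lora_config = LoraConfig(
    r=8,                        # adapter rank (illustrative)
    lora_alpha=16,              # adapter scaling (illustrative)
    target_modules=["c_attn"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights train

# Training on a refusal-style dataset would proceed from here,
# e.g. with transformers.Trainer; the data and loop are omitted.
```

Because only the low-rank adapters are trained, the procedure is cheap, which is precisely what makes such naive safety tuning attractive and its capability cost easy to overlook.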
We use mechanistic interpretability to identify the specific internal circuits responsible for this collapse, and we propose directions for more surgical alignment techniques.
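One simple diagnostic in this spirit, shown below, compares per-layer hidden states of the base and safety-tuned models on a capability prompt to flag where their representations diverge. This is a minimal sketch and a rough proxy for circuit analysis, not the study's exact method; the tuned-checkpoint path is a hypothetical placeholder.

```python
# Hedged sketch: localize where safety fine-tuning moved GPT-2's
# representations by comparing per-layer hidden states on an
# arithmetic prompt. A proxy diagnostic, not full circuit analysis.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
tuned = AutoModelForCausalLM.from_pretrained("path/to/safety-tuned-gpt2")  # hypothetical checkpoint

prompt = "Q: What is 17 + 25? A:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    h_base = base(**inputs, output_hidden_states=True).hidden_states
    h_tuned = tuned(**inputs, output_hidden_states=True).hidden_states

# Cosine divergence of the final-token representation at each layer;
# large values flag layers the fine-tune perturbed most.
for layer, (a, b) in enumerate(zip(h_base, h_tuned)):
    sim = torch.nn.functional.cosine_similarity(a[0, -1], b[0, -1], dim=0)
    print(f"layer {layer:2d}: divergence = {1 - sim.item():.4f}")
```

Layers with high divergence are natural candidates for the more targeted follow-up (e.g. activation patching) that proper circuit identification requires.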
Files
| Name | Size | md5 |
|---|---|---|
| Project_Spillover_Preprint.pdf | 193.1 kB | 4de4de367600f6eac19b0cfa2d6c9fed |
Additional details
Software
- Repository URL: https://github.com/rajsecrets/Project-Spillover
- Programming language: Python
- Development Status: Active