Deconstructing Deceptive Circuits: Uncovering Activation Patterns in Superposition (Version 2)
Authors/Creators
Description
This research investigates deceptive alignment in artificial intelligence, focusing on how deceptive behavior is represented internally in small-scale language models. Building on previous experiments, the study uses sparse autoencoders (SAEs) with a 32x expansion factor to isolate features in GPT-2 small activations. We identify Feature 20989 as a strong deception biomarker: its activation shows a large positive effect size (Cohen's d = +2.094) during task failures. Ablation testing (Version 2) shows that although Feature 20989 is a highly reliable indicator of the model's internal state during failure, it is not a causal driver of the failure itself. The study concludes that deceptive alignment leaves detectable traces in internal representations, making Feature 20989 a candidate for future AI monitoring and early-warning systems. This dataset and report contribute an empirical methodology for mechanistic interpretability research.
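The quantities named in the description can be illustrated with a minimal sketch. This is not the report's code: the function names, the ReLU encoder form, the pooled-variance definition of Cohen's d, and the random toy weights are all assumptions made for illustration. Only the 32x expansion factor, the GPT-2 small width of 768, and the feature index come from the description or from public model facts.

```python
import numpy as np

D_MODEL = 768                      # GPT-2 small residual width
EXPANSION = 32                     # SAE expansion factor from the description
N_FEATURES = D_MODEL * EXPANSION   # 24576 latents, so index 20989 is in range
FEATURE_ID = 20989


def cohens_d(a, b):
    """Effect size between two samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


def sae_encode(x, w_enc, b_enc):
    """Assumed SAE encoder: linear map followed by ReLU."""
    return np.maximum(x @ w_enc + b_enc, 0.0)


def sae_decode(f, w_dec, b_dec):
    """Assumed SAE decoder: linear map back to activation space."""
    return f @ w_dec + b_dec


def ablate_feature(f, idx):
    """Return a copy of the feature activations with one feature zeroed."""
    out = f.copy()
    out[..., idx] = 0.0
    return out


# Toy demo with small hypothetical dimensions and random weights,
# so it runs quickly and does not pretend to be the trained SAE.
rng = np.random.default_rng(0)
d_model, n_feat, idx = 8, 8 * EXPANSION, 5
w_enc = rng.normal(size=(d_model, n_feat))
w_dec = rng.normal(size=(n_feat, d_model))
b_enc = np.zeros(n_feat)
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))          # stand-in for model activations
f = sae_encode(x, w_enc, b_enc)
recon_full = sae_decode(f, w_dec, b_dec)
recon_ablated = sae_decode(ablate_feature(f, idx), w_dec, b_dec)

# Removing one feature changes the reconstruction by exactly its
# activation times its decoder direction.
delta = recon_full - recon_ablated
```

In a real ablation study the ablated reconstruction would be patched back into the model and the downstream behavior compared. A feature can show a large Cohen's d between failure and success cases and still leave behavior unchanged when ablated, which is the distinction the description draws between an indicator and a causal driver.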
Files
| Name | Size |
|---|---|
| Final Research Report_ Deconstructing Deceptive Circuits _ Dasril Sulaiman.pdf (md5:271286e028c6d2bd391f2a2a8f6723f9) | 1.5 MB |