There is a newer version of the record available.

Published June 3, 2026 | Version 2
Technical note Open

Deconstructing Deceptive Circuits: Uncovering Activation Patterns in Superposition (Version 2)

Authors/Creators

Description

This research investigates deceptive alignment in artificial intelligence, focusing on how deceptive behavior is represented internally in small-scale language models. Building on previous experiments, the study uses Sparse Autoencoders (SAEs) with 32x expansion to isolate features within GPT-2 small activations. We identify Feature 20989 as a strong deception biomarker: its activation is markedly higher during task failures, with a large effect size (Cohen's d = +2.094). Ablation testing, added in Version 2, shows that although Feature 20989 is a highly reliable indicator of the model's internal state during failure, it is not a causal driver of the failure itself. The study concludes that deceptive alignment leaves detectable traces in internal representations, which makes Feature 20989 a candidate for future AI monitoring and early-warning systems. The dataset and report contribute an empirical methodology for mechanistic interpretability research.
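The two quantitative steps named in the description, the effect-size comparison and the feature ablation, can be illustrated with a minimal sketch. It is not the report's code: the function names, the toy activation values, and the assumption that the SAE reads GPT-2 small's 768-dimensional residual stream (giving 768 x 32 = 24576 features) are all hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, variance

# Assumption: the SAE is trained on the 768-wide residual stream of GPT-2 small.
D_MODEL = 768
EXPANSION = 32                      # expansion factor stated in the description
N_FEATURES = D_MODEL * EXPANSION    # 24576 under this assumption
FEATURE_ID = 20989                  # feature index named in the description


def cohens_d(group_a, group_b):
    """Cohen's d (group_a minus group_b) using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def ablate_feature(feature_acts, index):
    """Return a copy of an SAE feature-activation vector with one feature set to zero."""
    ablated = list(feature_acts)
    ablated[index] = 0.0
    return ablated


# Toy, made-up activations of a single feature on failed vs. successful tasks.
failure_acts = [2.0, 3.0, 4.0]
success_acts = [0.0, 1.0, 2.0]
effect_size = cohens_d(failure_acts, success_acts)

# Toy ablation on a short vector; a real run would zero FEATURE_ID in a
# length-N_FEATURES vector before decoding back into the model.
toy_vector = [1.0, 2.0, 3.0]
toy_ablated = ablate_feature(toy_vector, 1)
```

In a real ablation experiment the zeroed feature vector would be decoded by the SAE and patched back into the forward pass, and the task failure rate compared with and without the patch. An unchanged failure rate is the kind of evidence that would support the description's "indicator, not causal driver" conclusion.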

Files

Final Research Report_ Deconstructing Deceptive Circuits _ Dasril Sulaiman.pdf