Published June 4, 2026 | Version v3

Deconstructing Deceptive Circuits: Uncovering Activation Patterns in Superposition to Ensure Permanent Machine Ethics

Authors/Creators

Description

This research investigates the mechanisms underlying model failures in Large Language Models (LLMs) through the lens of mechanistic interpretability. Initially motivated by the hypothesis that deceptive alignment could be detected through internal activation patterns, the study uses Sparse Autoencoders (SAEs) with a 32x expansion factor to decompose the internal states of GPT-2 Small. Across four experiments, we identified Feature 20989 as a robust marker strongly correlated with task failure (Cohen's $d = +2.094$). However, subsequent ablation testing showed that this feature is a correlate of failure rather than a causal mechanism for deception. A follow-up analysis via Neuronpedia offers a mechanistic explanation: Feature 20989 acts as a polysemantic detector for the concepts "KEY" and "PASSING/TRANSFER," indicating that the observed failures are driven by lexical ambiguity and superposition-induced computational bottlenecks. These findings suggest a shift in emphasis for AI safety: what appears to be "deceptive intent" may instead be an emergent property of models struggling to disambiguate information under representational constraints. We propose that future alignment research pivot from "deception monitoring" to "architectural de-bottlenecking" and "disambiguation steering" to improve model reliability and ethical performance.
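
For readers unfamiliar with the effect size quoted above, the following is a minimal sketch of how Cohen's d can be computed for one SAE feature's activations on failed versus successful trials. It assumes the common pooled-standard-deviation convention, which the description does not confirm as the one used in the report. The function name `cohens_d` and the activation values are hypothetical and chosen only for illustration; they are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled sample standard deviation.

    Assumption: pooled-variance convention with Bessel-corrected
    sample variances; the report may use a different variant.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")

    # Weighted average of the two sample variances.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")

    # A positive d means group_a has the higher mean.
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical feature activations (illustrative values only).
failure_acts = [2.0, 4.0, 6.0]   # trials where the model failed the task
success_acts = [1.0, 3.0, 5.0]   # trials where the model succeeded

d = cohens_d(failure_acts, success_acts)
print(f"Cohen's d = {d:+.3f}")
```

Under this convention, a positive value such as the reported $d = +2.094$ would mean the feature is on average more active on failed trials, by roughly two pooled standard deviations. An effect size of this kind describes correlation only, which is consistent with the ablation result stated above.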

Files

Final Research Report_ Deconstructing Deceptive Circuits _ Dasril Sulaiman Full.pdf