Deconstructing Deceptive Circuits: Uncovering Activation Patterns in Superposition to Ensure Permanent Machine Ethics

Sulaiman, Dasril

doi:10.5281/zenodo.20540116

Published June 4, 2026 | Version v3

Technical note | Open

This research investigates the mechanisms underlying model failures in Large Language Models (LLMs) through the lens of mechanistic interpretability. Initially motivated by the hypothesis that deceptive alignment could be detected through internal activation patterns, the study uses Sparse Autoencoders (SAEs) with a 32x expansion factor to decompose the internal states of GPT-2 Small. Across four experiments, we identified Feature 20989 as a robust biomarker strongly correlated with task failure (Cohen's $$d=+2.094$$). However, subsequent ablation testing revealed that this feature is a correlate of failure rather than a causal mechanism for deception. A deep-dive analysis via Neuronpedia provides a definitive mechanistic explanation: Feature 20989 acts as a polysemantic detector for the concepts "KEY" and "PASSING/TRANSFER," revealing that the model failures are driven by lexical ambiguity and superposition-induced computational bottlenecks. These findings suggest a critical paradigm shift in AI safety: what appears to be "deceptive intent" is, in fact, an emergent property of models struggling to disambiguate information under representational constraints. We propose that future alignment research pivot from "deception monitoring" to "architectural de-bottlenecking" and "disambiguation steering" to improve model reliability and ethical performance.
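The effect size quoted in the abstract can be illustrated with a minimal sketch of Cohen's d. The abstract does not say which variant was used, so the pooled-standard-deviation form is an assumption here, and the function name `cohens_d` and the activation values are hypothetical placeholders rather than data from the report.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation.

    Assumption: the pooled-SD variant; the source does not specify one.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical activations of one SAE feature on failed vs. successful prompts.
failure_acts = [4.0, 5.0, 6.0]
success_acts = [1.0, 2.0, 3.0]
d = cohens_d(failure_acts, success_acts)
print(f"Cohen's d = {d:+.3f}")
```

A positive d means the feature fires more strongly on the first group (here, failures). As the abstract stresses, a large effect size only establishes correlation, so causal claims still need an intervention such as ablation.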

Files

Final Research Report_ Deconstructing Deceptive Circuits _ Dasril Sulaiman Full.pdf (2.0 MB, md5:3929a7487da76d7213af0aac25a251e1)

