Published March 13, 2026 | Version v2
Open Access

# We Can Predict Which Layer Will Matter Most for Changing a Model's Next-Token Answer Before Running Any Intervention Sweep

Authors/Creators

Description

Continuous Representations, Discrete Commitment: A Causal Threshold in Decoder-Only LLMs

Correlational and interventional analyses of LLM internals appear to disagree: probes show gradual representational change across depth, while activation patching reveals sharp behavioral transitions. We resolve this by showing the two methods measure different properties.

We perform layerwise residual-stream swaps with paired controls across three decoder-only architectures (GPT-2 Small, Gemma-2-2B, Qwen2.5-1.5B) and find a replicated causal commitment transition at 62–71% of network depth. Below this threshold, swaps produce negligible behavioral change; at or above it, outputs flip immediately, with large margin transfer. The transition is specific to the main intervention (it is not matched by random-norm, self, or position-shuffle controls) and is stable across patch scales and random seeds in the two mid-size models.
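For concreteness, a swap of this kind can be run with standard activation-patching tooling. The sketch below uses TransformerLens on GPT-2 Small; the prompts, the last-token-only patch, and the flip/margin bookkeeping are illustrative assumptions rather than the paper's exact protocol.

```python
# Minimal sketch of a layerwise residual-stream swap (assumed setup,
# not the paper's exact code): splice the source run's residual
# stream into the target run, one layer at a time, and record
# whether the next-token answer flips and by what logit margin.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # GPT-2 Small
src_prompt = "The Eiffel Tower is in the city of"   # source (donor) run
tgt_prompt = "The Colosseum is in the city of"      # target (recipient) run

# Cache the source run once; its residual stream is reused at every layer.
src_logits, src_cache = model.run_with_cache(src_prompt)
src_top = src_logits[0, -1].argmax().item()          # source answer token
clean_top = model(tgt_prompt)[0, -1].argmax().item() # target's unpatched answer

def swap_resid(resid, hook):
    # Overwrite the target's residual stream at this layer with the
    # source's, at the final token position only (an assumption).
    resid[:, -1, :] = src_cache[hook.name][:, -1, :]
    return resid

with torch.no_grad():
    for layer in range(model.cfg.n_layers):
        name = f"blocks.{layer}.hook_resid_post"
        patched = model.run_with_hooks(
            tgt_prompt, fwd_hooks=[(name, swap_resid)]
        )[0, -1]
        flipped = patched.argmax().item() == src_top
        # "Margin transfer": how far the source answer's logit now
        # leads the target's original answer after the swap.
        margin = (patched[src_top] - patched[clean_top]).item()
        print(f"layer {layer:2d} ({layer / model.cfg.n_layers:4.0%} depth): "
              f"flipped={flipped}, margin={margin:+.2f}")
```

Under the reported result, flips would be expected to onset abruptly around 62–71% of depth rather than accumulate gradually across layers.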

Representations evolve continuously. Causal commitment does not. The two findings are compatible once the distinction between representational change and output determination is made explicit.
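The specificity claim rests on the paired controls. As one example, a random-norm control of the kind named above replaces the donor activation with noise of equal magnitude, separating the effect of the source's specific direction from the effect of injecting any vector of that norm. A hedged sketch follows; the helper name is hypothetical.

```python
import torch

def random_norm_control(src_resid: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: return Gaussian noise rescaled to the L2
    # norm of the source residual slice (shape: batch x d_model),
    # for use in place of the real source slice inside the swap hook.
    noise = torch.randn_like(src_resid)
    noise = noise / noise.norm(dim=-1, keepdim=True)
    return noise * src_resid.norm(dim=-1, keepdim=True)
```

By the reported finding, this control produces negligible behavioral change even at threshold layers, whereas the true swap flips the output there.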

Code and evaluation notebooks are available in the companion repository.

Keywords: mechanistic interpretability, activation patching, causal intervention, commitment threshold, decoder-only transformers

Files (1.7 MB)

before_the_lock.pdf (1.7 MB, md5:2558b2fa7f6269e0e63c1c2b07d2b121)