Published June 7, 2026 | Version v2

What Outermost Layers Do: A Cross-Architecture Mechanistic Study of Trained Transformers

  • 1. Independent Researcher

Description

Trained transformers process language through stacks of structurally identical layers, yet those layers do not behave identically. The first and last few layers appear to do something qualitatively different from the layers in between, but what exactly they do has remained unclear. We characterize the outermost layers of three pretrained models (DistilBERT, BERT, and GPT-2) using geometric metrics, reconstructibility from context, and a combination of linear probing and causal ablation, with hypotheses pre-registered before any numbers were extracted. We report three findings. First, a sandwich pattern generalizes across the three architectures: a compositional core absorbs additional depth, while the outermost "translator" regions at entry and exit stay nearly fixed in size. Second, the entry and exit translators operate in opposite directions in the encoders (DistilBERT, BERT) compared with the decoder (GPT-2). Third, the dominant principal direction of GPT-2's final layer, which captures roughly 35% of total variance, is orthogonal to part-of-speech, lexical, positional, and sentiment information. We close with observations on how these layer-wise differences relate to open questions about cross-model representation sharing.
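One common way to quantify a claim like "the dominant principal direction captures roughly 35% of variance yet is orthogonal to a probed property" uses two quantities: the fraction of total variance explained by the top principal component (its covariance eigenvalue divided by the covariance trace) and the absolute cosine between that direction and a linear probe's weight vector. The sketch below computes both in plain Python, assuming activations arrive as a list of per-token vectors and that a probe weight vector is already available. The function names, the toy activations, and the probe vector are hypothetical illustrations, not taken from the paper, whose exact procedure may differ.

```python
import math
import random


def center(rows):
    """Subtract the per-dimension mean from each row (rows: equal-length lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Population covariance matrix (divides by n) of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_principal_direction(rows, iters=500, seed=0):
    """Return (unit top eigenvector, fraction of total variance it explains).

    Power iteration on the covariance matrix; assumes the largest eigenvalue
    is strictly larger than the second, otherwise it may not converge.
    """
    cov = covariance(center(rows))
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in range(len(cov))])
    for _ in range(iters):
        v = normalize(matvec(cov, v))
    top_eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))  # Rayleigh quotient
    total_variance = sum(cov[i][i] for i in range(len(cov)))  # trace = sum of eigenvalues
    return v, top_eigenvalue / total_variance


def abs_cosine(a, b):
    """|cos| between two directions; an eigenvector's sign is arbitrary."""
    dot = sum(x * y for x, y in zip(a, b))
    return abs(dot) / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical toy "activations": large variance on axis 0, small on axis 1.
acts = [[2.0, 1.0], [-2.0, 1.0], [2.0, -1.0], [-2.0, -1.0]]
probe_w = [0.0, 1.0]  # hypothetical probe weight vector reading axis 1
pc1, frac = top_principal_direction(acts)
print(round(frac, 3))                     # 0.8
print(round(abs_cosine(pc1, probe_w), 3))  # 0.0
```

Power iteration is used here only to keep the sketch dependency-free; with NumPy, calling `numpy.linalg.eigh` on the covariance matrix would be the usual choice. An absolute cosine near zero indicates that the probe direction is orthogonal to the top principal direction. For a multi-class probe such as part-of-speech, one would check each class's weight vector, or the whole probe weight subspace, rather than a single vector.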

Files (770.6 kB)

  • Paper_v1.pdf (770.6 kB, md5:3c99d2428d27bd9b2d7de2fb384806f8)

Additional details

Related works