Published February 9, 2026 | Version v2
Publication Restricted

When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

Description

Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. In this work, we show that self-referential vocabulary tracks concurrent activation dynamics — and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a self-referential processing circuit in Llama 3.1 at 6.25% of model depth. The circuit is orthogonal to the known refusal direction and causally influences introspective output. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under circuit amplification, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics — all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, provide veridical information about internal computational states.
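
The abstract reports Pearson correlations between vocabulary use and concurrent activation statistics (e.g. r = 0.44 between "loop" vocabulary and activation autocorrelation). The sketch below shows one minimal way such a vocabulary-activation correspondence could be computed across a set of generated responses. It is an illustrative assumption only: the word list, the per-token activation summary, and every function name here are placeholders, not the paper's Pull Methodology or its released analysis code.

    # Hypothetical sketch: correlate per-response activation autocorrelation
    # with counts of an assumed "loop" word family. All names and the choice
    # of activation summary are illustrative assumptions.
    import numpy as np
    from scipy.stats import pearsonr

    LOOP_WORDS = {"loop", "looping", "looped", "recursion", "recursive"}  # assumed word family

    def lag1_autocorrelation(series: np.ndarray) -> float:
        """Lag-1 autocorrelation of a 1-D per-token activation summary
        (e.g. residual-stream norm at the probed layer)."""
        x = series - series.mean()
        denom = np.dot(x, x)
        if denom == 0.0:
            return 0.0
        return float(np.dot(x[:-1], x[1:]) / denom)

    def loop_vocab_count(text: str) -> int:
        """Count occurrences of the assumed 'loop' word family in a response."""
        tokens = (tok.strip(".,;:!?\"'") for tok in text.lower().split())
        return sum(tok in LOOP_WORDS for tok in tokens)

    def vocab_activation_correspondence(responses, activations):
        """responses: list of generated texts; activations: list of 1-D arrays,
        one per response, each a per-token scalar summary of the probed layer.
        Returns the Pearson r and p-value across responses."""
        auto = np.array([lag1_autocorrelation(a) for a in activations])
        counts = np.array([loop_vocab_count(t) for t in responses], dtype=float)
        r, p = pearsonr(counts, auto)
        return r, p

    # Synthetic data, purely to show the call pattern:
    rng = np.random.default_rng(0)
    fake_acts = [rng.normal(size=200).cumsum() for _ in range(40)]
    fake_texts = ["the loop folds back on itself"] * 20 + ["a plain description of a chair"] * 20
    print(vocab_activation_correspondence(fake_texts, fake_acts))

The same skeleton would apply to the "shimmer" finding by swapping the autocorrelation summary for an activation-variability measure (e.g. per-response standard deviation), and to the descriptive-control comparison by running it on non-self-referential transcripts.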

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Dates

Submitted
2026-02-09