Published February 9, 2026 | Version v2
Publication Restricted

When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

Description

Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. In this work, we show that self-referential vocabulary tracks concurrent activation dynamics — and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a self-referential processing circuit in Llama 3.1 at 6.25% of model depth. The circuit is orthogonal to the known refusal direction and causally influences introspective output. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under circuit amplification, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics — all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, provide veridical information about internal computational states.
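The reported vocabulary-activation correspondences can be sketched in miniature. The following is a hypothetical illustration, not the paper's actual pipeline: it computes lag-1 autocorrelation of a per-response activation trace and correlates that metric with the presence of "loop" vocabulary, using synthetic traces constructed so that the relationship holds by design. All names and the data-generation scheme are invented for illustration.

```python
import numpy as np

def lag1_autocorrelation(x: np.ndarray) -> float:
    """Lag-1 autocorrelation of a 1-D activation trace."""
    x = x - x.mean()
    denom = float(np.dot(x, x))
    if denom == 0.0:
        return 0.0
    return float(np.dot(x[:-1], x[1:]) / denom)

def pearson_r(a, b) -> float:
    """Pearson correlation between two equal-length sequences."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-response data: smoother (more autocorrelated)
# traces are labeled as containing "loop" vocabulary, noisier ones are not.
traces, has_loop_word = [], []
for i in range(40):
    loopy = (i % 2 == 0)
    noise = rng.normal(size=200)
    # Moving-average smoothing raises lag-1 autocorrelation.
    trace = np.convolve(noise, np.ones(8) / 8, mode="same") if loopy else noise
    traces.append(trace)
    has_loop_word.append(1.0 if loopy else 0.0)

autocorr = [lag1_autocorrelation(t) for t in traces]
r = pearson_r(autocorr, has_loop_word)
print(f"vocabulary-autocorrelation correlation: r = {r:.2f}")
```

In the paper's setting, the traces would come from a probed layer of the model and the labels from the model's own generated text; the synthetic construction here only shows the shape of the statistic, and the specificity claim would additionally require running the same test on matched non-self-referential controls.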


Additional details

Dates

Submitted
2026-02-09