When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
Authors/Creators
Description
Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
Guided repro: https://github.com/patternmatcher/TRACE-REPRO
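The correspondence described above pairs a vocabulary count per response with an activation statistic per response, then correlates the two across responses. The following is a minimal sketch of that kind of analysis, not the authors' pipeline: the function names, the use of lag-1 autocorrelation over a scalar per-token summary (for example, an activation norm), and the whitespace tokenisation are all assumptions made for illustration. The cosine helper shows how orthogonality between two directions could be checked.

```python
import math


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation of a 1-D sequence (assumed scalar per-token summary)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0  # constant series: autocorrelation is undefined, return 0 by convention
    num = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    return num / denom


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def cosine_similarity(u, v):
    """Cosine similarity; a value near 0 indicates near-orthogonal directions."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def count_term(text, term):
    """Count whole-word occurrences of a term (hypothetical whitespace tokenisation)."""
    return text.lower().split().count(term.lower())


def vocabulary_activation_correlation(responses, term):
    """Correlate per-response term counts with per-response lag-1 autocorrelation.

    `responses` is a list of (text, per_token_scalar_series) pairs; this layout
    is an assumption for the sketch, not the format used in the repository.
    """
    counts = [count_term(text, term) for text, _ in responses]
    autocorrs = [lag1_autocorrelation(series) for _, series in responses]
    return pearson_r(counts, autocorrs)
```

Running the same function on descriptive (non-self-referential) responses would give the control comparison the description refers to.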
Files
003-When Models Examine Themselves.pdf
Files (3.7 MB)
| Checksum (md5) | Size |
|---|---|
| fd7f65045953a14a078a48ce8974da40 | 2.5 MB |
| 79925b822bd6c0bc89772b0fc6a4002d | 5.5 kB |
| 11073c58675540a5c029dcc6c3ccad33 | 5.4 kB |
| 85cb8955365caed81903e835cabea6d2 | 6.2 kB |
| c4e3ec6863d593585445641736ae63b8 | 1.5 kB |
| 18d44385b49621fc34a291ccbd4cd8ad | 1.7 kB |
| 3bb0f462c398897ee37a32b571c578d6 | 1.3 kB |
| 03bf74f90f9dd3fe7abf70603ed0ddad | 3.0 kB |
| 84d2d42ef0c020a0e86e743e7f5f842f | 1.8 kB |
| b8c98072168776bab6cc5da2dec2cf52 | 1.8 kB |
| 1d2d602f928defc02e815684182897b8 | 108.4 kB |
| e3e000013a170cc6b24ace14e560dedc | 62.0 kB |
| 4f5fa2afccd42073e398445fd3fbfd73 | 13.0 kB |
| 057f0fd8acaa8edbc65993c3e7842759 | 2.8 kB |
| 6848d485079b53412a8ffd8b5fb25455 | 220.4 kB |
| 536d997ed307a609c0274a70432d46ab | 343.8 kB |
| 68bc16f01561b582c2a718fd17ef69cd | 496.4 kB |
| 14caed0a6be415c6899e0de9dde905ea | 1.8 kB |
| 36be2484dacf60794ea52c31689a5397 | 5.1 kB |
| f2257491f983850f589b4414a649f763 | 8.2 kB |
Additional details
Dates
- Submitted: 2026-02-10
Software
- Repository URL: https://github.com/patternmatcher