The Readout Regime: A Normal Form for Final-Residual Control of Frozen Transformers, and Its Capacity Limits
Authors/Creators
Description
Inference-time interventions on a frozen transformer — steering behavior, injecting facts,
suppressing outputs — are not interchangeable: where an additive intervention acts splits them
into two regimes of sharply different expressive power, and we give the exact theory of one.
An intervention that writes into the final residual stream (the unembedding/readout space;
installing a scaled row of the output-projection matrix lm_head is the exactly-characterizable
case) induces on the next-token logits, for every input, the transform z ↦ s(x) ⋅ z + b(x): a
ranking-preserving scalar temperature s(x) > 0 plus a re-ranking bias b(x) confined to a
fixed, low-dimensional set of directions chosen before any input is seen (T1; verified
against direct forward-pass computation to ≈ 5 × 10⁻⁶). The result that matters is a structure
theorem (T2): the input selects only a point in that fixed set and a temperature, so the reachable
re-ranking directions, over all inputs, have affine dimension at most the installed-slot count. In
plain terms: a readout install can re-weight and re-rank the options the model
already has, but cannot synthesize a new answer direction or compute a hidden
routing variable the addressing query does not already expose. We prove the readout
regime’s confinement and cite — not prove — evidence that the representation regime is not so
confined; that direct measurement is the main open item. The corollaries, carefully bounded:
a bounded readout install is not a hard override of a peaked prior, can only tip a decision
the context has already scaffolded near-balanced, and cannot compute a hidden intermediate.
Capacity has two faces, both empirical: across key-disjoint decisions installs compose bit-
exactly at scale (tens of modules, thousands of facts, Δ = 0); within one decision the readout
is winner-take-all (≈ 2 targets co-winnable, against an output projection of entropy-effective
rank ≈ 918). The control reading: tip a propensity in the readout regime — it is auditable, a
removable bias — but place any hard guarantee in a deterministic override outside the model.
The contribution is the boundary, stated precisely.
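
The affine form described above can be checked numerically on a toy model. The sketch below is an illustration under stated assumptions, not the paper's construction: it assumes the final normalization is a gain-free RMSNorm, uses a random matrix `W` in place of lm_head, and invents the slot indices, scales, and softmax addressing keys (`slots`, `alphas`, `keys` are all hypothetical). Under those assumptions, writing gated, scaled rows of `W` into the final residual gives logits equal to s(x) ⋅ z + b(x), and the biases collected over many inputs span at most as many directions as there are installed slots.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 50

# Toy stand-ins (assumptions for illustration, not the paper's setup).
W = rng.normal(size=(vocab, d_model))          # plays the role of lm_head
slots = [4, 17, 31]                            # hypothetical installed rows
alphas = np.array([0.5, -1.0, 2.0])            # hypothetical install scales
keys = rng.normal(size=(len(slots), d_model))  # hypothetical addressing keys


def rms(h):
    """Root-mean-square of a residual vector."""
    return np.sqrt(np.mean(h ** 2))


def logits(h):
    """Gain-free RMSNorm followed by the output projection."""
    return W @ (h / rms(h))


def install(h):
    """Additive write into the final residual: gated, scaled rows of W."""
    scores = keys @ h
    gates = np.exp(scores - scores.max())
    gates /= gates.sum()                # input-dependent slot weights
    return (gates * alphas) @ W[slots]  # vector in R^{d_model}


def decompose(h):
    """Return (s, b) such that logits(h + install(h)) = s * logits(h) + b."""
    delta = install(h)
    denom = rms(h + delta)
    return rms(h) / denom, (W @ delta) / denom


inputs = rng.normal(size=(200, d_model))
biases = []
temperatures = []
max_err = 0.0
for h in inputs:
    s, b = decompose(h)
    direct = logits(h + install(h))
    max_err = max(max_err, np.abs(direct - (s * logits(h) + b)).max())
    temperatures.append(s)
    biases.append(b)

# Every bias lies in the column space of W @ W[slots].T, fixed before any input.
bias_rank = np.linalg.matrix_rank(np.stack(biases), tol=1e-8)
print(f"max |direct - (s*z + b)| = {max_err:.2e}")
print(f"rank of reachable biases = {bias_rank} (slots = {len(slots)})")
```

In this toy setting the temperature comes entirely from the renormalization after the write, and the bias can only move within the span of the installed rows' images under `W`. That is the sense in which the input picks a point in a fixed set and cannot create a new direction.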
Files
| Name | Size | Checksum |
|---|---|---|
| main.pdf | 218.3 kB | md5:6b0215e43ec0982975330040208f9548 |
Additional details
Dates
- Submitted: 2025-06-01
Software
- Repository URL: https://github.com/orbitnate/readout-regime
- Development Status: Active