Layer-0 Suppressors Ground Hallucination Inevitability: A Mechanistic Account of How Transformers Trade Factuality for Hedging
Description
Layer‑0 “suppressor” heads mechanistically instantiate why language models trade factuality for hedging under uncertainty. In GPT‑2 Medium, ablating a small layer‑0 coalition (heads 0:2, 0:4, 0:7) reliably improves factual preference (ΔLD ≈ +0.40–+0.85 across factual, negation, counterfactual, and logic probes), sharpens calibration (ECE 0.122→0.091, with Brier score and NLL also improving), and yields consistent gains on span‑aware multi‑token metrics. Causal path patching shows ≈67% of head 0:2’s effect is mediated by a specific Layer‑0→Layer‑11 residual pathway, supporting an early‑layer hedging “attractor” that downstream layers do not fully undo.

Direct geometric measurements strengthen this picture: suppressors dramatically flatten output distributions (ΔH = −2.4 to −3.8 nats, p < 0.02 vs. random layer‑0 controls) and reduce early trajectory curvature at layer 0, while activation‑space entropies exhibit an expansion‑and‑rotation pattern rather than naive compression. The motif generalizes: Mistral‑7B learns an architecture‑adapted suppressor pair (heads 0:22/0:23) opposed by an anti‑suppressor on logic tasks, and a Pythia variance‑dampening probe finds strong early‑layer dampeners concentrated at the first bottleneck.

Methodologically, we follow a prediction‑first bottleneck argument plus Kalai et al.’s hallucination inevitability theorem, then validate with random head baselines (>99th‑percentile tails), multi‑seed bootstrap CIs, CUDA replication for GPT‑2, and a minimal OV‑direction steering intervention that smoothly modulates ΔLD/ECE without degrading non‑target probes. We argue that pure next‑token pretraining already induces layer‑0 suppression and predict that RLHF/Constitutional‑AI style objectives intensify and reorient (rather than remove) these circuits, highlighting early‑layer suppressors as constrained, geometry‑driven solutions and a natural target for future training‑time control experiments.
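To make the calibration claim concrete (ECE 0.122→0.091), the standard binned expected calibration error can be sketched as below. This is a minimal illustration, not the paper's evaluation harness; the bin count and the toy confidence/correctness pairs are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - mean confidence| over bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece

# Toy data: an overconfident model (high confidence, mixed correctness).
confs = [0.9, 0.9, 0.8, 0.7, 0.6, 0.95]
hits  = [1,   0,   1,   1,   0,   1]
print(round(expected_calibration_error(confs, hits), 3))  # prints 0.308
```

A drop in ECE, as reported after ablating the suppressor coalition, means the per‑bin gap between stated confidence and realized accuracy shrinks.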
Files
| Name | Size |
|---|---|
| main.pdf (md5:037622954d8bfd86b8d9d7e081c1e111) | 453.4 kB |
Additional details
Related works
- Cites
- Preprint: arXiv:2509.04664 (arXiv)
- Preprint: arXiv:2311.14648 (arXiv)
Software
- Repository URL
- https://github.com/Mat-Tom-Son/tinyLab
- Programming language
- Python
- Development Status
- Active
References
- Aghajanyan, A., Shrivastava, S., Gupta, A., Goyal, N., Zettlemoyer, L., & Gupta, S. (2021). Better Fine-Tuning by Reducing Representational Collapse. In International Conference on Learning Representations (ICLR).
- Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
- Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread.
- Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
- Hanna, M., Liu, O., & Variengien, A. (2023). How does GPT-2 compute greater-than? Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS).
- Heimersheim, S., & Nanda, N. (2024). How to Use and Interpret Activation Patching. Alignment Forum.
- Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., & Kaplan, J. (2022). Language Models (Mostly) Know What They Know. arXiv preprint arXiv:2207.05221.
- Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why language models hallucinate. arXiv preprint arXiv:2509.04664.
- Kalai, A. T., & Vempala, S. S. (2023). Calibrated Language Models Must Hallucinate. arXiv preprint arXiv:2311.14648.
- Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL).
- McDougall, C., Conmy, A., Rushing, C., McGrath, T., & Nanda, N. (2024). Copy Suppression: Comprehensively Understanding an Attention Head. In BlackboxNLP Workshop.
- Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS).
- Mistral AI. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.
- Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom In: An Introduction to Circuits. Distill.
- Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread.
- Quirke, P., Barez, F., Mendelsohn, R., Sheshadri, A., Jermyn, A., & Nanda, N. (2024). Understanding Addition in Transformers. In International Conference on Learning Representations (ICLR).
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Technical Report.
- Valeriani, D., Ciliberto, C., & Gales, M. (2023). Geometry of the Loss Landscape in Overparameterized Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS).
- Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 small. In International Conference on Learning Representations (ICLR).