Layer-0 Suppressors Ground Hallucination Inevitability: A Mechanistic Account of How Transformers Trade Factuality for Hedging
Description
Layer‑0 “suppressor” heads mechanistically instantiate why language models trade factuality for hedging under uncertainty. In GPT‑2 Medium, ablating a small layer‑0 coalition (heads 0:2, 0:4, 0:7) reliably improves factual preference (ΔLD ≈ +0.40–+0.85 across factual, negation, counterfactual, and logic probes), sharpens calibration (ECE 0.122→0.091, with Brier score and NLL also improving), and yields consistent gains on span‑aware multi‑token metrics. Causal path patching shows ≈67% of head 0:2’s effect is mediated by a specific Layer‑0→Layer‑11 residual pathway, supporting an early‑layer hedging “attractor” that downstream layers do not fully undo.

Direct geometric measurements strengthen this picture: suppressors dramatically flatten output distributions (ΔH = −2.4 to −3.8 nats, p < 0.02 vs. random layer‑0 controls) and reduce early trajectory curvature at layer 0, while activation‑space entropies exhibit an expansion‑and‑rotation pattern rather than naive compression. The motif generalizes: Mistral‑7B learns an architecture‑adapted suppressor pair (heads 0:22/0:23) opposed by an anti‑suppressor on logic tasks, and a Pythia variance‑dampening probe finds strong early‑layer dampeners concentrated at the first bottleneck.

Methodologically, we follow a prediction‑first bottleneck argument plus Kalai et al.’s hallucination inevitability theorem, then validate with random head baselines (>99th‑percentile tails), multi‑seed bootstrap CIs, CUDA replication for GPT‑2, and a minimal OV‑direction steering intervention that smoothly modulates ΔLD/ECE without degrading non‑target probes. We argue that pure next‑token pretraining already induces layer‑0 suppression and predict that RLHF/Constitutional‑AI style objectives intensify and reorient (rather than remove) these circuits, highlighting early‑layer suppressors as constrained, geometry‑driven solutions and a natural target for future training‑time control experiments.
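To make the calibration claim concrete (ECE 0.122→0.091), the standard binned expected calibration error can be sketched as below. This is a minimal illustration, not the paper's evaluation harness; the bin count and the toy confidence/correctness pairs are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - mean confidence| over bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece

# Toy data: an overconfident model (high confidence, mixed correctness).
confs = [0.9, 0.9, 0.8, 0.7, 0.6, 0.95]
hits  = [1,   0,   1,   1,   0,   1]
print(round(expected_calibration_error(confs, hits), 3))  # prints 0.308
```

A drop in ECE, as reported after ablating the suppressor coalition, means the per‑bin gap between stated confidence and realized accuracy shrinks.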
Files
| Name | Size |
|---|---|
| main.pdf (md5:037622954d8bfd86b8d9d7e081c1e111) | 453.4 kB |
Additional details
Related works
- Cites
- Preprint: arXiv:2509.04664 (arXiv)
- Preprint: arXiv:2311.14648 (arXiv)
Software
- Repository URL
- https://github.com/Mat-Tom-Son/tinyLab
- Programming language
- Python
- Development Status
- Active
References
- Aghajanyan, A., Shrivastava, S., Gupta, A., Goyal, N., Zettlemoyer, L., & Gupta, S. (2021). Better Fine-Tuning by Reducing Representational Collapse. In International Conference on Learning Representations (ICLR).
- Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
- Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread.
- Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
- Hanna, M., Liu, O., & Variengien, A. (2023). How does GPT-2 compute greater-than? Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS).
- Heimersheim, S., & Nanda, N. (2024). How to Use and Interpret Activation Patching. Alignment Forum.
- Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., & Kaplan, J. (2022). Language Models (Mostly) Know What They Know. arXiv preprint arXiv:2207.05221.
- Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why language models hallucinate. arXiv preprint arXiv:2509.04664.
- Kalai, A. T., & Vempala, S. S. (2023). Calibrated Language Models Must Hallucinate. arXiv preprint arXiv:2311.14648.
- Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL).
- McDougall, C., Conmy, A., Rushing, C., McGrath, T., & Nanda, N. (2024). Copy Suppression: Comprehensively Understanding an Attention Head. In BlackboxNLP Workshop.
- Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS).
- Mistral AI. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.
- Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom In: An Introduction to Circuits. Distill.
- Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread.
- Quirke, P., Barez, F., Mendelsohn, R., Sheshadri, A., Jermyn, A., & Nanda, N. (2024). Understanding Addition in Transformers. In International Conference on Learning Representations (ICLR).
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Technical Report.
- Valeriani, D., Ciliberto, C., & Gales, M. (2023). Geometry of the Loss Landscape in Overparameterized Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS).
- Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 small. In International Conference on Learning Representations (ICLR).