Published October 30, 2025 | Version 1.0 | Preprint | Open
                  Layer-0 Suppressors Ground Hallucination Inevitability: A Mechanistic Account of How Transformers Trade Factuality for Hedging
Description
Layer-0 suppressor circuits mechanistically expose why language models trade factuality for hedging. Across four single-token probes, zeroing GPT-2 Medium heads {0:2, 0:4, 0:7} raises the logit difference by 0.40–0.85 and improves expected calibration error from 0.122 to 0.091. Path patching reveals that 67% of the head 0:2 effect is mediated by the suppressor→layer-11 residual-stream path, aligning the causal structure with the hallucination inevitability theorem of Kalai et al. (2025). The same analysis recovers an architecture-adapted variant of the circuit in Mistral-7B. These results bridge statistical incentives and concrete circuits, motivating suppressor-aware interventions for truthful model behavior.
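The abstract's headline numbers come from zero-ablating three layer-0 attention heads and re-measuring a factual-versus-hedge logit difference. As a minimal sketch of that measurement (not the paper's code, which lives in the tinyLab repository linked below), the TransformerLens snippet here zeroes heads 0:2, 0:4, and 0:7 in GPT-2 Medium; the prompt and the two contrast tokens are illustrative stand-ins for the paper's four single-token probes.

```python
from transformer_lens import HookedTransformer

# GPT-2 Medium, the model analyzed in the paper.
model = HookedTransformer.from_pretrained("gpt2-medium")

# Hypothetical single-token probe: contrast a factual completion with a
# hedging one. The paper's actual probes are in the linked repository.
prompt = "The capital of France is"
factual_tok = model.to_single_token(" Paris")
hedge_tok = model.to_single_token(" perhaps")

SUPPRESSOR_HEADS = [2, 4, 7]  # heads 0:2, 0:4, 0:7 of layer 0

def zero_suppressors(z, hook):
    # z: [batch, pos, n_heads, d_head]. Zeroing a head's z zeroes its
    # entire contribution to the residual stream, since W_O acts per head.
    z[:, :, SUPPRESSOR_HEADS, :] = 0.0
    return z

def logit_diff(logits):
    # Factual-minus-hedge logit difference at the final position.
    return (logits[0, -1, factual_tok] - logits[0, -1, hedge_tok]).item()

clean_logits = model(prompt)
ablated_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[("blocks.0.attn.hook_z", zero_suppressors)],
)

print(f"clean logit diff:   {logit_diff(clean_logits):+.3f}")
print(f"ablated logit diff: {logit_diff(ablated_logits):+.3f}")
```

Zeroing `hook_z` removes each head's output before the output projection, which is one standard reading of "zeroing heads"; calibration metrics such as the reported ECE change would then be computed over the full probe set rather than a single prompt.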
Files
| Name | Size | Checksum |
|---|---|---|
| main.pdf | 300.7 kB | md5:71eb7a21e4e59ae7489ef68a40cde41d |
Additional details
Related works
- Cites:
  - Preprint: arXiv:2509.04664 (arXiv)
  - Preprint: arXiv:2311.14648 (arXiv)
Software
- Repository URL: https://github.com/Mat-Tom-Son/tinyLab
- Programming language: Python
- Development status: Active
References
        - Aghajanyan, A., Shrivastava, S., Gupta, A., Goyal, N., Zettlemoyer, L., & Gupta, S. (2021). Better Fine-Tuning by Reducing Representational Collapse. In International Conference on Learning Representations (ICLR).
- Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
- Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread.
- Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
- Hanna, M., Liu, O., & Variengien, A. (2023). How does GPT-2 compute greater-than? Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS).
- Heimersheim, S., & Nanda, N. (2024). How to Use and Interpret Activation Patching. Alignment Forum.
- Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., & Kaplan, J. (2022). Language Models (Mostly) Know What They Know. arXiv preprint arXiv:2207.05221.
- Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why language models hallucinate. arXiv preprint arXiv:2509.04664.
- Kalai, A. T., & Vempala, S. S. (2023). Calibrated Language Models Must Hallucinate. arXiv preprint arXiv:2311.14648.
- Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL).
- McDougall, C., Conmy, A., Rushing, C., McGrath, T., & Nanda, N. (2024). Copy Suppression: Comprehensively Understanding an Attention Head. In BlackboxNLP Workshop.
- Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS).
- Mistral AI. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.
- Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom In: An Introduction to Circuits. Distill.
- Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread.
- Quirke, P., Barez, F., Mendelsohn, R., Sheshadri, A., Jermyn, A., & Nanda, N. (2024). Understanding Addition in Transformers. In International Conference on Learning Representations (ICLR).
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Technical Report.
- Valeriani, D., Ciliberto, C., & Gales, M. (2023). Geometry of the Loss Landscape in Overparameterized Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS).
- Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 small. In International Conference on Learning Representations (ICLR).