
Published October 29, 2025 | Version v5
Preprint | Open Access

Layer-0 Suppressors Ground Hallucination Inevitability: A Mechanistic Account of How Transformers Trade Factuality for Hedging

Creators

  • Independent Researcher

Description

Layer-0 "suppressor" heads encode why language models trade factuality for hedging under uncertainty. In GPT-2 Medium, ablating heads {0:2, 0:4, 0:7} increases logit-difference (ΔLD) by 0.40–0.85 across factual, negation, counterfactual, and logic probes, and improves calibration (ECE 0.122 → 0.091). Path patching shows ≈67% of head 0:2's effect is mediated by the Layer-0 → Layer-11 residual pathway, establishing a stable hedging attractor. New: direct geometric measurements reveal that suppressors flatten output distributions (ΔH = −2.4 to −3.8 nats; p < 0.02 vs. random controls) and reduce early-layer trajectory curvature across all four tasks, consistent with information-theoretic constraint. Mistral-7B exhibits an architecture-adapted variant (heads 0:22/0:23) with task-contingent anti-suppressor behavior. We include multi-seed runs with bootstrap CIs over prompts, span-aware multi-token metrics, random head baselines at >99th percentile, and a minimal OV-steering intervention that smoothly modulates ΔLD/ECE without harming non-target probes. Analysis connects suppressors to training objectives: pure next-token pretraining induces baseline suppression; we propose that RLHF and Constitutional AI intensify rather than eliminate them. Scope: decoder-only models, short prompts, deterministic Mac MPS hardware with replication roadmap for CUDA backends.

Files

main.pdf (409.2 kB)
md5:a1e622234de4d547c65e5323d078f05f

Additional details

Related works

Cites
Preprint: arXiv:2509.04664 (arXiv)
Preprint: arXiv:2311.14648 (arXiv)

Software

Repository URL: https://github.com/Mat-Tom-Son/tinyLab
Programming language: Python
Development Status: Active

References

  • Aghajanyan, A., Shrivastava, S., Gupta, A., Goyal, N., Zettlemoyer, L., & Gupta, S. (2021). Better Fine-Tuning by Reducing Representational Collapse. In International Conference on Learning Representations (ICLR).
  • Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
  • Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread.
  • Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
  • Hanna, M., Liu, O., & Variengien, A. (2023). How does GPT-2 compute greater-than? Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS).
  • Heimersheim, S., & Nanda, N. (2024). How to Use and Interpret Activation Patching. Alignment Forum.
  • Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., & Kaplan, J. (2022). Language Models (Mostly) Know What They Know. arXiv preprint arXiv:2207.05221.
  • Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why language models hallucinate. arXiv preprint arXiv:2509.04664.
  • Kalai, A. T., & Vempala, S. S. (2023). Calibrated Language Models Must Hallucinate. arXiv preprint arXiv:2311.14648.
  • Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL).
  • McDougall, C., Conmy, A., Rushing, C., McGrath, T., & Nanda, N. (2024). Copy Suppression: Comprehensively Understanding an Attention Head. In BlackboxNLP Workshop.
  • Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS).
  • Mistral AI. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.
  • Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom In: An Introduction to Circuits. Distill.
  • Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread.
  • Quirke, P., Barez, F., Mendelsohn, R., Sheshadri, A., Jermyn, A., & Nanda, N. (2024). Understanding Addition in Transformers. In International Conference on Learning Representations (ICLR).
  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Technical Report.
  • Valeriani, D., Ciliberto, C., & Gales, M. (2023). Geometry of the Loss Landscape in Overparameterized Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS).
  • Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 small. In International Conference on Learning Representations (ICLR).