Published October 29, 2025 | Version v6
Preprint (Open Access)

Layer-0 Suppressors Ground Hallucination Inevitability: A Mechanistic Account of How Transformers Trade Factuality for Hedging

Creators

  • Independent Researcher

Description

Layer‑0 “suppressor” heads mechanistically instantiate why language models trade factuality for hedging under uncertainty. In GPT‑2 Medium, ablating a small layer‑0 coalition (heads 0:2, 0:4, 0:7) reliably improves factual preference (ΔLD ≈ +0.40–+0.85 across factual, negation, counterfactual, and logic probes), sharpens calibration (ECE 0.122→0.091, Brier and NLL also improve), and yields consistent gains on span‑aware multi‑token metrics. Causal path patching shows ≈67% of head 0:2’s effect is mediated by a specific Layer‑0→Layer‑11 residual pathway, supporting an early‑layer hedging “attractor” that downstream layers do not fully undo. Direct geometric measurements strengthen this picture: suppressors dramatically flatten output distributions (ΔH = −2.4 to −3.8 nats, p < 0.02 vs. random layer‑0 controls) and reduce early trajectory curvature at layer‑0, while activation‑space entropies exhibit an expansion‑and‑rotation pattern rather than naive compression. The motif generalizes: Mistral‑7B learns an architecture‑adapted suppressor pair (heads 0:22/0:23) opposed by an anti‑suppressor on logic tasks, and a Pythia variance‑dampening probe finds strong early‑layer dampeners concentrated at the first bottleneck. Methodologically, we follow a prediction‑first bottleneck argument plus Kalai et al.’s hallucination inevitability theorem, then validate with random head baselines (>99th‑percentile tails), multi‑seed bootstrap CIs, CUDA replication for GPT‑2, and a minimal OV‑direction steering intervention that smoothly modulates ΔLD/ECE without degrading non‑target probes. We argue that pure next‑token pretraining already induces layer‑0 suppression and predict that RLHF/Constitutional‑AI style objectives intensify and reorient (rather than remove) these circuits, highlighting early‑layer suppressors as constrained, geometry‑driven solutions and a natural target for future training‑time control experiments.
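The two headline metrics in the abstract can be illustrated with a minimal, self-contained sketch. This is not the paper's code and the probe data below is synthetic; it only shows the standard definitions assumed here: ΔLD as the logit of the factual continuation minus the logit of the hedging alternative, averaged over probe prompts, and ECE as the bin-weighted gap between accuracy and confidence under equal-width binning.

```python
import numpy as np

def logit_diff(logits, fact_id, hedge_id):
    """Delta-LD: mean over prompts of logit(factual token) - logit(hedging token).

    logits: array of shape (n_prompts, vocab_size) of final-position logits.
    """
    logits = np.asarray(logits, dtype=float)
    return float(np.mean(logits[:, fact_id] - logits[:, hedge_id]))

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin weight) * |accuracy - confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Weight each bin's accuracy/confidence gap by its share of samples.
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)

# Synthetic example: 4 probe prompts, 3-token vocabulary,
# token 0 = factual continuation, token 1 = hedging alternative.
logits = [[2.0, 1.0, 0.0],
          [1.5, 0.5, 0.0],
          [0.8, 1.2, 0.0],
          [2.5, 0.5, 0.0]]
print(logit_diff(logits, fact_id=0, hedge_id=1))  # mean of [1.0, 1.0, -0.4, 2.0] ≈ 0.9
```

In this framing, an ablation "improving factual preference by ΔLD ≈ +0.40" means the post-ablation `logit_diff` exceeds the baseline by that margin on the same probe set, and "ECE 0.122→0.091" is the same `expected_calibration_error` computed before and after ablation.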

Files (453.4 kB)

  • main.pdf — 453.4 kB (md5:037622954d8bfd86b8d9d7e081c1e111)

Additional details

Related works

Cites
Preprint: arXiv:2509.04664 (arXiv)
Preprint: arXiv:2311.14648 (arXiv)

Software

Repository URL
https://github.com/Mat-Tom-Son/tinyLab
Programming language
Python
Development Status
Active

References

  • Aghajanyan, A., Shrivastava, S., Gupta, A., Goyal, N., Zettlemoyer, L., & Gupta, S. (2021). Better Fine-Tuning by Reducing Representational Collapse. In International Conference on Learning Representations (ICLR).
  • Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
  • Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread.
  • Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
  • Hanna, M., Liu, O., & Variengien, A. (2023). How does GPT-2 compute greater-than? Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS).
  • Heimersheim, S., & Nanda, N. (2024). How to Use and Interpret Activation Patching. Alignment Forum.
  • Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., & Kaplan, J. (2022). Language Models (Mostly) Know What They Know. arXiv preprint arXiv:2207.05221.
  • Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why language models hallucinate. arXiv preprint arXiv:2509.04664.
  • Kalai, A. T., & Vempala, S. S. (2023). Calibrated Language Models Must Hallucinate. arXiv preprint arXiv:2311.14648.
  • Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL).
  • McDougall, C., Conmy, A., Rushing, C., McGrath, T., & Nanda, N. (2024). Copy Suppression: Comprehensively Understanding an Attention Head. In BlackboxNLP Workshop.
  • Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS).
  • Mistral AI. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.
  • Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom In: An Introduction to Circuits. Distill.
  • Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread.
  • Quirke, P., Barez, F., Mendelsohn, R., Sheshadri, A., Jermyn, A., & Nanda, N. (2024). Understanding Addition in Transformers. In International Conference on Learning Representations (ICLR).
  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Technical Report.
  • Valeriani, D., Ciliberto, C., & Gales, M. (2023). Geometry of the Loss Landscape in Overparameterized Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS).
  • Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 small. In International Conference on Learning Representations (ICLR).