Published October 30, 2025 | Version 1.0 | Preprint | Open
                  Layer-0 Suppressors Ground Hallucination Inevitability: A Mechanistic Account of How Transformers Trade Factuality for Hedging
Description
Layer-0 suppressor circuits mechanistically expose why language models trade factuality for hedging. Across four single-token probes, zeroing GPT-2 Medium heads {0:2, 0:4, 0:7} raises the logit difference by 0.40–0.85 and improves expected calibration error from 0.122 to 0.091. Path patching reveals that 67% of the head 0:2 effect is mediated by the suppressor→layer-11 residual-stream path, aligning the causal structure with the hallucination inevitability theorem of Kalai et al. (2025). The same analysis recovers an architecture-adapted variant of the circuit in Mistral-7B. These results bridge statistical incentives and concrete circuits, motivating suppressor-aware interventions for truthful model behavior.
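The abstract's headline numbers come from zero-ablating three layer-0 attention heads and re-measuring a factual-versus-hedge logit difference. As a minimal sketch of that measurement (not the paper's code, which lives in the tinyLab repository linked below), the TransformerLens snippet here zeroes heads 0:2, 0:4, and 0:7 in GPT-2 Medium; the prompt and the two contrast tokens are illustrative stand-ins for the paper's four single-token probes.

```python
from transformer_lens import HookedTransformer

# GPT-2 Medium, the model analyzed in the paper.
model = HookedTransformer.from_pretrained("gpt2-medium")

# Hypothetical single-token probe: contrast a factual completion with a
# hedging one. The paper's actual probes are in the linked repository.
prompt = "The capital of France is"
factual_tok = model.to_single_token(" Paris")
hedge_tok = model.to_single_token(" perhaps")

SUPPRESSOR_HEADS = [2, 4, 7]  # heads 0:2, 0:4, 0:7 of layer 0

def zero_suppressors(z, hook):
    # z: [batch, pos, n_heads, d_head]. Zeroing a head's z zeroes its
    # entire contribution to the residual stream, since W_O acts per head.
    z[:, :, SUPPRESSOR_HEADS, :] = 0.0
    return z

def logit_diff(logits):
    # Factual-minus-hedge logit difference at the final position.
    return (logits[0, -1, factual_tok] - logits[0, -1, hedge_tok]).item()

clean_logits = model(prompt)
ablated_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[("blocks.0.attn.hook_z", zero_suppressors)],
)

print(f"clean logit diff:   {logit_diff(clean_logits):+.3f}")
print(f"ablated logit diff: {logit_diff(ablated_logits):+.3f}")
```

Zeroing `hook_z` removes each head's output before the output projection, which is one standard reading of "zeroing heads"; calibration metrics such as the reported ECE change would then be computed over the full probe set rather than a single prompt.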
Files
| Name | Size | Checksum |
|---|---|---|
| main.pdf | 300.7 kB | md5:71eb7a21e4e59ae7489ef68a40cde41d |
Additional details
Related works
- Cites:
  - Preprint: arXiv:2509.04664 (arXiv)
  - Preprint: arXiv:2311.14648 (arXiv)
Software
- Repository URL: https://github.com/Mat-Tom-Son/tinyLab
- Programming language: Python
- Development status: Active
References
        - Aghajanyan, A., Shrivastava, S., Gupta, A., Goyal, N., Zettlemoyer, L., & Gupta, S. (2021). Better Fine-Tuning by Reducing Representational Collapse. In International Conference on Learning Representations (ICLR).
- Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
- Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread.
- Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
- Hanna, M., Liu, O., & Variengien, A. (2023). How does GPT-2 compute greater-than? Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS).
- Heimersheim, S., & Nanda, N. (2024). How to Use and Interpret Activation Patching. Alignment Forum.
- Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., & Kaplan, J. (2022). Language Models (Mostly) Know What They Know. arXiv preprint arXiv:2207.05221.
- Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why language models hallucinate. arXiv preprint arXiv:2509.04664.
- Kalai, A. T., & Vempala, S. S. (2023). Calibrated Language Models Must Hallucinate. arXiv preprint arXiv:2311.14648.
- Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL).
- McDougall, C., Conmy, A., Rushing, C., McGrath, T., & Nanda, N. (2024). Copy Suppression: Comprehensively Understanding an Attention Head. In BlackboxNLP Workshop.
- Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS).
- Mistral AI. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.
- Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom In: An Introduction to Circuits. Distill.
- Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread.
- Quirke, P., Barez, F., Mendelsohn, R., Sheshadri, A., Jermyn, A., & Nanda, N. (2024). Understanding Addition in Transformers. In International Conference on Learning Representations (ICLR).
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Technical Report.
- Valeriani, D., Ciliberto, C., & Gales, M. (2023). Geometry of the Loss Landscape in Overparameterized Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS).
- Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 small. In International Conference on Learning Representations (ICLR).