Refusal Geometry in LLMs: A Universal Property of Late MLP Layers
Evidence from Grassmannian Subspace Analysis Across Three Model Families
Description
We investigate the geometric structure of refusal behavior in large language models using Grassmannian subspace analysis across three independently trained model families: Llama 3.1 8B (Meta), Mistral 7B (Mistral AI), and Gemma 2 9B (Google). Our key finding is that refusal geometry consistently peaks in the last 10% of transformer layers across all three architectures. On a held-out test set of 30 pairs (from 100 total), Llama achieves an out-of-distribution (OOD) AUC of 0.969. MLP layers 21-29 causally construct the refusal subspace (ablation delta = -0.396), and the refusal and hallucination subspaces are nearly orthogonal (Grassmann distance = 0.944), consistent with the superposition hypothesis. Research conducted with assistance from Claude (Anthropic).
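The Grassmann distances reported above can be computed from the principal angles between two subspaces' orthonormal bases. Below is a minimal NumPy sketch of this standard construction; the function names and the choice of the geodesic distance (the 2-norm of the principal angles) are illustrative assumptions, not the code or exact convention used in the paper.

```python
import numpy as np

def orthonormal_basis(X):
    """Return an orthonormal basis for the column space of X via QR."""
    Q, _ = np.linalg.qr(X)
    return Q

def grassmann_distance(A, B):
    """Geodesic Grassmann distance between the column spaces of A and B.

    The singular values of Qa^T Qb are the cosines of the principal
    angles theta_i; the distance is sqrt(sum_i theta_i^2).
    """
    Qa, Qb = orthonormal_basis(A), orthonormal_basis(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    theta = np.arccos(np.clip(s, -1.0, 1.0))  # guard against rounding
    return float(np.linalg.norm(theta))

# Example: two subspaces of a (toy) activation space
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 4))   # basis vectors as columns
B = rng.standard_normal((64, 4))
print(grassmann_distance(A, B))
```

Two identical subspaces give distance ~0, and two fully orthogonal k-dimensional subspaces give sqrt(k) * pi/2, so a large distance between the refusal and hallucination subspaces indicates near-orthogonality.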
Files
DSAOP_v2.pdf
Additional details
Dates
- Created: 2026-03-07
Software
- Programming language: Python
References
- Arditi et al. (2024). Refusal in Language Models Is Mediated by a Single Direction. NeurIPS 2024.
- Elhage et al. (2022). Toy Models of Superposition. Transformer Circuits Thread.
- Geva et al. (2023). Dissecting Recall of Factual Associations in GPT Models. EMNLP 2023.
- HaloScope (2024). Identifying Hallucinations via Activation Space Analysis. NeurIPS 2024 Spotlight.
- HARP (2025). Hallucination-Aware Reasoning via Subspace Projection. arXiv 2025.
- Meng et al. (2022). Locating and Editing Factual Associations in GPT. NeurIPS 2022.
- Pan et al. (2025). Multi-Dimensional Refusal Subspaces in LLMs. ICML 2025.
- Vu & Nguyen (2025). Angular Steering for LLM Alignment. NeurIPS 2025 Spotlight.