There is a newer version of the record available.

Published March 8, 2026 | Version 2.0
Preprint | Open Access

Refusal Geometry in LLMs: A Universal Property of Late MLP Layers. Evidence from Grassmannian Subspace Analysis Across Three Model Families

Description

We investigate the geometric structure of refusal behavior in large language models using Grassmannian subspace analysis across three independently trained model families: Llama 3.1 8B (Meta), Mistral 7B (Mistral AI), and Gemma 2 9B (Google). Our key finding is that refusal geometry consistently peaks in the final 10% of transformer layers in all three architectures. On a held-out test set of 30 pairs (from 100 total), Llama achieves an out-of-distribution (OOD) AUC of 0.969. Ablation shows that MLP layers 21-29 causally construct the refusal subspace (ablation delta = -0.396), and the refusal and hallucination subspaces are nearly orthogonal (Grassmann distance = 0.944), consistent with the superposition hypothesis. Research conducted with assistance from Claude (Anthropic).
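The Grassmann distance quoted above can be computed from the principal angles between two subspaces. As a minimal sketch (not the paper's code): the normalization by the maximum possible distance, so that values fall in [0, 1], and the random test subspaces are assumptions for illustration; SciPy's `subspace_angles` does the heavy lifting.

```python
import numpy as np
from scipy.linalg import subspace_angles

def grassmann_distance(A, B):
    """Normalized geodesic Grassmann distance between column spaces of A and B.

    A, B: (d, k) matrices whose columns span each subspace.
    Computes sqrt(sum of squared principal angles), then divides by the
    maximum possible value sqrt(k) * pi/2 so the result lies in [0, 1].
    (The normalization choice is an assumption, not taken from the paper.)
    """
    theta = subspace_angles(A, B)          # principal angles in [0, pi/2]
    k = len(theta)
    return np.sqrt(np.sum(theta**2)) / (np.sqrt(k) * np.pi / 2)

rng = np.random.default_rng(0)
d, k = 64, 4
A = rng.standard_normal((d, k))
B = rng.standard_normal((d, k))

# A subspace has zero distance to itself; two independent random
# subspaces in a high-dimensional space are nearly orthogonal,
# so their normalized distance is close to 1.
print(grassmann_distance(A, A))
print(grassmann_distance(A, B))
```

Under this normalization, a distance of 0.944 between the refusal and hallucination subspaces indicates near-orthogonality, as the description states.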

Files (359.7 kB)

DSAOP_v2.pdf

md5:36147ccbd115139bef4ce1829931a3d7 (9.7 kB)
md5:177bfb5ef38cce0629e65413536b858b (11.6 kB)
md5:135a0ea4c931144a9442e65a1235242e (11.3 kB)
md5:a7fda01903013984d75b41c41ba5f679 (327.2 kB)

Additional details

Dates

Created
2026-03-07

Software

Programming language
Python

References

  • Arditi et al. (2024). Refusal in Language Models Is Mediated by a Single Direction. NeurIPS 2024.
  • Elhage et al. (2022). Toy Models of Superposition. Transformer Circuits Thread.
  • Geva et al. (2023). Dissecting Recall of Factual Associations in GPT Models. EMNLP 2023.
  • HaloScope (2024). Identifying Hallucinations via Activation Space Analysis. NeurIPS 2024 Spotlight.
  • HARP (2025). Hallucination-Aware Reasoning via Subspace Projection. arXiv 2025.
  • Meng et al. (2022). Locating and Editing Factual Associations in GPT. NeurIPS 2022.
  • Pan et al. (2025). Multi-Dimensional Refusal Subspaces in LLMs. ICML 2025.
  • Vu & Nguyen (2025). Angular Steering for LLM Alignment. NeurIPS 2025 Spotlight.