Refusal Geometry in LLMs: A Universal Property of Late MLP Layers
Evidence from Grassmannian Subspace Analysis Across Three Model Families
Description
We investigate the geometric structure of refusal behavior in large language models using Grassmannian subspace analysis across three independently trained model families: Llama 3.1 8B (Meta), Mistral 7B (Mistral AI), and Gemma 2 9B (Google). Our key finding is that refusal geometry consistently peaks in the last 10% of transformer layers across all three architectures. On a held-out test set of 30 pairs (from 100 total), Llama achieves an out-of-distribution (OOD) AUC of 0.969. MLP layers 21-29 causally construct the refusal subspace (ablation delta = -0.396), and the refusal and hallucination subspaces are nearly orthogonal (Grassmann distance = 0.944), consistent with the superposition hypothesis. Research conducted with assistance from Claude (Anthropic).
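The Grassmann distances reported above can be computed from the principal angles between two subspaces' orthonormal bases. Below is a minimal NumPy sketch of this standard construction; the function names and the choice of the geodesic distance (the 2-norm of the principal angles) are illustrative assumptions, not the code or exact convention used in the paper.

```python
import numpy as np

def orthonormal_basis(X):
    """Return an orthonormal basis for the column space of X via QR."""
    Q, _ = np.linalg.qr(X)
    return Q

def grassmann_distance(A, B):
    """Geodesic Grassmann distance between the column spaces of A and B.

    The singular values of Qa^T Qb are the cosines of the principal
    angles theta_i; the distance is sqrt(sum_i theta_i^2).
    """
    Qa, Qb = orthonormal_basis(A), orthonormal_basis(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    theta = np.arccos(np.clip(s, -1.0, 1.0))  # guard against rounding
    return float(np.linalg.norm(theta))

# Example: two subspaces of a (toy) activation space
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 4))   # basis vectors as columns
B = rng.standard_normal((64, 4))
print(grassmann_distance(A, B))
```

Two identical subspaces give distance ~0, and two fully orthogonal k-dimensional subspaces give sqrt(k) * pi/2, so a large distance between the refusal and hallucination subspaces indicates near-orthogonality.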
Files
DSAOP_v2.pdf
Additional details
Dates
- Created: 2026-03-07
Software
- Programming language: Python
References
- Arditi et al. (2024). Refusal in Language Models Is Mediated by a Single Direction. NeurIPS 2024.
- Elhage et al. (2022). Toy Models of Superposition. Transformer Circuits Thread.
- Geva et al. (2023). Dissecting Recall of Factual Associations in GPT Models. EMNLP 2023.
- HaloScope (2024). Identifying Hallucinations via Activation Space Analysis. NeurIPS 2024 Spotlight.
- HARP (2025). Hallucination-Aware Reasoning via Subspace Projection. arXiv 2025.
- Meng et al. (2022). Locating and Editing Factual Associations in GPT. NeurIPS 2022.
- Pan et al. (2025). Multi-Dimensional Refusal Subspaces in LLMs. ICML 2025.
- Vu & Nguyen (2025). Angular Steering for LLM Alignment. NeurIPS 2025 Spotlight.