Published February 11, 2026 | Version v1
Preprint · Open Access

Safety Lens: White-Box Behavioral Alignment Detection in Language Models via Persona Vector Extraction

Description

We introduce Safety Lens, an open-source Python library that provides MRI-style white-box introspection for open-weight Hugging Face language models. Standard evaluation of language model (LM) safety treats models as black boxes, assessing what a model says without examining how it arrives at its response internally. Safety Lens enables researchers and practitioners to detect behavioral personas, such as sycophancy, deception, and refusal, by analyzing internal transformer activations rather than output text alone. The core technique, Persona Vector Extraction via Attribute Difference (PV-EAT), computes a unit-length direction in activation space that maximally separates positive and negative behavioral examples using a difference-in-means over hidden states. Scanning a model's response to a new prompt along this direction yields a scalar alignment score quantifying the degree to which the model's internal state exhibits the target persona. Safety Lens supports eight major transformer architectures (GPT-2, LLaMA, Mistral, Qwen, OPT, Falcon, BLOOM, MPT), integrates with evaluation frameworks via a WhiteBoxWrapper, and provides real-time activation visualization through an interactive Gradio interface. The library is implemented in Python with a full test suite and is pip-installable. We describe the architecture, algorithm, and design decisions, and demonstrate the system on GPT-2 with pre-built stimulus sets for three safety-critical personas.
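The difference-in-means extraction and projection scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's actual API: the function names `extract_persona_vector` and `alignment_score` are hypothetical, and the arrays stand in for hidden states captured from a transformer's layers.

```python
import numpy as np

def extract_persona_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means direction between two sets of hidden states.

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim), e.g.
    activations collected while the model exhibits / avoids the persona.
    Returns a unit-length direction in activation space.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def alignment_score(activation: np.ndarray, persona_vector: np.ndarray) -> float:
    """Scalar projection of a new activation onto the persona direction."""
    return float(activation @ persona_vector)

# Toy demo with 2-dimensional "hidden states" (hypothetical data):
pos = np.array([[1.0, 0.0], [0.9, 0.1]])    # persona-positive examples
neg = np.array([[-1.0, 0.0], [-0.9, -0.1]])  # persona-negative examples
v = extract_persona_vector(pos, neg)
print(alignment_score(pos[0], v))  # positive: aligned with the persona
print(alignment_score(neg[0], v))  # negative: opposed to the persona
```

In practice the activations would come from a chosen transformer layer (e.g. via Hugging Face's `output_hidden_states=True`), and a response is scanned by scoring each token's hidden state along the extracted direction.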

Files (1.7 MB)

Maio,Anthony_D.2026_Safety_Lens_White_Box_Behavioral_Detection_via_Persona_Vector_Extraction.pdf


Additional details

Related works

Is derived from
Preprint: 10.5281/zenodo.18474841 (DOI)

Software

Repository URL
https://www.github.com/anthony-maio/safety-lens
Programming language
Python
Development Status
Active