Published February 11, 2026 | Version v1
Preprint · Open Access

Safety Lens: White-Box Behavioral Alignment Detection in Language Models via Persona Vector Extraction

Description

We introduce Safety Lens, an open-source Python library that provides MRI-style white-box introspection for open-weight Hugging Face language models. Standard evaluation of language model (LM) safety treats models as black boxes, assessing what a model says without examining how it arrives at its response internally. Safety Lens enables researchers and practitioners to detect behavioral personas, such as sycophancy, deception, and refusal, by analyzing internal transformer activations rather than output text alone. The core technique, Persona Vector Extraction via Attribute Difference (PV-EAT), computes a unit-length direction in activation space that maximally separates positive and negative behavioral examples using a difference-in-means over hidden states. Scanning a model's response to a new prompt along this direction yields a scalar alignment score quantifying the degree to which the model's internal state exhibits the target persona. Safety Lens supports eight major transformer architectures (GPT-2, LLaMA, Mistral, Qwen, OPT, Falcon, BLOOM, MPT), integrates with evaluation frameworks via a WhiteBoxWrapper, and provides real-time activation visualization through an interactive Gradio interface. The library is implemented in Python with a full test suite and is pip-installable. We describe the architecture, algorithm, and design decisions, and demonstrate the system on GPT-2 with pre-built stimulus sets for three safety-critical personas.
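The difference-in-means extraction and projection scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's actual API: the function names `extract_persona_vector` and `alignment_score` are hypothetical, and the arrays stand in for hidden states captured from a transformer's layers.

```python
import numpy as np

def extract_persona_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means direction between two sets of hidden states.

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim), e.g.
    activations collected while the model exhibits / avoids the persona.
    Returns a unit-length direction in activation space.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def alignment_score(activation: np.ndarray, persona_vector: np.ndarray) -> float:
    """Scalar projection of a new activation onto the persona direction."""
    return float(activation @ persona_vector)

# Toy demo with 2-dimensional "hidden states" (hypothetical data):
pos = np.array([[1.0, 0.0], [0.9, 0.1]])    # persona-positive examples
neg = np.array([[-1.0, 0.0], [-0.9, -0.1]])  # persona-negative examples
v = extract_persona_vector(pos, neg)
print(alignment_score(pos[0], v))  # positive: aligned with the persona
print(alignment_score(neg[0], v))  # negative: opposed to the persona
```

In practice the activations would come from a chosen transformer layer (e.g. via Hugging Face's `output_hidden_states=True`), and a response is scanned by scoring each token's hidden state along the extracted direction.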

Files (1.7 MB)

Maio,Anthony_D.2026_Safety_Lens_White_Box_Behavioral_Detection_via_Persona_Vector_Extraction.pdf


Additional details

Related works

Is derived from
Preprint: 10.5281/zenodo.18474841 (DOI)

Software

Repository URL
https://www.github.com/anthony-maio/safety-lens
Programming language
Python
Development Status
Active