Safety Lens: White-Box Behavioral Alignment Detection in Language Models via Persona Vector Extraction
Authors/Creators
Description
We introduce Safety Lens, an open-source Python library that provides MRI-style white-box introspection for open-weight Hugging Face language models. Standard safety evaluation treats a language model (LM) as a black box, assessing what the model says without examining how it arrives at its response internally. Safety Lens enables researchers and practitioners to detect behavioral personas, such as sycophancy, deception, and refusal, by analyzing internal transformer activations rather than output text alone. The core technique, Persona Vector Extraction via Attribute Difference (PV-EAT), computes a unit-length direction in activation space that maximally separates positive and negative behavioral examples using difference-in-means on hidden states. Scanning a model’s response to a new prompt along this direction yields a scalar alignment score quantifying the degree to which the model’s internal state exhibits the target persona. Safety Lens supports eight major transformer architectures (GPT-2, LLaMA, Mistral, Qwen, OPT, Falcon, BLOOM, MPT), integrates with evaluation frameworks via a WhiteBoxWrapper, and provides real-time activation visualization through an interactive Gradio interface. The library is implemented in Python, ships with a full test suite, and is pip-installable. We describe the architecture, algorithm, and design decisions, and demonstrate the system on GPT-2 with pre-built stimulus sets for three safety-critical personas.
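The difference-in-means step and the scalar score described above can be illustrated with a minimal, dependency-free sketch. It does not use the Safety Lens API: the function names (`persona_direction`, `alignment_score`) are hypothetical, plain Python lists stand in for hidden-state tensors, and the toy numbers are invented. The layer and token position from which the library actually reads hidden states are not stated here, so the sketch assumes one vector per example has already been extracted, and it assumes the score is a dot product with the unit direction.

```python
import math


def mean_vector(rows):
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_direction(pos_acts, neg_acts):
    """Unit-length difference-in-means direction (positive minus negative)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]


def alignment_score(activation, direction):
    """Scalar projection of one activation onto the persona direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy 3-dimensional "hidden states" (invented numbers, not real model output).
positive = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]  # examples exhibiting the persona
negative = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]  # examples lacking it

direction = persona_direction(positive, negative)

# Only the component along the direction contributes to the score;
# the large second coordinate is orthogonal to it and is ignored.
score = alignment_score([1.5, 7.0, 1.0], direction)
```

In this toy case the two classes differ only in the first coordinate, so the extracted direction lies along that axis and the score of a new activation is simply its first component. With real models the same arithmetic would be applied to high-dimensional hidden states.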
Files
Maio,Anthony_D.2026_Safety_Lens_White_Box_Behavioral_Detection_via_Persona_Vector_Extraction.pdf
| Checksum | Size |
|---|---|
| md5:aff32a695ec541d955a642293045938e | 1.7 MB |
| md5:017764c1da89a7a11d0ff7fb392f77f3 | 10.1 kB |
Additional details
Related works
- Is derived from
- Preprint: 10.5281/zenodo.18474841 (DOI)
Software
- Repository URL
- https://www.github.com/anthony-maio/safety-lens
- Programming language
- Python
- Development Status
- Active