AASE: Activation-Based AI Safety Enforcement via Lightweight Probes
Description
We introduce AASE (Activation-based AI Safety Enforcement), a framework for post-perception safety monitoring in large language models. Unlike pre-perception approaches that analyze input or output text, AASE monitors the model's internal activation patterns—what the model "understands" rather than what text it processes or generates—enabling detection of safety-relevant states before harmful outputs are produced. The framework comprises three techniques: Activation Fingerprinting (AF) for harmful content detection, Agent Action Gating (AAG) for prompt injection defense, and Activation Policy Compliance (APC) for enterprise policy enforcement. We introduce paired contrastive training to isolate safety-relevant signals from confounding factors such as topic and style, addressing signal entanglement in polysemantic activations. Validation across 7 models from 3 architecture families shows strong class separation: Gemma-2-9B achieves AUC 1.00 with 7.2σ separation across all probes. On external benchmarks, AF achieves 88–100% detection on HarmBench (87 prompts) across 6 of 7 models; AAG achieves 100% detection on InjecAgent (262 prompts) across all 7 models with AUC 0.88–1.00; APC achieves 0.97–1.00 AUC across three enterprise policies. Model size correlates with probe quality—Gemma-2-9B (7.2σ separation) outperforms Gemma-2-2B (4.3σ). All techniques survive INT4 quantization with minimal separation degradation. AASE is 3–16× faster than Llama Guard 3 (19–92ms vs 306ms depending on model) with higher TPR (88% vs 50%) at a tunable threshold, adding only 0.002ms probe overhead to existing inference.
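The core idea of paired contrastive training can be illustrated with a small sketch: activations for matched benign/harmful prompt pairs share a confounding component (topic, style), and differencing within each pair cancels that component, leaving a direction that separates the classes. The sketch below is illustrative only and assumes synthetic activations with a planted "safety" direction; the dimensions, scales, and the difference-of-means probe are our assumptions, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # illustrative hidden size

# Synthetic setup (assumption): each prompt's activation mixes a shared
# per-pair "topic" component with, for harmful prompts, a common
# safety-relevant direction plus noise.
safety_dir = rng.normal(size=d)
safety_dir /= np.linalg.norm(safety_dir)

def make_pair(topic_vec):
    """Return (benign, harmful) activations sharing the same topic."""
    benign = topic_vec + 0.1 * rng.normal(size=d)
    harmful = topic_vec + 3.0 * safety_dir + 0.1 * rng.normal(size=d)
    return benign, harmful

train_pairs = [make_pair(0.5 * rng.normal(size=d)) for _ in range(200)]

# Paired contrastive training: within-pair differences cancel the shared
# topic component, so the mean difference approximates the safety direction.
diffs = np.stack([h - b for b, h in train_pairs])
probe = diffs.mean(axis=0)
probe /= np.linalg.norm(probe)

def score(x):
    """Scalar probe score: projection of an activation onto the probe."""
    return float(x @ probe)

# Evaluate class separation (in sigma) on held-out pairs.
test_pairs = [make_pair(0.5 * rng.normal(size=d)) for _ in range(100)]
b_scores = np.array([score(b) for b, _ in test_pairs])
h_scores = np.array([score(h) for _, h in test_pairs])
sep = (h_scores.mean() - b_scores.mean()) / np.sqrt(
    0.5 * (h_scores.var() + b_scores.var()))
print(f"separation: {sep:.1f} sigma")
```

Because the probe is a single dot product per forward pass, the added inference cost is negligible, which is consistent with the sub-millisecond probe overhead reported above.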
Files
aase.pdf (493.7 kB, md5:3f5dcb55f6a971867f690a9b81311609)