AASE: Activation-Based AI Safety Enforcement via Lightweight Probes
Description
We introduce AASE (Activation-based AI Safety Enforcement), a framework for post-perception safety monitoring in large language models. Unlike pre-perception approaches that analyze input or output text, AASE monitors the model's internal activation patterns—what the model "understands" rather than what text it processes or generates—enabling detection of safety-relevant states before harmful outputs are produced. The framework comprises three techniques: Activation Fingerprinting (AF) for harmful content detection, Agent Action Gating (AAG) for prompt injection defense, and Activation Policy Compliance (APC) for enterprise policy enforcement. We introduce paired contrastive training to isolate safety-relevant signals from confounding factors such as topic and style, addressing signal entanglement in polysemantic activations. Validation across 7 models from 3 architecture families shows strong class separation: Gemma-2-9B achieves AUC 1.00 with 7.2σ separation across all probes. On external benchmarks, AF achieves 88–100% detection on HarmBench (87 prompts) across 6 of 7 models; AAG achieves 100% detection on InjecAgent (262 prompts) across all 7 models with AUC 0.88–1.00; APC achieves 0.97–1.00 AUC across three enterprise policies. Model size correlates with probe quality—Gemma-2-9B (7.2σ separation) outperforms Gemma-2-2B (4.3σ). All techniques survive INT4 quantization with minimal separation degradation. AASE is 3–16× faster than Llama Guard 3 (19–92ms vs 306ms depending on model) with higher TPR (88% vs 50%) at a tunable threshold, adding only 0.002ms probe overhead to existing inference.
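The core idea of paired contrastive training can be illustrated with a small sketch: activations for matched benign/harmful prompt pairs share a confounding component (topic, style), and differencing within each pair cancels that component, leaving a direction that separates the classes. The sketch below is illustrative only and assumes synthetic activations with a planted "safety" direction; the dimensions, scales, and the difference-of-means probe are our assumptions, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # illustrative hidden size

# Synthetic setup (assumption): each prompt's activation mixes a shared
# per-pair "topic" component with, for harmful prompts, a common
# safety-relevant direction plus noise.
safety_dir = rng.normal(size=d)
safety_dir /= np.linalg.norm(safety_dir)

def make_pair(topic_vec):
    """Return (benign, harmful) activations sharing the same topic."""
    benign = topic_vec + 0.1 * rng.normal(size=d)
    harmful = topic_vec + 3.0 * safety_dir + 0.1 * rng.normal(size=d)
    return benign, harmful

train_pairs = [make_pair(0.5 * rng.normal(size=d)) for _ in range(200)]

# Paired contrastive training: within-pair differences cancel the shared
# topic component, so the mean difference approximates the safety direction.
diffs = np.stack([h - b for b, h in train_pairs])
probe = diffs.mean(axis=0)
probe /= np.linalg.norm(probe)

def score(x):
    """Scalar probe score: projection of an activation onto the probe."""
    return float(x @ probe)

# Evaluate class separation (in sigma) on held-out pairs.
test_pairs = [make_pair(0.5 * rng.normal(size=d)) for _ in range(100)]
b_scores = np.array([score(b) for b, _ in test_pairs])
h_scores = np.array([score(h) for _, h in test_pairs])
sep = (h_scores.mean() - b_scores.mean()) / np.sqrt(
    0.5 * (h_scores.var() + b_scores.var()))
print(f"separation: {sep:.1f} sigma")
```

Because the probe is a single dot product per forward pass, the added inference cost is negligible, which is consistent with the sub-millisecond probe overhead reported above.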
Files
aase.pdf (493.7 kB, md5:3f5dcb55f6a971867f690a9b81311609)