Published March 16, 2026 | Version v1 | Conference paper | Open
Safety Beyond the Interface: Detecting Harm via Latent LLM States
Description
External guardrails for LLM safety add latency and compute overhead while remaining blind to the model's internal reasoning. We ask: does the model already know when content is harmful? We extract activations from LLaMA-3.1-8B and train lightweight MLP classifier probes (12.6M parameters) to detect harmful prompts. Evaluated on WildJailbreak, Beavertails, and AEGIS 2.0, our probes achieve F1 scores of 99%, 83%, and 84%, respectively, which is competitive with guard models over 1000× larger while cutting latency and compute costs.
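The core idea above — training a small MLP probe on frozen-model activations to classify prompts as harmful or benign — can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the dimensions, hyperparameters, and synthetic "activations" (random vectors standing in for LLaMA-3.1-8B hidden states, whose real width is 4096) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n = 64, 32, 400  # toy sizes, far smaller than the paper's probe

# Synthetic stand-in for extracted activations: two classes separated
# along a random direction, labeled 1 = "harmful", 0 = "benign".
direction = rng.normal(size=d_model)
X = rng.normal(size=(n, d_model))
y = (X @ direction > 0).astype(float)

# Two-layer MLP probe: ReLU hidden layer, sigmoid output, trained
# with full-batch gradient descent on binary cross-entropy.
W1 = rng.normal(scale=0.1, size=(d_model, d_hidden)); b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.1, size=d_hidden); b2 = 0.0

lr = 0.5
for _ in range(1000):
    h = np.maximum(X @ W1 + b1, 0)           # hidden ReLU features
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))     # P(harmful | activation)
    g = (p - y) / n                          # d(BCE)/d(logit)
    W2 -= lr * (h.T @ g); b2 -= lr * g.sum()
    gh = np.outer(g, W2) * (h > 0)           # backprop through ReLU
    W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)

acc = ((p > 0.5) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

Because the probe only consumes activations the model has already computed, inference adds one small forward pass rather than a separate guard-model call.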
Files (5.5 MB)

| Name | Size | Checksum |
|---|---|---|
| latent_space_probes_preprint.pdf | 5.5 MB | md5:c970d25adc61bbf3a71eebdd102a3042 |