Published March 16, 2026 | Version v1
Conference paper | Open

Safety Beyond the Interface: Detecting Harm via Latent LLM States

  • 1. Wrynx Inc

Description

External guardrails for LLM safety add latency and compute overhead while remaining blind to the model's internal reasoning. We ask: does the model already know when content is harmful? We extract activations from LLaMA-3.1-8B and train lightweight MLP classifier probes (12.6M parameters) to detect harmful prompts. Evaluated on WildJailbreak, BeaverTails, and AEGIS 2.0, our probes achieve F1 scores of 99%, 83%, and 84%, respectively. These results are competitive with guard models more than 1000× larger, while cutting latency and compute costs.
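
The probing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the activations have already been extracted and pooled into fixed-size vectors, and it replaces them with synthetic Gaussian clusters so the example runs on its own. The names `make_synthetic_activations`, `MLPProbe`, `f1_score`, and `run_demo`, the tiny dimensions, and the training settings are all hypothetical choices for illustration; the probe in the paper has 12.6M parameters and reads real LLaMA-3.1-8B states.

```python
import math
import random


def make_synthetic_activations(n, dim, rng):
    """Synthetic stand-in for pooled hidden states (assumption, not real data).

    Label 1 ("harmful") vectors are centred at +1 per dimension,
    label 0 ("benign") vectors at -1, both with unit Gaussian noise.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 1.0 if label == 1 else -1.0
        data.append(([rng.gauss(centre, 1.0) for _ in range(dim)], label))
    return data


class MLPProbe:
    """One-hidden-layer MLP with tanh units and a sigmoid output."""

    def __init__(self, dim, hidden, rng):
        self.w1 = [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0.0, 0.3) for _ in range(hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        return h, 1.0 / (1.0 + math.exp(-z))

    def train_step(self, x, y, lr):
        """One SGD step on binary cross-entropy for a single example."""
        h, p = self.forward(x)
        dz = p - y  # gradient of the loss with respect to the output logit
        for j in range(len(self.w2)):
            # Backpropagate through tanh using the pre-update output weight.
            dh = dz * self.w2[j] * (1.0 - h[j] ** 2)
            self.w2[j] -= lr * dz * h[j]
            for i in range(len(x)):
                self.w1[j][i] -= lr * dh * x[i]
            self.b1[j] -= lr * dh
        self.b2 -= lr * dz

    def predict(self, x):
        return 1 if self.forward(x)[1] >= 0.5 else 0


def f1_score(labels, preds):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive ("harmful") class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def run_demo(seed=0, dim=8, hidden=4, epochs=20, lr=0.05):
    """Train the probe on synthetic vectors and return held-out F1."""
    rng = random.Random(seed)
    train = make_synthetic_activations(400, dim, rng)
    test = make_synthetic_activations(200, dim, rng)
    probe = MLPProbe(dim, hidden, rng)
    for _ in range(epochs):
        rng.shuffle(train)
        for x, y in train:
            probe.train_step(x, y, lr)
    labels = [y for _, y in test]
    preds = [probe.predict(x) for x, _ in test]
    return f1_score(labels, preds)


if __name__ == "__main__":
    print(f"held-out F1 on synthetic data: {run_demo():.3f}")
```

In a real setting, the synthetic vectors would be replaced by hidden states read from the language model during its normal forward pass, which is why a probe of this kind adds little cost compared with running a separate guard model. The F1 printed here reflects only the easy synthetic task and says nothing about the benchmark results reported above.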

Files

latent_space_probes_preprint.pdf

Files (5.5 MB)

md5:c970d25adc61bbf3a71eebdd102a3042