Published March 16, 2026 | Version v1 | Conference paper | Open
Safety Beyond the Interface: Detecting Harm via Latent LLM States
Description
External guardrails for LLM safety add latency and compute overhead while remaining blind to the model's internal reasoning. We ask: does the model already know when content is harmful? We extract activations from LLaMA-3.1-8B and train lightweight MLP classifier probes (12.6M parameters) to detect harmful prompts. Evaluated on WildJailbreak, Beavertails, and AEGIS 2.0, our probes achieve F1 scores of 99%, 83%, and 84%, respectively, which is competitive with guard models over 1000× larger while cutting latency and compute costs.
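The core idea above — training a small MLP probe on frozen-model activations to classify prompts as harmful or benign — can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the dimensions, hyperparameters, and synthetic "activations" (random vectors standing in for LLaMA-3.1-8B hidden states, whose real width is 4096) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n = 64, 32, 400  # toy sizes, far smaller than the paper's probe

# Synthetic stand-in for extracted activations: two classes separated
# along a random direction, labeled 1 = "harmful", 0 = "benign".
direction = rng.normal(size=d_model)
X = rng.normal(size=(n, d_model))
y = (X @ direction > 0).astype(float)

# Two-layer MLP probe: ReLU hidden layer, sigmoid output, trained
# with full-batch gradient descent on binary cross-entropy.
W1 = rng.normal(scale=0.1, size=(d_model, d_hidden)); b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.1, size=d_hidden); b2 = 0.0

lr = 0.5
for _ in range(1000):
    h = np.maximum(X @ W1 + b1, 0)           # hidden ReLU features
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))     # P(harmful | activation)
    g = (p - y) / n                          # d(BCE)/d(logit)
    W2 -= lr * (h.T @ g); b2 -= lr * g.sum()
    gh = np.outer(g, W2) * (h > 0)           # backprop through ReLU
    W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)

acc = ((p > 0.5) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

Because the probe only consumes activations the model has already computed, inference adds one small forward pass rather than a separate guard-model call.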
Files (5.5 MB)

| Name | Size | Checksum |
|---|---|---|
| latent_space_probes_preprint.pdf | 5.5 MB | md5:c970d25adc61bbf3a71eebdd102a3042 |