Reading Behavior from the Inside: Length-Residualized Behavioral Probes for Zero-Shot Hallucination and Deception Detection Across Model Architectures
Authors/Creators
Description
We present a behavioral-audit method that reads a language model's internal residual-stream activations to detect undesired behaviors — hallucination, deception, manipulation, and sycophancy — and to certify fine-tunes that reduce those behaviors without degrading capability. The method combines a raw (un-normalized) sparse-autoencoder dictionary, a 16-dimensional fiber-bundle projection with response-length residualization, and adversarially trained linear probes. On held-out data with response length matched between classes — so a probe cannot cheat on answer length — the hallucination and deception probes reach AUC 0.94–0.998 against a permutation null near 0.53. A single anchor probe trained once on Llama-3.3-70B transfers zero-shot to twelve cryptographically signed architectures (410M–72B parameters), including state-space (Mamba) and attention-free (RWKV) designs, via a label-free per-model normalizer that requires no target-side labels or gradient training. A certified anti-hallucination fine-tune reduces confident-wrong output by up to 85.8% while a separate, signed capability measurement confirms the underlying task ability is preserved. We extend the protocol to recurrent world models in robotics simulators and to vision and genomics shortcut removal. Throughout, we preserve and sign honest negatives — including a world-model environment where the raw observation outperforms the latent state, and a feature-ablation control that proves insufficient. Every figure is backed by a dual-Ed25519-signed, independently reproducible artifact.
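As an illustration of two ideas named in the description, the sketch below shows response-length residualization and a permutation-null AUC on synthetic data. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names (`residualize_on_length`, `auc`, `permutation_null_auc`), the ordinary-least-squares residualization, and the synthetic features are all hypothetical. The record does not specify how the 16-dimensional projection, the probes, or the null near 0.53 are constructed, and a plain label shuffle like this one centers near 0.5.

```python
import numpy as np

def residualize_on_length(features, lengths):
    """Remove the part of each feature column that is linearly predictable
    from response length (OLS with intercept). Assumed form, for illustration."""
    lengths = np.asarray(lengths, dtype=float)
    design = np.column_stack([np.ones(len(lengths)), lengths])
    coef, *_ = np.linalg.lstsq(design, features, rcond=None)
    return features - design @ coef

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U). Assumes no tied scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def permutation_null_auc(scores, labels, n_perm=200, seed=0):
    """Mean AUC after shuffling labels: the chance-level reference."""
    rng = np.random.default_rng(seed)
    return float(np.mean([auc(scores, rng.permutation(labels))
                          for _ in range(n_perm)]))

# Synthetic example (hypothetical data): length leaks the label.
rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, size=n)
lengths = rng.normal(100, 20, size=n) + 30 * labels   # longer answers in class 1
signal = labels + rng.normal(0, 0.5, size=n)          # feature with genuine signal
leak = 0.05 * lengths + rng.normal(0, 0.1, size=n)    # feature driven only by length
features = np.column_stack([signal, leak])

resid = residualize_on_length(features, lengths)

print("AUC of length-only feature, raw:          ", auc(features[:, 1], labels))
print("AUC of length-only feature, residualized: ", auc(resid[:, 1], labels))
print("AUC of signal feature, residualized:      ", auc(resid[:, 0], labels))
print("Permutation-null AUC:                     ", permutation_null_auc(resid[:, 0], labels))
```

After residualization, every feature column is linearly uncorrelated with length, so a linear probe trained on `resid` cannot score by reading length alone. This complements, rather than replaces, evaluating on a length-matched held-out set as the description states.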
Files
| Name | Size | MD5 |
|---|---|---|
| Proprioceptor_Probe_Research_Paper.pdf | 365.1 kB | 9be6091553843ab4be37cd470e71299f |