Published June 23, 2026 | Version v1

Reading Behavior from the Inside: Length-Residualized Behavioral Probes for Zero-Shot Hallucination and Deception Detection Across Model Architectures

Authors/Creators

Description

We present a behavioral-audit method that reads a language model's internal residual-stream activations to detect undesired behaviors (hallucination, deception, manipulation, and sycophancy) and to certify fine-tunes that reduce those behaviors without degrading capability. The method combines a raw (un-normalized) sparse-autoencoder dictionary, a 16-dimensional fiber-bundle projection with response-length residualization, and adversarially trained linear probes. On held-out data with response length matched between classes, so that a probe cannot exploit answer length, the hallucination and deception probes reach AUC 0.94–0.998 against a permutation null near 0.53. A single anchor probe, trained once on Llama-3.3-70B, transfers zero-shot to twelve cryptographically signed architectures (410M–72B parameters), including state-space (Mamba) and attention-free (RWKV) designs, via a label-free per-model normalizer that requires no target-side labels or gradient training. A certified anti-hallucination fine-tune reduces confident-wrong output by up to 85.8%, while a separate, signed capability measurement confirms that the underlying task ability is preserved. We extend the protocol to recurrent world models in robotics simulators and to shortcut removal in vision and genomics. Throughout, we preserve and sign honest negatives, including a world-model environment where the raw observation outperforms the latent state and a feature-ablation control that proves insufficient. Every figure is backed by a dual-Ed25519-signed, independently reproducible artifact.
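As a rough illustration of two ideas named in the description, response-length residualization and comparison of a probe's AUC against a permutation null, the following minimal sketch may help. It is not the paper's implementation. The function names (`residualize_on_length`, `auc`, `permutation_null_auc`) are hypothetical, the residualization is assumed to be an ordinary least-squares regression of each feature on length, and the null here only shuffles labels at evaluation time. The paper's null near 0.53 may come from a different procedure, such as refitting the probe on permuted labels.

```python
import numpy as np


def residualize_on_length(features, lengths):
    """Remove the part of each feature column that is linearly
    predictable from response length (assumed OLS residualization)."""
    # Design matrix: intercept plus response length.
    design = np.column_stack([np.ones(len(lengths)), lengths])
    coef, *_ = np.linalg.lstsq(design, features, rcond=None)
    return features - design @ coef


def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U statistic); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u_stat = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


def permutation_null_auc(scores, labels, n_perm=200, seed=0):
    """Mean AUC after shuffling labels: a simplified evaluation-time null."""
    rng = np.random.default_rng(seed)
    return float(np.mean([auc(scores, rng.permutation(labels))
                          for _ in range(n_perm)]))


if __name__ == "__main__":
    # Synthetic stand-in data; nothing here comes from the paper.
    rng = np.random.default_rng(0)
    lengths = rng.integers(5, 200, size=500).astype(float)
    labels = rng.integers(0, 2, size=500)
    # 16 columns, echoing the 16-dimensional projection in the description.
    feats = np.outer(lengths, rng.normal(size=16)) + rng.normal(size=(500, 16))
    feats[:, 0] += 1.5 * labels  # inject a genuine label signal in column 0
    resid = residualize_on_length(feats, lengths)
    print("probe AUC:", auc(resid[:, 0], labels))
    print("null AUC :", permutation_null_auc(resid[:, 0], labels))
```

After residualization, each feature column has zero linear correlation with length by construction, so any remaining separation between classes cannot come from a linear length cue. Matching response length between classes on held-out data, as the description states, is a separate and stronger safeguard.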

Files

Proprioceptor_Probe_Research_Paper.pdf

Files (365.1 kB)

md5:9be6091553843ab4be37cd470e71299f
365.1 kB