Reading Behavior from the Inside: Length-Residualized Behavioral Probes for Zero-Shot Hallucination and Deception Detection Across Model Architectures

Napolitano, Logan Matthew

doi:10.5281/zenodo.20808218

Published June 23, 2026 | Version v1

Preprint Open

We present a behavioral-audit method that reads a language model's internal residual-stream activations to detect undesired behaviors — hallucination, deception, manipulation, and sycophancy — and to certify fine-tunes that reduce those behaviors without degrading capability. The method combines a raw (un-normalized) sparse-autoencoder dictionary, a 16-dimensional fiber-bundle projection with response-length residualization, and adversarially trained linear probes. On held-out data with response length matched between classes — so a probe cannot cheat on answer length — the hallucination and deception probes reach AUC 0.94–0.998 against a permutation null near 0.53. A single anchor probe trained once on Llama-3.3-70B transfers zero-shot to twelve cryptographically signed architectures (410M–72B parameters), including state-space (Mamba) and attention-free (RWKV) designs, via a label-free per-model normalizer that requires no target-side labels or gradient training. A certified anti-hallucination fine-tune reduces confident-wrong output by up to 85.8% while a separate, signed capability measurement confirms the underlying task ability is preserved. We extend the protocol to recurrent world models in robotics simulators and to vision and genomics shortcut removal. Throughout, we preserve and sign honest negatives — including a world-model environment where the raw observation outperforms the latent state, and a feature-ablation control that proves insufficient. Every figure is backed by a dual-Ed25519-signed, independently reproducible artifact.
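The length control described in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's pipeline: it assumes a scalar probe score per response, removes only the linear effect of response length by ordinary least squares, and estimates a chance-level AUC by shuffling labels. The function names (`residualize`, `auc`, `permutation_null_auc`, `length_confound_demo`) and the synthetic data are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def residualize(scores, lengths):
    """Subtract the least-squares linear fit of score on response length."""
    mean_len, mean_score = mean(lengths), mean(scores)
    var_len = sum((x - mean_len) ** 2 for x in lengths)
    if var_len == 0:
        slope = 0.0  # all lengths equal: nothing to regress out
    else:
        cov = sum((x - mean_len) * (y - mean_score)
                  for x, y in zip(lengths, scores))
        slope = cov / var_len
    return [y - mean_score - slope * (x - mean_len)
            for x, y in zip(lengths, scores)]


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_null_auc(scores, labels, n_perm=200, seed=0):
    """Mean AUC after shuffling labels, as an estimate of chance level."""
    rng = random.Random(seed)
    shuffled = list(labels)
    aucs = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        aucs.append(auc(scores, shuffled))
    return mean(aucs)


def length_confound_demo(n=200, seed=0):
    """Synthetic case (an assumption): the score tracks length, not behavior."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    # Positive-class responses are longer on average.
    lengths = [50 + 30 * y + rng.gauss(0, 10) for y in labels]
    # The "probe" score depends on length plus label-independent noise.
    scores = [0.02 * x + rng.gauss(0, 0.1) for x in lengths]
    raw = auc(scores, labels)
    resid = auc(residualize(scores, lengths), labels)
    null = permutation_null_auc(scores, labels)
    return raw, resid, null
```

In this synthetic setting the raw score separates the classes only because positive responses are longer, so the AUC should fall toward chance once length is regressed out. A probe that still scores well after such a control, or on length-matched held-out data as in the abstract, is less likely to be exploiting answer length. Note that linear residualization also removes any genuine signal that happens to correlate with length.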

Files

Proprioceptor_Probe_Research_Paper.pdf (365.1 kB), md5:9be6091553843ab4be37cd470e71299f

