Neurocognitive Calibration System (NCS): Detecting Hallucinations in Small Language Models via Intermediate Activation Probing
Description
Hallucination in large language models (LLMs) remains a critical barrier to safe deployment in high-stakes domains such as mental health, clinical decision support, and legal research. Existing mitigation approaches operate predominantly at the output level, leaving the internal representational mechanisms that give rise to hallucination largely unaddressed. We present the Neurocognitive Calibration System (NCS), a lightweight post hoc framework for detecting hallucination in small language models by probing intermediate-layer activations. Motivated by neurocognitive models of uncertainty representation in biological neural systems, NCS trains a linear probe on residual stream activations extracted at intermediate transformer layers. It requires no model retraining, no access to logits, and no modification of the inference pipeline. We evaluate NCS on the TruthfulQA benchmark using facebook/opt-1.3b (n=400, balanced), achieving a peak probe accuracy of 77.5% and an AUC-ROC of 0.855 at layer 22 of 24 (roughly 92% of model depth), compared to 55.0% accuracy at layer 1. We observe a consistent monotonic improvement in probe discriminability with layer depth, with the largest gains occurring in the middle-to-late layers (layers 12 to 22), suggesting that hallucination-relevant uncertainty signals are encoded progressively in deeper, more abstract representations. Principal component analysis (PCA) of layer 22 activations shows that hallucinated and truthful outputs are partially, but not completely, separable in the principal component space, which explains why a learned linear classifier is useful. These results provide empirical support for the hypothesis that hallucination likelihood is reflected in the internal activation geometry of small LLMs before generation completes, and that this signal can be extracted with a lightweight post hoc probe.
NCS has direct application to safety-critical edge deployments where retraining is infeasible and inference latency constraints preclude ensemble or sampling-based detection methods.
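The linear-probe idea described above can be illustrated with a minimal sketch. This is not the NCS implementation: the activations are synthetic Gaussian clouds standing in for residual stream vectors at a single layer, and every function name, the 16-dimensional feature size, the `shift` parameter, the 320/80 split, and the label convention (1 = hallucinated) are assumptions made only for illustration. It shows a logistic-regression probe trained by gradient descent, then scored by accuracy and AUC-ROC on held-out examples.

```python
import math
import random


def make_synthetic_activations(n, dim, shift, rng):
    """Hypothetical stand-in for one layer's activations: two Gaussian
    clouds whose means differ by 2 * shift in every dimension."""
    data = []
    for i in range(n):
        label = i % 2  # balanced; assumed convention: 1 = hallucinated
        center = shift if label == 1 else -shift
        data.append(([rng.gauss(center, 1.0) for _ in range(dim)], label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Probe output: estimated probability that x belongs to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(train, dim, lr=0.1, epochs=100):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w, b, n = [0.0] * dim, 0.0, len(train)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in train:
            err = score(w, b, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, data):
    """Fraction of examples classified correctly at a 0.5 threshold."""
    correct = sum((score(w, b, x) >= 0.5) == (y == 1) for x, y in data)
    return correct / len(data)


def auc_roc(w, b, data):
    """AUC as the probability that a random positive example outscores
    a random negative one (ties count as one half)."""
    pos = [score(w, b, x) for x, y in data if y == 1]
    neg = [score(w, b, x) for x, y in data if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
data = make_synthetic_activations(n=400, dim=16, shift=0.5, rng=rng)
train, test = data[:320], data[320:]  # assumed split, for illustration only
w, b = train_linear_probe(train, dim=16)
probe_accuracy = accuracy(w, b, test)
probe_auc = auc_roc(w, b, test)
print(f"accuracy={probe_accuracy:.3f}  auc={probe_auc:.3f}")
```

In a real setting the synthetic vectors would be replaced by activations extracted from the model at the chosen layer, and the probe would be refit per layer to compare discriminability across depth. The numbers this sketch prints describe only the synthetic data and say nothing about the reported results.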
Files
| Name | Size | MD5 |
|---|---|---|
| NCS_Thorat_2026_final.pdf | 145.7 kB | 5fe4feb5592b5d769b2fb9b3a1850b75 |