Neurocognitive Calibration System (NCS): Detecting Hallucinations in Small Language Models via Intermediate Activation Probing

Thorat, Mayuri

doi:10.5281/zenodo.20471223

Published May 16, 2026 | Version v4

Preprint | Open

Thorat, Mayuri (Researcher)¹

1. SerenMind AI

Hallucination in large language models (LLMs) remains a critical barrier to safe deployment in high-stakes domains such as mental health, clinical decision support, and legal research. Existing mitigation approaches operate predominantly at the output level and leave the internal representational mechanisms that give rise to hallucination largely unaddressed. We present the Neurocognitive Calibration System (NCS), a lightweight post hoc framework that detects hallucination in small language models by probing intermediate-layer activations. Motivated by neurocognitive models of uncertainty representation in biological neural systems, NCS trains a linear probe on residual stream activations extracted at intermediate transformer layers; it requires no model retraining, no access to logits, and no modification of the inference pipeline. We evaluate NCS on the TruthfulQA benchmark using facebook/opt-1.3b (n=400, balanced), achieving a peak probe accuracy of 77.5% and an AUC-ROC of 0.855 at layer 22 of 24 (roughly 92% of model depth), compared to 55.0% accuracy at layer 1. We observe a consistent monotonic improvement in probe discriminability with layer depth, with the largest gains occurring in the middle-to-late layers (layers 12 to 22), suggesting that hallucination-relevant uncertainty signals are encoded progressively in deeper, more abstract representations. PCA of layer 22 activations shows partial but not complete separability of hallucinated and truthful outputs in principal component space, which explains why a learned linear classifier is useful. These results provide empirical support for the hypothesis that hallucination likelihood is reflected in the internal activation geometry of small LLMs before generation completes, and that this signal can be extracted with a lightweight post hoc probe.
NCS has direct application to safety-critical edge deployments where retraining is infeasible and inference latency constraints preclude ensemble or sampling-based detection methods.
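
The layer-wise probing idea described above can be illustrated with a minimal sketch. It is not the NCS implementation: the activations are synthetic Gaussian vectors rather than residual stream states from facebook/opt-1.3b, the per-layer signal strengths (0.1, 0.5, 1.0) are hypothetical values chosen only so that separability grows with depth, and every function name is invented for illustration. The sketch uses only the Python standard library; a real pipeline would extract hidden states from the model and would more likely rely on an established logistic regression implementation.

```python
import math
import random


def make_synthetic_activations(n, dim, signal, rng):
    """Synthetic stand-in for one layer's activations (NOT real model data).

    Label 1 = hallucinated, 0 = truthful; labels alternate, so classes are balanced.
    Only coordinate 0 carries label information, scaled by `signal`.
    """
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += signal if label == 1 else -signal
        X.append(row)
        y.append(label)
    return X, y


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(X, y, lr=0.5, epochs=150, l2=1e-3):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (g / n + l2 * wj) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_scores(w, b, X):
    # Raw linear scores; a score above 0 means "predicted hallucinated".
    return [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]


def accuracy(scores, labels):
    return sum((s > 0.0) == (l == 1) for s, l in zip(scores, labels)) / len(labels)


def auc_roc(scores, labels):
    # Probability that a random positive outscores a random negative (ties count 0.5).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def layer_sweep(signal_by_layer, n=400, dim=8, n_train=300, seed=0):
    """Train one probe per layer and return {layer: (accuracy, auc)} on held-out rows."""
    rng = random.Random(seed)
    results = {}
    for layer, signal in signal_by_layer.items():
        X, y = make_synthetic_activations(n, dim, signal, rng)
        w, b = train_linear_probe(X[:n_train], y[:n_train])
        scores = probe_scores(w, b, X[n_train:])
        results[layer] = (accuracy(scores, y[n_train:]), auc_roc(scores, y[n_train:]))
    return results


if __name__ == "__main__":
    # Hypothetical signal strengths; they do not reproduce the reported results.
    for layer, (acc, auc) in layer_sweep({1: 0.1, 12: 0.5, 22: 1.0}).items():
        print(f"layer {layer:2d}: accuracy={acc:.3f}  AUC-ROC={auc:.3f}")
```

Because the data are synthetic, the printed numbers say nothing about OPT-1.3b. The point is only the workflow: one probe per layer, evaluated on held-out examples, with accuracy and AUC-ROC compared across depth.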

Files

Name	Size	Checksum
NCS_Thorat_2026_final.pdf	145.7 kB	md5:5fe4feb5592b5d769b2fb9b3a1850b75

Additional details

Repository URL: https://github.com/thomayuri-ma/Neurocognitive-Calibration-System-NCS-

	All versions	This version
Views	20	10
Downloads	15	3
Data volume	15.4 MB	582.9 kB
