Making LLMs Say What They Know: Probe-Targeted Fine-Tuning for Verbal Confidence Calibration

Cacioli, Jon-Paul

doi:10.5281/zenodo.20436841

Published May 29, 2026 | Version 1

Preprint Open

Cacioli, Jon-Paul (Contact person)

Background: Verbal confidence in instruct-tuned large language models fails because the readout pathway from internal correctness representations to the confidence-token position transmits little of the available signal. Linear probes on hidden states discriminate correct from incorrect responses at AUROC2 = 0.76–0.88 across seven of eight models spanning four families and three scales (the exception, Qwen 72B, is a probe outlier), yet verbal confidence saturates near ceiling (mean ~95–99%).
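
The gap described above can be illustrated with a small discrimination calculation. The following is a minimal sketch, not the authors' evaluation code: `auroc` is a hypothetical helper that computes the pairwise (Mann-Whitney) form of AUROC, and the toy scores and labels are invented for illustration only.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen correct response (label 1) receives
    a higher score than a randomly chosen incorrect one (label 0).
    Ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect response")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the paper): 1 = correct, 0 = incorrect.
labels = [1, 1, 1, 0, 0, 1]
probe_scores = [0.9, 0.7, 0.6, 0.4, 0.65, 0.8]       # varied probe outputs
verbal_conf = [0.99, 0.95, 0.99, 0.95, 0.99, 0.95]   # saturated near ceiling

probe_auroc = auroc(probe_scores, labels)
verbal_auroc = auroc(verbal_conf, labels)
```

In this toy case the probe scores separate correct from incorrect responses well, while the near-ceiling verbal confidences carry no discriminative signal, which is the pattern the background paragraph describes.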

Objectives: We introduce probe-targeted confidence-calibrated supervised fine-tuning (PT-CSFT), which uses a linear probe on a model's own hidden states to generate continuous confidence targets for LoRA fine-tuning. We evaluate whether this method closes the metacognitive gap across model families, scales, and cognitive domains.
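
As a rough illustration of how a probe could supply continuous targets, the sketch below passes a hidden-state vector through a logistic linear probe and renders the result as a training string. Everything here is an assumption made for illustration: the function names, the toy weights, and in particular the `Confidence: N%` target format, which this record does not specify.

```python
import math


def probe_confidence(hidden, weights, bias):
    """Logistic linear probe: sigmoid(w . h + b), read as P(answer is correct)."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def confidence_target(p):
    """Render a probability as a verbal-confidence training target.
    The exact string format is a hypothetical choice, not the paper's."""
    return f"Confidence: {round(p * 100)}%"


# Hypothetical 3-dimensional hidden state and probe parameters.
hidden_state = [0.2, -0.1, 0.4]
probe_weights = [1.5, 0.5, 2.0]
probe_bias = 0.0

p_correct = probe_confidence(hidden_state, probe_weights, probe_bias)
target_text = confidence_target(p_correct)
```

The design point is that the supervision signal comes from the model's own internal state rather than from a fixed or externally assigned confidence label.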

Methods: PT-CSFT modifies attention key and output projections and the three MLP projections via LoRA while leaving queries, values, and the language model head unchanged. We evaluate on eight models (7B–72B, four families) across TriviaQA, GSM8K, and ARC-Challenge. A two-stage curriculum addresses the 70B regime. The logit readout reads softmax distributions over confidence tokens as a continuous signal. Pre-registration: OSF Phase 1 (https://osf.io/ngkwc/), Phase 0 (https://osf.io/mpcr5).
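
One way to read a softmax distribution over confidence tokens as a continuous signal is to take its expected value. The sketch below assumes that interpretation; the record does not state which summary of the distribution is used, and `logit_readout` and its input format are hypothetical.

```python
import math


def logit_readout(token_logits):
    """Expected confidence in [0, 1] under a softmax over confidence tokens.

    token_logits maps each candidate confidence value (0..100, in percent)
    to the logit the model assigns to that token at the confidence position.
    """
    max_logit = max(token_logits.values())
    # Subtract the max before exponentiating for numerical stability.
    exps = {c: math.exp(l - max_logit) for c, l in token_logits.items()}
    total = sum(exps.values())
    expected_percent = sum(c * e for c, e in exps.items()) / total
    return expected_percent / 100.0


# Hypothetical logits: the sampled text would say "90", but the
# distribution still places some mass on lower confidence values.
example_logits = {50: 1.0, 70: 2.0, 90: 3.0}
readout = logit_readout(example_logits)
```

Unlike the sampled text, which collapses to a single token, this readout varies smoothly with the underlying distribution, so two responses that both print the same number can still be ranked against each other.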

Results: PT-CSFT recovers 91–115% of probe discrimination in verbal confidence at 7–32B. At 70B, a two-stage curriculum closes 66% of the gap on the logit channel (AUROC2 = 0.797), the first VRS-Valid confidence signal at 70B. On GSM8K, the logit readout achieves AUROC2 = 0.862 ± 0.013 across 10 seeds, exceeding the probe. Post-hoc Platt scaling yields ECE = 0.042. The signal transfers out-of-distribution (NQ: 0.757, PopQA: 0.834) and enables confidence-gated retrieval (2.17x accuracy differential).
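
For readers unfamiliar with the calibration metrics named above, the following sketch shows an equal-width-bin expected calibration error (ECE) and the functional form of Platt scaling. The bin count, function names, and toy data are assumptions; the parameters `a` and `b` would normally be fit on held-out data, which is not shown here.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: the weighted mean of
    |mean confidence - accuracy| over non-empty bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def platt_scale(score, a, b):
    """Platt scaling: a logistic map sigmoid(a * score + b) applied to a raw score."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Hypothetical overconfident model: always 95% confident, right half the time.
toy_conf = [0.95, 0.95, 0.95, 0.95]
toy_correct = [1, 1, 0, 0]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

Because Platt scaling is monotonic whenever `a` is positive, it changes calibration (and hence ECE) without changing the ranking of responses, so AUROC is unaffected.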

Conclusions: Controlled activation patching at the confidence-token position supports the interpretation that verbal confidence failure is a routing problem. The intervention is position-specific (mid-question control: chance), bidirectional (91% forward, 89% reverse), selective (83% of answers unchanged), and follows a near-monotonic layer-depth gradient (Spearman rho = 0.976, p < 1e-4). These observations are consistent with PT-CSFT repairing a position-specific pathway for confidence without disrupting answer computation, though they do not trace the effect to specific model components. Multi-seed replication across three model families confirms stability (Llama 8B: 0.836 ± 0.011; Qwen 7B: 0.801 ± 0.022; Mistral 7B: 0.771 ± 0.017; 6 seeds each). Wherever text confidence fails, the logit readout recovers a usable confidence signal.
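
The mechanics of position-specific activation patching can be shown on a toy model. This is a conceptual sketch only: the "layers" are simple list functions rather than transformer blocks, the donor and recipient are two different inputs to the same toy model (the record does not describe the actual donor and recipient setup), and all names are hypothetical.

```python
def causal_layer(hidden):
    """Toy layer: each position adds the value of the position before it,
    so information only flows from left to right."""
    return [h + (hidden[i - 1] if i > 0 else 0.0) for i, h in enumerate(hidden)]


def forward(layers, inputs, patch=None, cache=None):
    """Run the toy model. If patch is given, overwrite one position's
    activation after the named layer. If cache is given, record the
    activations after every layer."""
    hidden = list(inputs)
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        if patch is not None and patch["layer"] == idx:
            hidden[patch["position"]] = patch["value"]
        if cache is not None:
            cache.append(list(hidden))
    return hidden


layers = [causal_layer, causal_layer]

# Donor run: cache activations at every layer.
donor_cache = []
donor_out = forward(layers, [1.0, 2.0, 3.0], cache=donor_cache)

# Recipient run without intervention.
base_out = forward(layers, [0.0, 0.0, 0.0])

# Patch only the final position (standing in for the confidence-token
# position) after layer 0 with the donor's activation.
last = 2
patch = {"layer": 0, "position": last, "value": donor_cache[0][last]}
patched_out = forward(layers, [0.0, 0.0, 0.0], patch=patch)
```

In this toy run the patch moves the final position's output toward the donor's value while earlier positions are untouched, which mirrors the sense in which an intervention can be position-specific and selective.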

Files

Files (620.4 kB)

Name	Size
ptcsft_arxiv.pdf (md5:e8bcdaebe8011bb81123524a8292956c)	620.4 kB

Additional details

Is supplemented by: Other: https://osf.io/ngkwc/ (URL); Other: https://osf.io/mpcr5 (URL); Software: https://github.com/synthiumjp/metacog-engineering (URL)

Available: 2026

Repository URL: https://github.com/synthiumjp/metacog-engineering
Programming language: Python
Development Status: Active

	All versions	This version
Views	475	475
Downloads	228	228
Data volume	164.4 MB	164.4 MB
