Making LLMs Say What They Know: Probe-Targeted Fine-Tuning for Verbal Confidence Calibration
Authors/Creators
Description
Background: Verbal confidence in instruct-tuned large language models fails because the readout pathway from internal correctness representations to the confidence-token position transmits little of the available signal. Linear probes on hidden states discriminate correct from incorrect responses at AUROC2 = 0.76–0.88 across seven of eight models spanning four families and three scales (the exception, Qwen 72B, is a probe outlier), yet verbal confidence saturates near ceiling (mean ~95–99%).
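To make the discrimination metric concrete, the following is a minimal sketch of an AUROC computation, assuming AUROC2 denotes the area under the ROC curve when a confidence score is used to separate correct from incorrect responses. The function name `auroc` and all numbers are illustrative assumptions, not data or code from the paper.

```python
def auroc(scores, labels):
    """Probability that a randomly chosen correct item (label 1) receives a
    higher score than a randomly chosen incorrect item (label 0).
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic illustration only: a probe-like score that varies with correctness
# versus a verbal confidence that sits near ceiling for almost every item.
correct = [1, 1, 0, 1, 0, 0]
probe_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.7]
verbal_conf = [0.99, 0.99, 0.99, 0.99, 0.95, 0.99]

probe_auroc = auroc(probe_scores, correct)
verbal_auroc = auroc(verbal_conf, correct)
print(f"probe AUROC:  {probe_auroc:.3f}")
print(f"verbal AUROC: {verbal_auroc:.3f}")
```

In this toy data the saturated verbal scores produce many ties, which pulls the AUROC toward chance even though the underlying correctness pattern is the same.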
Objectives: We introduce probe-targeted confidence-calibrated supervised fine-tuning (PT-CSFT), which uses a linear probe on a model's own hidden states to generate continuous confidence targets for LoRA fine-tuning. We evaluate whether this method closes the metacognitive gap across model families, scales, and cognitive domains.
Methods: PT-CSFT modifies attention key and output projections and the three MLP projections via LoRA while leaving queries, values, and the language model head unchanged. We evaluate on eight models (7B–72B, four families) across TriviaQA, GSM8K, and ARC-Challenge. A two-stage curriculum addresses the 70B regime. The logit readout reads softmax distributions over confidence tokens as a continuous signal. Pre-registration: OSF Phase 1 (https://osf.io/ngkwc/), Phase 0 (https://osf.io/mpcr5).
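One plausible way to turn the softmax distribution over confidence tokens into a single continuous value is to take its expectation. The sketch below assumes a hypothetical set of five confidence tokens mapped to the values 0.0 through 1.0; the token set, the function names, and the use of an expectation are assumptions for illustration, not the paper's documented procedure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_readout(conf_logits, conf_values):
    """Expected confidence under the softmax over confidence-token logits.

    conf_logits: logits at the confidence-token position, restricted to the
                 candidate confidence tokens (hypothetical token set).
    conf_values: numeric confidence each candidate token stands for.
    """
    if len(conf_logits) != len(conf_values):
        raise ValueError("one value is required per confidence token")
    probs = softmax(conf_logits)
    return sum(p * v for p, v in zip(probs, conf_values))


# Hypothetical mapping: five confidence tokens meaning 0%, 25%, 50%, 75%, 100%.
conf_values = [0.0, 0.25, 0.5, 0.75, 1.0]

uniform = logit_readout([0.0, 0.0, 0.0, 0.0, 0.0], conf_values)
peaked = logit_readout([-2.0, -1.0, 0.0, 1.0, 5.0], conf_values)
print(f"uniform logits -> {uniform:.3f}")
print(f"peaked logits  -> {peaked:.3f}")
```

Unlike decoding the single most likely token, this readout still varies when the top token is the same across items, because the probability mass on the remaining tokens shifts the expectation.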
Results: PT-CSFT recovers 91–115% of probe discrimination in verbal confidence at 7–32B. At 70B, a two-stage curriculum closes 66% of the gap on the logit channel (AUROC2 = 0.797), the first VRS-Valid confidence signal at 70B. On GSM8K, the logit readout achieves AUROC2 = 0.862 ± 0.013 across 10 seeds, exceeding the probe. Post-hoc Platt scaling yields ECE = 0.042. The signal transfers out-of-distribution (NQ: 0.757, PopQA: 0.834) and enables confidence-gated retrieval (2.17x accuracy differential).
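For readers unfamiliar with the calibration metric, here is a minimal sketch of Platt scaling followed by a binned expected calibration error (ECE). It assumes equal-width bins and a logistic map with parameters `a` and `b` that would be fit on held-out data (the fitting step is not shown). The bin count, function names, and example values are illustrative assumptions, not the paper's settings or results.

```python
import math


def platt(score, a, b):
    """Platt scaling: map a raw score to a probability with a logistic curve.
    The parameters a and b are assumed to be fit elsewhere on held-out data."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |mean confidence - accuracy| per bin.
    Confidences are assumed to lie in [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


# Synthetic illustration: the same answers scored with saturated confidence
# and with confidence that matches the observed accuracy.
correct = [1, 1, 0, 1]
ece_saturated = expected_calibration_error([0.99] * 4, correct)
ece_matched = expected_calibration_error([0.75] * 4, correct)
print(f"saturated: {ece_saturated:.3f}, matched: {ece_matched:.3f}")
```

Platt scaling is monotonic when `a` is positive, so it can lower ECE without changing the ranking of items and therefore without changing AUROC.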
Conclusions: Controlled activation patching at the confidence-token position supports the interpretation that verbal confidence failure is a routing problem. The intervention is position-specific (mid-question control: chance), bidirectional (91% forward, 89% reverse), selective (83% of answers unchanged), and follows a near-monotonic layer-depth gradient (Spearman rho = 0.976, p < 1e-4). These observations are consistent with PT-CSFT repairing a position-specific pathway for confidence without disrupting answer computation, though they do not trace the effect to specific model components. Multi-seed replication across three model families confirms stability (Llama 8B: 0.836 ± 0.011; Qwen 7B: 0.801 ± 0.022; Mistral 7B: 0.771 ± 0.017; 6 seeds each). Across the models tested, the logit readout recovers a usable confidence signal where text confidence fails.
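The layer-depth gradient is summarized with a Spearman rank correlation. As a worked illustration of that statistic, the sketch below computes Spearman's rho as the Pearson correlation of average ranks. The layer indices and effect sizes are made-up placeholder values, not measurements from the paper, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values: patching effect rises with depth except for one swap.
layers = [0, 4, 8, 12, 16, 20]
effect = [0.05, 0.10, 0.30, 0.28, 0.60, 0.90]
rho = spearman_rho(layers, effect)
print(f"Spearman rho = {rho:.3f}")
```

A single out-of-order pair among six layers already lowers rho below 1, which is why a value close to 1 is read as a near-monotonic rather than strictly monotonic trend.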
Files
| Name | Size | Checksum |
|---|---|---|
| ptcsft_arxiv.pdf | 620.4 kB | md5:e8bcdaebe8011bb81123524a8292956c |
Additional details
Related works
- Is supplemented by
  - Other: https://osf.io/ngkwc/ (URL)
  - Other: https://osf.io/mpcr5 (URL)
  - Software: https://github.com/synthiumjp/metacog-engineering (URL)
Dates
- Available: 2026
Software
- Repository URL: https://github.com/synthiumjp/metacog-engineering
- Programming language: Python
- Development Status: Active