Published April 21, 2026 | Version v2
Preprint Open

Monitoring Verifier Health in Test-Time Scaling Using Stochastic Power Metrics

Description

Test-time scaling methods such as LLM-as-a-Verifier (Mirhoseini et al., 2026) improve answer selection quality by using log-probability rank signals to score candidate outputs. These methods assume the verifier remains reliably discriminative throughout the sampling process. We identify a gap: no existing method monitors whether the verifier is currently healthy, that is, whether it is still producing meaningful discriminative signal or has begun to plateau, drift, or produce flat rankings. This paper proposes applying the stochastic power metric P(t) = E(t) × W(t) as a real-time verifier health signal. E(t) measures whether the verifier's current score spread exceeds its own adaptive expected spread. W(t) measures consistency of that outperformance. When P(t) drops below a threshold, the verifier has lost discrimination power and continued sampling yields diminishing returns. In a stylized simulation calibrated to published TerminalBench 2.0 results, the power metric correctly identifies verifier plateau states and reduces unnecessary candidate generation by 84–96% with quality scores of 0.944–0.976 relative to full-budget verification. This framing is consistent with sequential decision-making theory: the verifier health signal is an instance of the Resource Commitment Principle applied to the verification layer of test-time scaling.
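
To make the health signal concrete, the following is a minimal Python sketch of one way P(t) = E(t) × W(t) could be tracked across sampling rounds. The description above does not give the exact formulas, so everything specific here is an assumption made for illustration: score spread is taken as the population standard deviation of a batch of verifier scores, the adaptive expected spread is an exponential moving average, E(t) is the ratio of current to expected spread, W(t) is the fraction of recent rounds in which that ratio exceeded 1, and the class name `VerifierHealthMonitor`, its parameters, and the default threshold are hypothetical.

```python
from collections import deque
from statistics import pstdev


class VerifierHealthMonitor:
    """Illustrative tracker for P(t) = E(t) * W(t).

    All formulas and defaults are assumptions, not the paper's definitions.
    """

    def __init__(self, alpha=0.2, window=5, threshold=0.5):
        self.alpha = alpha          # EMA rate for the expected spread (assumed)
        self.threshold = threshold  # stopping threshold on P(t) (assumed)
        self.expected = None        # adaptive expected spread
        self.wins = deque(maxlen=window)  # recent "spread beat expectation" flags

    def update(self, scores):
        """Consume one batch of verifier scores and return P(t)."""
        spread = pstdev(scores)
        if self.expected is None:
            # Warm-up: the first batch only seeds the baseline and is
            # treated as healthy (an assumption of this sketch).
            self.expected = spread
            return 1.0
        # E(t): current spread relative to the adaptive expected spread.
        e = spread / self.expected if self.expected > 0 else 0.0
        # W(t): share of recent rounds in which the spread beat expectation.
        self.wins.append(1.0 if e > 1.0 else 0.0)
        w = sum(self.wins) / len(self.wins)
        # Update the baseline only after scoring the current round.
        self.expected = (1 - self.alpha) * self.expected + self.alpha * spread
        return e * w

    def is_healthy(self, p):
        """True while P(t) is at or above the threshold."""
        return p >= self.threshold


if __name__ == "__main__":
    monitor = VerifierHealthMonitor()
    batches = [[0.1, 0.9], [0.0, 1.0], [0.5, 0.5]]  # last batch is a flat ranking
    for t, batch in enumerate(batches):
        p = monitor.update(batch)
        print(t, round(p, 3), monitor.is_healthy(p))
```

In this sketch a sampling loop would call `update` after each round of candidate scoring and stop generating candidates once `is_healthy` returns false. A flat batch of scores has zero spread, so E(t) and therefore P(t) collapse to zero, which corresponds to the plateau condition described above.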

Files

Paper_17_FINAL-4_260421_205326.pdf
