There is a newer version of the record available.

Published April 29, 2026 | Version v1
Preprint Open

Reasoning Chain Selection via Power Metric Health Signal P(t) as a Chain Quality Score on PRM800K: Real Empirical Evidence

Description

We apply the stochastic power metric P(t) = E(t) × W(t) as a chain-level quality signal for 
reasoning chain selection, evaluated on PRM800K (Lightman et al. 2023): 30,500 math 
reasoning chains with human step-level correctness labels. Here P(t) is computed from the 
human step-level labels, which serve as a proxy signal; real deployment would require a process 
reward model or a confidence proxy instead. In this setting, P(t) achieves Pearson r = 0.955 with 
chain quality and 100% in-sample classification accuracy at threshold θ = 0.65, compared to 
r = 0.529 and 68.7% accuracy for simple running accuracy. Last-5 step accuracy also achieves 
100% in-sample accuracy at θ = 0.80, but it relies only on the final five steps and discards 
full-chain trajectory dynamics. The P(t) separation between correct and error chains is +0.384, 
making it a reliable selection signal that integrates the full reasoning trajectory for best-of-N 
chain selection.
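
The two baselines and the evaluation quantities named above can be sketched in a few lines of Python. This is a minimal illustration on made-up toy chains, not the paper's code or data. The function names, the `>=` comparison at the threshold, and the treatment of chain quality as a binary label are all assumptions. P(t) itself is not implemented, because this description does not define E(t) or W(t).

```python
from statistics import mean

def running_accuracy(labels):
    """Baseline 1: fraction of steps labeled correct over the whole chain."""
    return mean(labels)

def last_k_accuracy(labels, k=5):
    """Baseline 2: fraction of correct steps among the final k steps."""
    return mean(labels[-k:])

def threshold_accuracy(scores, is_good, theta):
    """Share of chains where (score >= theta) agrees with the quality label.
    Assumption: ties at theta count as 'good'; the source does not specify."""
    hits = sum((s >= theta) == g for s, g in zip(scores, is_good))
    return hits / len(scores)

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy data: 1 = step labeled correct, 0 = step labeled incorrect.
toy_chains = [
    [1, 1, 1, 1, 1, 1],        # good chain
    [1, 1, 1, 1, 0, 0, 0, 0],  # bad chain, fails halfway
    [0, 1, 1, 1, 1, 1],        # good chain with an early slip
    [1] * 9 + [0, 0],          # bad chain that fails only at the end
]
toy_is_good = [True, False, True, False]

running_scores = [running_accuracy(c) for c in toy_chains]
last5_scores = [last_k_accuracy(c) for c in toy_chains]
```

On this toy set the long chain that fails only at the end keeps a high running accuracy, so a whole-chain average misclassifies it while the last-5 window does not. That mirrors the qualitative gap reported above, but the numbers are not comparable to the PRM800K results.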


This paper is the complement to Paper 2 (Cantrell 2026), which uses P(t) to stop bad chains 
early during generation. Paper 2 operates at the start of the pipeline; this paper operates at the 
end. Together they form a complete two-sided framework for test-time compute control: stop 
wasting compute on bad chains (Paper 2), and reliably select the best surviving chain (this 
paper). Both use the same mathematical framework applied at different points in the inference 
pipeline.
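
As a rough sketch of how the two sides could fit together, the Python below prunes chains whose running score drops below a stop threshold and then picks the highest-scoring survivor. Everything here is hypothetical: the function names, the `min_steps` warm-up, the threshold value, and the placeholder scorer `mean_score`, which is a plain average standing in for P(t) and is not the metric from either paper.

```python
from typing import Callable, List, Optional, Sequence

Scorer = Callable[[Sequence[float]], float]

def mean_score(step_signals: Sequence[float]) -> float:
    """Placeholder scorer: plain average of per-step signals (NOT P(t))."""
    return sum(step_signals) / len(step_signals)

def survives(step_signals: Sequence[float], score: Scorer,
             stop_theta: float, min_steps: int = 3) -> bool:
    """Early-stop side: reject the chain if the score of any prefix of
    length >= min_steps falls below stop_theta."""
    for t in range(min_steps, len(step_signals) + 1):
        if score(step_signals[:t]) < stop_theta:
            return False
    return True

def select_best(chains: List[Sequence[float]], score: Scorer,
                stop_theta: float) -> Optional[int]:
    """Selection side: index of the highest-scoring surviving chain,
    or None if every chain was stopped early."""
    survivors = [i for i, c in enumerate(chains)
                 if survives(c, score, stop_theta)]
    if not survivors:
        return None
    return max(survivors, key=lambda i: score(chains[i]))

# Hypothetical per-step signals (e.g. reward-model outputs) for N = 3 chains.
candidate_chains = [
    [0.9, 0.8, 0.2, 0.1],  # degrades and is stopped early
    [0.9, 0.9, 0.8, 0.9],  # consistently strong
    [0.7, 0.8, 0.7, 0.8],  # survives but scores lower
]
best_index = select_best(candidate_chains, mean_score, stop_theta=0.6)
```

In a real pipeline the early-stop check would run during generation rather than over finished chains. The sketch applies it to complete chains only to keep the example self-contained.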

Files

Paper_18_Chain.pdf (323.3 kB)
md5:b5dd92f19b5032d63a1ee1de4b14c7ea

Additional details