Published March 13, 2026 | Version 1.1
Preprint · Open Access

STEM Truth Oracle: Log-Probability Multiple-Choice Ranking Reveals and Corrects Scale-Invariant Factual Biases

Authors/Creators

  • Independent

Description

We study a systematic failure mode in language models: when the true answer to a STEM question is surprising relative to training-data priors, models prefer plausible-sounding distractors over the correct answer. We build a 97-fact STEM benchmark spanning six domains (calculus, physics, chemistry, statistics, linear algebra, constants) and evaluate six models from GPT-2 (117M) to Qwen3-4B using log-probability multiple-choice ranking. Accuracy rises from 16% to 77% with scale, but systematic errors persist even at 4B parameters. We identify four scale-invariant bias patterns (positivity, linearity, missing-constant, truncation) that appear at all scales. A transfer matrix experiment shows zero cross-pattern generalization from single-pattern adapters; mixed training achieves 70-100% per-pattern accuracy. Log-probability margin is a perfect binary oracle: positive margin predicts correct answer with 100% precision and recall on the 40-fact probe set. Margin magnitude tracks domain difficulty.
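The scoring method described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: each candidate answer is scored by the summed log-probability of its tokens given the question, the highest-scoring option is chosen, and the margin over the runner-up serves as the confidence signal. The toy `logprobs` table stands in for a real language model's per-token log-probabilities.

```python
import math

def sequence_logprob(tokens, logprobs):
    # Sum per-token log-probabilities. In practice these would come from
    # a forward pass of an LM over "question + candidate answer"; here a
    # lookup table stands in, with a small floor for unseen tokens.
    return sum(logprobs.get(t, math.log(1e-6)) for t in tokens)

def rank_options(question_tokens, options, logprobs):
    # Score every candidate continuation, pick the argmax, and report the
    # margin between the best and second-best scores.
    scores = {name: sequence_logprob(question_tokens + toks, logprobs)
              for name, toks in options.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    margin = ranked[0][1] - ranked[1][1]
    return ranked[0][0], margin, scores

# Hypothetical example: the model assigns higher probability to the
# correct value, so the margin is positive.
logprobs = {"speed": -1.0, "of": -0.5, "light": -1.2,
            "3e8": -2.0, "3e5": -5.0}
choice, margin, _ = rank_options(
    ["speed", "of", "light"],
    {"correct": ["3e8"], "distractor": ["3e5"]},
    logprobs,
)
```

Under the paper's oracle claim, a positive `margin` would predict a correct answer; the biases described above correspond to cases where plausible distractors receive higher log-probability than the true answer.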

v1.1 changes: Expanded limitations section, replaced informal self-references with DOI citations, strengthened abstract opening, added GitHub link.

Notes

Part of the rho-eval / knowledge-fidelity research program. Paper 9 of 9. Code: https://github.com/SolomonB14D3/knowledge-fidelity

Files

stem_truth_oracle.pdf (502.4 kB, md5:3d97c93e936d25538c69d0c202d2caeb)
