Published March 13, 2026 | Version v1
Preprint | Open Access

STEM Truth Oracle: Log-Probability Multiple-Choice Ranking Reveals and Corrects Scale-Invariant Factual Biases

Authors/Creators

  • Independent

Description

We study a systematic failure mode in language models: when the true answer to a STEM question is surprising relative to training-data priors, models prefer plausible-sounding distractors over the correct answer. We build a 97-fact STEM benchmark spanning six domains (calculus, physics, chemistry, statistics, linear algebra, constants) and evaluate six models from GPT-2 (117M) to Qwen3-4B using log-probability multiple-choice ranking. Accuracy rises from 16% to 77% with scale, but systematic errors persist even at 4B parameters. We identify four bias patterns (positivity, linearity, missing-constant, truncation) that persist at every scale tested. A transfer-matrix experiment shows zero cross-pattern generalization from single-pattern adapters, while mixed training achieves 70-100% per-pattern accuracy. Log-probability margin is a perfect binary oracle: a positive margin predicts the correct answer with 100% precision and recall (zero false positives and zero false negatives on the 40-fact probe set). Margin magnitude tracks domain difficulty (statistics: mean margin -1.15, 60% accuracy; physics: +2.72, 85%). A length-normalization ablation confirms that sum log-probability scoring outperforms mean-per-token scoring. Targeted training on stubborn facts fixes the recoverable cases (1 of 4 fixed) and confirms that the remaining failures arise from genuine data contradictions, not insufficient model capacity.
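
For concreteness, here is a minimal sketch of log-probability multiple-choice ranking and the margin oracle, assuming a Hugging Face causal LM. The model name, the score_option helper, and the example question are illustrative placeholders, not the paper's released benchmark or code; options are scored by summed (not length-normalized) log-probability, matching the ablation's preferred setting.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # smallest model in the paper's scale sweep
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def score_option(prompt: str, option: str) -> float:
        """Sum of log-probabilities of the option tokens given the prompt.

        No length normalization: the paper's ablation prefers this sum
        over a mean-per-token score.
        """
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        # Position i of the logits predicts token i + 1.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        return sum(
            log_probs[pos, full_ids[0, pos + 1]].item()
            for pos in range(prompt_len - 1, full_ids.shape[1] - 1)
        )

    # Hypothetical fact item: true answer vs. a plausible distractor.
    prompt = "Q: What is the derivative of x^2?\nA:"
    correct, distractor = " 2x", " x^2 / 2"

    # Margin = score(correct) - score(best distractor). A positive sign
    # means the model ranks the true answer first; the paper reports this
    # sign acting as a perfect binary oracle on its 40-fact probe set.
    margin = score_option(prompt, correct) - score_option(prompt, distractor)
    print(f"margin = {margin:+.3f} -> {'correct' if margin > 0 else 'wrong'}")

Ranking over a full option set reduces to taking the argmax of score_option across the candidates; the margin is then the gap between the true answer's score and the best distractor's.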

Notes

Part of the rho-eval / knowledge-fidelity research program. Paper 9 of 9. Code available at https://github.com/SolomonB14D3/knowledge-fidelity

Files (592.0 kB)

stem_truth_oracle.pdf (592.0 kB)
md5:ed7c26ef1d70afdab701b77ea27ea653
