Published October 6, 2025 | Version v2

Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

  • 1. RAI - Radiotelevisione Italiana
  • 2. RAI - CRITS

Description

Even when decoding with temperature T=0, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of this nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of background temperature Tbg: the effective temperature induced by an implementation-dependent perturbation process that is observed even when the nominal temperature is T=0. We give precise definitions, show how Tbg relates to a stochastic perturbation governed by the inference environment I, and propose an empirical protocol to estimate Tbg via the equivalent temperature Tn(I) of an ideal reference system. We conclude with a set of pilot experiments, run on a representative pool of models from the major LLM providers, that demonstrate the idea, and we outline implications for reproducibility, evaluation, and deployment.
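The description above only summarises the estimation protocol, so the following is a minimal sketch of the general idea rather than the authors' procedure. It assumes that the observable is the pairwise disagreement rate between repeated outputs collected at nominal T=0, and that the ideal reference system is a toy softmax sampler over fixed logits. All function names, the example logits, and the bisection search are illustrative assumptions.

```python
import math
from itertools import combinations


def softmax(logits, temperature):
    """Softmax probabilities at a strictly positive temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reference_disagreement(logits, temperature):
    """Probability that two independent samples from the ideal
    reference sampler differ: 1 - sum(p_i ** 2)."""
    p = softmax(logits, temperature)
    return 1.0 - sum(q * q for q in p)


def observed_disagreement(outputs):
    """Fraction of unordered pairs of repeated outputs that differ."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(a != b for a, b in pairs) / len(pairs)


def equivalent_temperature(observed, logits, t_max=5.0, tol=1e-9):
    """Bisection for the temperature at which the reference sampler
    reproduces the observed disagreement (hypothetical estimator).

    Relies on the reference disagreement increasing with temperature.
    """
    if observed > reference_disagreement(logits, t_max):
        raise ValueError("observed disagreement exceeds reference at t_max")
    lo, hi = 0.0, t_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if reference_disagreement(logits, mid) < observed:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical data: four repeated completions at nominal T=0.
runs = ["Paris", "Paris", "Paris", "Lyon"]
toy_logits = [2.0, 1.0, 0.0]  # assumed reference logits, not from the paper
rate = observed_disagreement(runs)
t_equiv = equivalent_temperature(rate, toy_logits)
```

In this toy setting, `t_equiv` plays the role of an equivalent temperature: the temperature at which the reference sampler would disagree with itself as often as the observed system does at nominal T=0. A real estimate would depend on the choice of reference system and of divergence measure, neither of which is specified in this record.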

Files

Background_Temperature_in_LLMs___arxiV.pdf

md5:8e0e444d2d762a190436e75d6ff21a2e
854.3 kB

Additional details

References

  • Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J. Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, Zhe Wu, Lixinyu Xu, and Breck Baldwin. Non-determinism of "deterministic" LLM settings. arXiv preprint, arXiv:2408.04667, 2025.
  • Horace He and Thinking Machines Lab. Defeating nondeterminism in LLM inference. Thinking Machines Lab blog, 2025.
  • Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics.
  • Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. An empirical study of the non-determinism of ChatGPT in code generation. arXiv preprint, arXiv:2308.02828.
  • S. Price and D. L. Cote. Document analysis with LLMs: Assessing performance, bias, and nondeterminism in decision making. In ICPRAM 2025: Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods, pages 207–214, 2025.
  • Nikita Ravi, Abhinav Goel, James C. Davis, and George K. Thiruvathukal. Improving the reproducibility of deep learning software: An initial investigation through a case study analysis. arXiv preprint, arXiv:2505.03165, 2025.
  • Sanjif Shanmugavelu, Mathieu Taillefumier, Christopher Culver, Oscar Hernandez, Mark Coletti, and Ada Sedova. Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications. arXiv preprint, arXiv:2408.05148, 2024.
  • Yifan Song, Guoyin Wang, Sujian Li, and Bill Yuchen Lin. Evaluation of LLMs should not ignore non-determinism. arXiv preprint, arXiv:2407.10457, 2024.