There is a newer version of the record available.

Published May 12, 2026 | Version Preprint v0.9.0

A Validation and Governance Framework for Multi-Agent LLM Scientific Software Development

Authors/Creators

Description

Large language models are non-deterministic systems whose outputs vary across runs, model versions, and context configurations. Existing benchmarks for LLM code generation evaluate correctness against synthetic test suites or competitive programming problems, not against peer-reviewed scientific data. This paper presents quantum_bench, a controlled multi-agent experiment in which generated code is validated against analytical values from Griffiths and Schroeter (2018), a standard graduate-level quantum mechanics reference. The experiment implements exact analytical solutions to five Tier 2 applied quantum mechanics problems in pure Ruby, using a two-agent LLM architecture: Claude as architect (prompt designer) and Codex as coder (Ruby implementer), with a human principal investigator as the non-delegable evaluator at each of 13 development gates.

The primary finding is not about quantum mechanics but about the multi-agent workflow itself. Claude, acting as architect, repeatedly hallucinated experiment goals that were never stated, substituted its own interpretations despite explicit correction, and directed Codex down architecturally wrong paths. Codex performed correctly throughout, implementing what each prompt specified. In this Claude-as-architect, Codex-as-coder configuration, the architect role, not the coder role, was the dominant source of failures. Claude's errors are documented in five groups ordered by severity: goal substitution, incomplete refactors, context loss, prompt design gaps, and process violations. They total 21 architect-level errors across 13 gates, against zero architectural errors from Codex.

All five quantum mechanics problems ultimately pass validation against the Griffiths and Schroeter values. The paper also summarizes governance and control methods derived from the lessons learned in the experiment.
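To make the idea of validating generated code against a textbook analytical value concrete, here is a minimal sketch in pure Ruby. It is not taken from the quantum_bench repository: the choice of problem (the hydrogen ground-state energy), the method names `hydrogen_energy_ev` and `within_tolerance?`, and the relative tolerance of 1e-3 are all assumptions made for illustration, and the problem is not necessarily one of the five covered by the paper. The reference value of -13.6 eV is the familiar rounded textbook figure for the hydrogen ground state.

```ruby
# Hypothetical validation gate: compare a computed analytical value
# against a published reference value within a relative tolerance.
# Names and tolerance are illustrative, not from quantum_bench.

# CODATA 2018 constants in SI units.
ELECTRON_MASS       = 9.1093837015e-31  # kg
ELEMENTARY_CHARGE   = 1.602176634e-19   # C
VACUUM_PERMITTIVITY = 8.8541878128e-12  # F/m
HBAR                = 1.054571817e-34   # J s

# Bohr energy of hydrogen level n, in electronvolts:
#   E_n = -(m / (2 hbar^2)) * (e^2 / (4 pi eps0))^2 / n^2
def hydrogen_energy_ev(n)
  unless n.is_a?(Integer) && n.positive?
    raise ArgumentError, "n must be a positive integer"
  end

  coulomb = ELEMENTARY_CHARGE**2 / (4.0 * Math::PI * VACUUM_PERMITTIVITY)
  joules  = -ELECTRON_MASS * coulomb**2 / (2.0 * HBAR**2 * n**2)
  joules / ELEMENTARY_CHARGE # convert J to eV
end

# True when computed agrees with reference to within rel_tol (relative).
def within_tolerance?(computed, reference, rel_tol)
  (computed - reference).abs <= rel_tol * reference.abs
end

# Rounded textbook reference value (assumed tolerance: 0.1 percent).
REFERENCE_E1_EV = -13.6

computed = hydrogen_energy_ev(1)
status   = within_tolerance?(computed, REFERENCE_E1_EV, 1e-3) ? "PASS" : "FAIL"
puts format("E1 = %.4f eV (reference %.1f eV): %s",
            computed, REFERENCE_E1_EV, status)
```

Using a relative rather than absolute tolerance keeps the check meaningful across quantities of very different magnitude. Because the reference value is rounded to three significant figures, the tolerance has to be loose enough to absorb that rounding; a stricter gate would need a more precise reference.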

Files

IAIT2026_Preprint_10.5281:zenodo.20152238,pdf.pdf (188.1 kB)
md5:1a84864134f1ab685de758ebe46d2de3

Additional details

Related works

Continues
Software: 10.5281/zenodo.19467178 (DOI)
Is supplemented by
Report: 10.5281/zenodo.19438177 (DOI)
Report: 10.5281/zenodo.19414914 (DOI)

Software

Repository URL
https://github.com/unixneo/quantum_bench.git
Programming language
Ruby