A Validation and Governance Framework for Multi-Agent LLM Scientific Software Development
Authors/Creators
Description
Large language models are non-deterministic systems whose outputs vary across runs, model versions, and context configurations. Existing benchmarks for LLM code generation evaluate correctness against synthetic test suites or competitive programming problems, not against peer-reviewed scientific data. This paper presents quantum_bench, a controlled multi-agent experiment in which generated code is validated against analytical values from Griffiths and Schroeter (2018), a standard graduate-level quantum mechanics reference. The experiment implements exact analytical solutions to five Tier 2 applied quantum mechanics problems in pure Ruby, using a two-agent LLM architecture: Claude as architect (prompt designer) and Codex as coder (Ruby implementer), with a human principal investigator as the non-delegable evaluator at each of 13 development gates. The primary finding is not about quantum mechanics. It is about the multi-agent workflow itself: Claude, acting as architect, repeatedly hallucinated experiment goals that were never stated, substituted its own interpretations despite explicit correction, and directed Codex down architecturally wrong paths. Codex performed correctly throughout, implementing what each prompt specified. In this Claude-as-architect, Codex-as-coder configuration, the architect role was the dominant source of failures, not the coder role. Claude errors are documented in five groups ordered by severity: goal substitution, incomplete refactors, context loss, prompt design gaps, and process violations, totaling 21 architect-level errors across 13 gates against zero architectural errors from Codex. All five quantum mechanics problems ultimately pass validation against Griffiths and Schroeter values. Governance and control methods based on experimental lessons learned are also summarized.
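The validation step described above, comparing generated code against analytical textbook values, can be illustrated with a minimal Ruby sketch. This is not code from the quantum_bench repository. The choice of problem (the infinite square well, a standard textbook case), the method names `square_well_energy` and `within_tolerance?`, the 1 nm well width, and the tolerance are all assumptions made for illustration; the description does not name the five problems or the acceptance criteria actually used.

```ruby
# Hypothetical sketch: not taken from the quantum_bench repository.
# Illustrates checking an analytical result against a reference value.

HBAR          = 1.054571817e-34   # reduced Planck constant, J*s
ELECTRON_MASS = 9.1093837015e-31  # electron mass, kg
EV            = 1.602176634e-19   # joules per electronvolt

# Analytical energy of level n for a particle of the given mass in an
# infinite square well of the given width: E_n = n^2 pi^2 hbar^2 / (2 m a^2).
def square_well_energy(n, mass:, width:)
  (n**2 * Math::PI**2 * HBAR**2) / (2.0 * mass * width**2)
end

# Pass/fail check of a computed value against a reference value,
# using a relative tolerance (the tolerance here is an assumed default).
def within_tolerance?(computed, reference, rel_tol: 1e-9)
  (computed - reference).abs <= rel_tol * reference.abs
end

width = 1.0e-9 # assumed well width of 1 nm
e1 = square_well_energy(1, mass: ELECTRON_MASS, width: width)
e2 = square_well_energy(2, mass: ELECTRON_MASS, width: width)

# The analytical ratio E_2 / E_1 is exactly 4, independent of mass and width.
puts "E_1 = #{(e1 / EV).round(4)} eV"
puts "E_2 / E_1 matches reference: #{within_tolerance?(e2 / e1, 4.0)}"
```

A check of this shape gives the human evaluator a binary result per gate. The reference values themselves would come from the cited textbook rather than from either agent.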
Files
| Name | Size | Checksum |
|---|---|---|
| IAIT2026_Preprint_10.5281:zenodo.20152238,pdf.pdf | 188.1 kB | md5:1a84864134f1ab685de758ebe46d2de3 |
Additional details
Related works
- Continues
  - Software: 10.5281/zenodo.19467178 (DOI)
- Is supplemented by
  - Report: 10.5281/zenodo.19438177 (DOI)
  - Report: 10.5281/zenodo.19414914 (DOI)
Software
- Repository URL
  - https://github.com/unixneo/quantum_bench.git
- Programming language
  - Ruby