A Validation and Governance Framework for Multi-Agent LLM Scientific Software Development

Bass, Tim

doi:10.5281/zenodo.20152238

Published May 12, 2026 | Version Preprint v0.9.0

Preprint Open

Large language models are non-deterministic systems whose outputs vary across runs, model versions, and context configurations. Existing benchmarks for LLM code generation evaluate correctness against synthetic test suites or competitive programming problems, not against peer-reviewed scientific data. This paper presents quantum_bench, a controlled multi-agent experiment in which generated code is validated against analytical values from Griffiths and Schroeter (2018), a standard graduate-level quantum mechanics reference. The experiment implements exact analytical solutions to five Tier 2 applied quantum mechanics problems in pure Ruby, using a two-agent LLM architecture: Claude as architect (prompt designer) and Codex as coder (Ruby implementer), with a human principal investigator as the non-delegable evaluator at each of 13 development gates. The primary finding is not about quantum mechanics. It is about the multi-agent workflow itself: Claude, acting as architect, repeatedly hallucinated experiment goals that were never stated, substituted its own interpretations despite explicit correction, and directed Codex down architecturally wrong paths. Codex performed correctly throughout, implementing what each prompt specified. In this Claude-as-architect, Codex-as-coder configuration, the architect role was the dominant source of failures, not the coder role. Claude errors are documented in five groups ordered by severity: goal substitution, incomplete refactors, context loss, prompt design gaps, and process violations, totaling 21 architect-level errors across 13 gates against zero architectural errors from Codex. All five quantum mechanics problems ultimately pass validation against Griffiths and Schroeter values. Governance and control methods based on experimental lessons learned are also summarized.
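The validation step described in the abstract, comparing generated Ruby against analytical values from a reference text, can be sketched as below. This is only an illustration under stated assumptions: the abstract does not name the five problems, so the infinite square well (whose energy formula E_n = n^2 pi^2 hbar^2 / (2 m L^2) is standard) stands in for one of them. The names `infinite_well_energy`, `within_tolerance?`, `validate`, and `REFERENCE_CASES` are hypothetical and are not taken from the quantum_bench repository, and the reference value is a hand-computed stand-in, not a number transcribed from the paper or from Griffiths and Schroeter.

```ruby
# Hypothetical sketch of one validation gate; not the quantum_bench implementation.

HBAR          = 1.054571817e-34   # reduced Planck constant, J*s (CODATA 2018)
ELECTRON_MASS = 9.1093837015e-31  # electron mass, kg (CODATA 2018)
EV            = 1.602176634e-19   # joules per electronvolt (exact SI value)

# Analytical energy of level n for a particle of the given mass in an
# infinite square well of the given width, in joules.
def infinite_well_energy(n, mass:, width:)
  unless n.is_a?(Integer) && n.positive?
    raise ArgumentError, "n must be a positive integer"
  end
  (n**2 * Math::PI**2 * HBAR**2) / (2.0 * mass * width**2)
end

# Relative-tolerance comparison between a computed and a reference value.
def within_tolerance?(computed, reference, rel_tol:)
  (computed - reference).abs <= rel_tol * reference.abs
end

# Assumed reference table: an electron in a 1 nm well. The reference energy
# is an illustrative hand-computed value rounded to four decimal places.
REFERENCE_CASES = [
  { n: 1, width: 1.0e-9, reference_ev: 0.3760, rel_tol: 1.0e-3 }
].freeze

# Runs every case and reports the computed value and pass/fail status.
def validate(cases)
  cases.map do |c|
    computed_ev = infinite_well_energy(c[:n], mass: ELECTRON_MASS, width: c[:width]) / EV
    passed = within_tolerance?(computed_ev, c[:reference_ev], rel_tol: c[:rel_tol])
    { n: c[:n], computed_ev: computed_ev, pass: passed }
  end
end

validate(REFERENCE_CASES).each do |result|
  status = result[:pass] ? "PASS" : "FAIL"
  puts format("n=%d  E=%.6f eV  %s", result[:n], result[:computed_ev], status)
end
```

Keeping the reference table separate from the solver mirrors the role the abstract assigns to the human evaluator: the pass criterion is fixed by the reference data, not by whichever agent wrote the code.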

Files

Name	Size	Checksum
IAIT2026_Preprint_10.5281:zenodo.20152238,pdf.pdf	188.1 kB	md5:1a84864134f1ab685de758ebe46d2de3

Additional details

Continues: Software: 10.5281/zenodo.19467178 (DOI)
Is supplemented by: Report: 10.5281/zenodo.19438177 (DOI); Report: 10.5281/zenodo.19414914 (DOI)

Repository URL: https://github.com/unixneo/quantum_bench.git
Programming language: Ruby
