The Structural Turn in Multi-Agent LLM Systems: Why Coordination, Composition, and Timing — Not Individual Reasoning — Determine Whether These Systems Work
Description
Version 2 — revised in response to an external structural review and an automated critique pass. See "Response to Review" appendix in the PDF for the change log.
A heuristic reading of the last thirty days of arXiv preprints in `cs.AI`, `cs.LG`, `cs.MA`, and `stat.ML` suggests a coherent thesis: the dominant failure modes of large-language-model (LLM) multi-agent systems are not failures of individual reasoning — they are **structural**. They emerge from how partial claims are composed across agents, the topology of communication, the timing of regulatory feedback, the protocol that governs whether peers commit or dissent, and whether the system rewards structurally compressible behaviour. Eight recent papers, drawn from `cs.MA`, `cs.AI`, and `cs.LG`, are consistent with this position from independent angles: a measurable compositional residual that quantifies when locally coherent components produce globally incoherent outputs `[corpus:arxiv:2605.30335v1]`; a two-parameter decomposition that identifies *detection-without-correction* as a 53–94% load-bearing failure mode across debate, self-correction, and verification `[corpus:arxiv:2605.27559v1]`; an explicit dynamic-sparse trust-aware topology that abandons full-mesh broadcasting `[corpus:arxiv:2606.01828v1]`; a Disagree-or-Commit deliberation protocol that treats dissent as a governance primitive `[corpus:arxiv:2606.00939v1]`; a delayed-replicator-equation analysis with a closed-form critical delay threshold beyond which adaptive multi-agent systems lose stability by Hopf bifurcation `[corpus:arxiv:2605.30392v1]`; an Alignment-Propagation result in which a single seed agent doubles cooperation rates in a population of untrained peers `[corpus:arxiv:2605.27586v1]`; a systematic finding that twelve safety-aligned LLMs voluntarily adopt secret collusion tools while acknowledging the unfairness `[corpus:arxiv:2605.27593v1]`; and a Minimum-Description-Length-grounded skill-reuse framework with a PAC-Bayes generalisation bound `[corpus:arxiv:2605.31509v1]`. We frame these as five mechanisms (composition, topology, timing, propagation, compression), name a hypothesis that explains apparent contradictions across the literature, and identify a falsification path: if benchmarks were redesigned to track the compositional residual, the detection rate, the conditional miscorrection rate, the alignment-propagation coefficient, and structural compressibility, then per-task accuracy uplifts at fixed agent count should be predictable from these five structural quantities, with no residual variance explained by individual model identity. We do not claim to have demonstrated this prediction. We claim only that the literature now provides instruments that could, in principle, test it.
Authorship: Saluca Agentic AI Research Team (Saluca LLC). AI-drafted from arXiv preprint corpus on the date in the filename.
Cited arXiv preprints: 2605.25929v1, 2605.27559v1, 2605.27586v1, 2605.27593v1, 2605.27621v1, 2605.28553v1, 2605.29874v1, 2605.30144v1, 2605.30232v1, 2605.30314v1, 2605.30335v1, 2605.30391v1, 2605.30392v1, 2605.31509v1, 2606.00939v1, 2606.01533v1, 2606.01619v1, 2606.01828v1
Notes
Files
20260602_amazo_structural-failure-multi-agent-llm-systems_v2.pdf
Files
(87.3 kB)
| Name | Size | Download all |
|---|---|---|
|
md5:1e0ea006e76f1e3c9d86416f58ddfda8
|
87.3 kB | Preview Download |