Published May 23, 2026 | Version v1
Preprint Open

EPOB: End-to-End Project Orchestration Benchmark - Evaluating Multi-Agent AI Frameworks

Authors/Creators

Description

Existing evaluations of LLM-based agents emphasize single-task capabilities such as question answering, code generation, tool use, or bounded workflow completion. Many real deployments, however, require a broader systems capability: carrying a project through decomposition, delegation, review, rework, and final delivery. In these settings, systems built on the same base model can behave very differently because they orchestrate work differently. We introduce the End-to-End Project Orchestration Benchmark (EPOB), a framework-centric benchmark for evaluating how effectively multi-agent AI systems execute the full lifecycle of structured projects. EPOB measures five dimensions of orchestration quality: Plan Quality, Assignment Quality, Coordination, Deliverable Quality, and Efficiency. To support reproducible and diagnostically useful evaluation, we define a project-instance schema, an execution-trace schema, a judge-report schema, and a hybrid rubric-plus-judge protocol. We also provide a seed task suite spanning ten scenario families, a ten-family live baseline slice, an initial same-model, same-resource baseline design, and a completed 180-cell, three-family, model-matched robustness package covering four model endpoints and five frameworks. EPOB is intended not only as a ranking instrument but also as a methodology for exposing failure modes in planning, delegation, handoff, review, and recovery that outcome-only evaluation does not reveal.
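As a rough illustration of how a per-run score over the five dimensions might be recorded, the sketch below uses a small Python dataclass. Only the five dimension names come from the description above. The class name `OrchestrationScores`, the 0 to 1 scale, the unweighted mean, and the helper methods are hypothetical and are not taken from the EPOB schemas, which may differ.

```python
from dataclasses import dataclass, fields
from statistics import mean


# Hypothetical container: the five field names mirror the dimensions named
# in the description; the [0, 1] scale is an assumption for illustration.
@dataclass(frozen=True)
class OrchestrationScores:
    plan_quality: float
    assignment_quality: float
    coordination: float
    deliverable_quality: float
    efficiency: float

    def __post_init__(self):
        # Reject values outside the assumed [0, 1] range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def weakest_dimension(self) -> str:
        """Name of the lowest-scoring dimension (first declared wins ties)."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name

    def unweighted_mean(self) -> float:
        """Simple average across dimensions; an assumed aggregate, not EPOB's."""
        return mean(getattr(self, f.name) for f in fields(self))


# Made-up example values, not benchmark results.
scores = OrchestrationScores(
    plan_quality=0.8,
    assignment_quality=0.6,
    coordination=0.4,
    deliverable_quality=0.9,
    efficiency=0.7,
)
print(scores.weakest_dimension())
print(round(scores.unweighted_mean(), 2))
```

Keeping the dimensions as separate fields, rather than storing only an aggregate, reflects the stated goal of diagnosis: a run with a strong deliverable but weak coordination stays visible instead of being averaged away.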

Files

epob_supplementary.zip

Files (2.7 MB)

Checksum  Size
md5:49cfaad7dbaf9c1c9b78e28be9148eaf  2.1 MB
md5:b06029f820f268168c86f3184d662cf7  623.7 kB