EPOB: End-to-End Project Orchestration Benchmark - Evaluating Multi-Agent AI Frameworks

Scott, Nathan

doi:10.5281/zenodo.20349946

Published May 23, 2026 | Version v1

Preprint | Open access

Existing evaluations of LLM-based agents emphasize single-task capabilities such as question answering, code generation, tool use, or bounded workflow completion. Many real deployments, however, require a broader systems capability: carrying a project through decomposition, delegation, review, rework, and final delivery. In these settings, systems built on the same base model can behave very differently because they orchestrate work differently. We introduce the End-to-End Project Orchestration Benchmark (EPOB), a framework-centric benchmark for evaluating how effectively multi-agent AI systems execute the lifecycle of structured projects. EPOB measures five dimensions of orchestration quality: Plan Quality, Assignment Quality, Coordination, Deliverable Quality, and Efficiency. To support reproducible and diagnostically useful evaluation, we define three schemas (project instance, execution trace, and judge report) and a hybrid rubric-plus-judge protocol. We also provide a seed task suite spanning ten scenario families, a ten-family live baseline slice, an initial same-model, same-resource baseline design, and a completed 180-cell, three-family, model-matched robustness package covering four model endpoints and five frameworks. EPOB is intended not only as a ranking instrument but also as a methodology for exposing failure modes in planning, delegation, handoff, review, and recovery that outcome-only evaluation does not reveal.
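To make the diagnostic argument concrete, the following is a minimal sketch of a per-run score record covering the five dimensions named in the abstract. It is illustrative only: the class name `OrchestrationScores`, the field names, the [0, 1] score range, and the unweighted mean are all assumptions made here, and the actual judge-report schema and aggregation rule are defined in the paper and supplementary package, which may differ.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class OrchestrationScores:
    """Hypothetical per-run scores, one per EPOB dimension.

    Assumption: each score is normalized to [0, 1]. The real EPOB
    judge-report schema may use different names, scales, or weights.
    """
    plan_quality: float
    assignment_quality: float
    coordination: float
    deliverable_quality: float
    efficiency: float

    def __post_init__(self):
        # Reject scores outside the assumed [0, 1] range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def weakest_dimension(self) -> str:
        # Name of the lowest-scoring dimension, i.e. the likely failure mode.
        return min(fields(self), key=lambda f: getattr(self, f.name)).name

    def unweighted_mean(self) -> float:
        # Assumed aggregate: a plain average over the five dimensions.
        return mean(getattr(self, f.name) for f in fields(self))


run = OrchestrationScores(
    plan_quality=0.8,
    assignment_quality=0.6,
    coordination=0.4,
    deliverable_quality=0.9,
    efficiency=0.7,
)
print(run.weakest_dimension())
print(round(run.unweighted_mean(), 2))
```

In this made-up run the deliverable scores well while coordination lags, which is the kind of gap an outcome-only score would hide and a per-dimension report would surface.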

Files

Files (2.7 MB)

Name	Size	MD5
epob_supplementary.zip	2.1 MB	49cfaad7dbaf9c1c9b78e28be9148eaf
epob_v1.pdf	623.7 kB	b06029f820f268168c86f3184d662cf7