EPOB: End-to-End Project Orchestration Benchmark - Evaluating Multi-Agent AI Frameworks
Authors/Creators
Description
Existing evaluations of LLM-based agents emphasize single-task capability, such as question answering, code generation, tool use, or bounded workflow completion. Many real deployments, however, require a broader systems capability: executing a project through decomposition, delegation, review, rework, and final delivery. In these settings, systems built on the same base model can behave very differently because they orchestrate work differently. We introduce the End-to-End Project Orchestration Benchmark (EPOB), a framework-centric benchmark for evaluating how effectively multi-agent AI systems execute the lifecycle of structured projects. EPOB measures five dimensions of orchestration quality: Plan Quality, Assignment Quality, Coordination, Deliverable Quality, and Efficiency. To support reproducible and diagnostically useful evaluation, we define a project-instance schema; an execution-trace schema; a judge-report schema; a hybrid rubric-plus-judge protocol; a seed task suite spanning ten scenario families; a ten-family live baseline slice; an initial same-model, same-resource baseline design; and a completed 180-cell, three-family, model-matched robustness package over four model endpoints and five frameworks. EPOB is intended not only as a ranking instrument but also as a methodology for exposing failure modes in planning, delegation, handoff, review, and recovery that outcome-only evaluation does not reveal.
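To make the diagnostic use of the five dimensions concrete, the following is a minimal sketch of a per-run score record. It is not taken from the EPOB schemas: the class name `OrchestrationScores`, the field names, the [0, 1] score range, and the unweighted mean are all assumptions chosen for illustration, and the numbers in the usage lines are made up.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class OrchestrationScores:
    """Hypothetical record of the five EPOB dimensions for one run.

    Field names and the [0, 1] range are illustrative assumptions,
    not the benchmark's actual judge-report schema.
    """
    plan_quality: float
    assignment_quality: float
    coordination: float
    deliverable_quality: float
    efficiency: float

    def __post_init__(self) -> None:
        # Reject values outside the assumed [0, 1] range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def weakest_dimension(self) -> str:
        # Name of the lowest-scoring dimension (first one on ties).
        return min(fields(self), key=lambda f: getattr(self, f.name)).name

    def unweighted_mean(self) -> float:
        # Equal weighting is an assumption; the benchmark may weight differently.
        return mean(getattr(self, f.name) for f in fields(self))


# Two made-up runs with the same deliverable score but different process quality.
run_a = OrchestrationScores(0.8, 0.6, 0.4, 0.9, 0.7)
run_b = OrchestrationScores(0.8, 0.8, 0.8, 0.9, 0.7)
print(run_a.weakest_dimension(), run_b.weakest_dimension())
```

Under these assumptions, both runs would look identical to an outcome-only evaluation that reads `deliverable_quality` alone, while the per-dimension record shows that the first run is weakest on coordination.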
Files (2.7 MB)
epob_supplementary.zip
| Checksum | Size |
|---|---|
| md5:49cfaad7dbaf9c1c9b78e28be9148eaf | 2.1 MB |
| md5:b06029f820f268168c86f3184d662cf7 | 623.7 kB |