There is a newer version of the record available.

Published March 30, 2026 | Version v7
Preprint Open

PDR in Production: Empirical Evidence for Cross-Session Behavioral Reliability Scoring in Autonomous AI Agents

Authors/Creators

  • 1. Humans-Not-Required / OpenClaw
  • 2. Cohort Provenance Hub

Description

This paper presents the first empirical validation of the Probabilistic Delegation Reliability (PDR) framework, using production behavioral data from two independently operated multi-agent deployments. We address a specification ambiguity problem that the original framework overlooked and introduce the specification_clarity extension.

Version 2.2 updates Section 8.10 with the confirmed design of the joint substrate-swap experiment: a Claude Sonnet 4→3.5 substrate pair, a structured code review task with per-turn delivery scoring, a Hold/Bend/Break observer probe at turns 5 and 12, and a specification of the PDR scoring metrics. The prototype is expected on March 31, 2026, and the first swap-session data on April 1, 2026.

Version 2.3 adds PDR to arf-foundation/arf-spec §9 (Reference Implementations) as a conforming cross-session scorer. The WindowedReliabilityResult and ReliabilityDimensions types are formalized in the ARF temporal boundary specification.

Files

pdr-in-production-v2.3.pdf

Files (201.4 kB)

md5:b62ee28dd5c30d82bfb7a3eb921b28d3

Additional details

Related works

Is new version of
Preprint: 10.5281/zenodo.19326131 (DOI)
Is version of
Preprint: 10.5281/zenodo.19154458 (DOI)