Published March 3, 2026 | Version v1
Preprint
Open
AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows
Authors/Creators
Contributors
Researcher:
Description
AgentAssay is the first token-efficient framework for regression testing non-deterministic AI agent
workflows. Autonomous AI agents are deployed at unprecedented scale, yet no principled methodology
has existed for verifying that an agent has not regressed after changes to its prompts, tools, models, or
orchestration logic. AgentAssay introduces stochastic three-valued verdicts (PASS/FAIL/INCONCLUSIVE)
grounded in statistical hypothesis testing, five-dimensional agent coverage metrics, agent-specific
mutation testing operators, and a token-efficient testing pipeline that achieves 78-100% cost
reduction while maintaining rigorous statistical guarantees.
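To make the idea of a three-valued verdict concrete, the following is a minimal sketch and not the AgentAssay implementation. It assumes a verdict can be derived by comparing a Wilson score confidence interval on the observed pass rate against a required threshold; the names `Verdict`, `wilson_interval`, and `verdict`, as well as the threshold and the 95% level, are hypothetical choices made for illustration.

```python
import math
from enum import Enum


class Verdict(Enum):
    """Hypothetical three-valued outcome for a stochastic test."""
    PASS = "PASS"
    FAIL = "FAIL"
    INCONCLUSIVE = "INCONCLUSIVE"


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def verdict(successes: int, trials: int, threshold: float, z: float = 1.96) -> Verdict:
    """Assumed decision rule: PASS or FAIL only when the whole interval
    lies on one side of the threshold; otherwise the evidence is insufficient."""
    low, high = wilson_interval(successes, trials, z)
    if low >= threshold:
        return Verdict.PASS
    if high < threshold:
        return Verdict.FAIL
    return Verdict.INCONCLUSIVE
```

Under these assumptions, 98 passes out of 100 against a 0.9 threshold yields PASS, 60 out of 100 yields FAIL, and 9 out of 10 yields INCONCLUSIVE because ten trials leave the interval straddling the threshold.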
Key results from experiments across 5 models (GPT-5.2, Claude Sonnet 4.6, Mistral-Large-3,
Llama-4-Maverick, Phi-4), 3 scenarios, and 6,500 trials ($59.64 total cost):
- SPRT achieves 78% trial savings across all scenarios
- Behavioral fingerprinting achieves 79% detection power where binary pass/fail testing has 0%
- Full token-efficient pipeline achieves 100% cost savings through trace-first offline analysis
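The trial savings attributed to SPRT come from stopping as soon as the evidence is decisive. The sketch below shows Wald's classic sequential probability ratio test on pass/fail outcomes. It is an illustration under stated assumptions rather than the paper's code: the function name `sprt`, the hypothesised pass rates `p0` (healthy) and `p1` (regressed), and the error rates `alpha` and `beta` are all hypothetical.

```python
import math
from typing import Iterable


def sprt(outcomes: Iterable[bool], p0: float, p1: float,
         alpha: float = 0.05, beta: float = 0.05) -> tuple[str, int]:
    """Wald SPRT for H0: pass rate = p0 versus H1: pass rate = p1 (p1 < p0).

    Returns (verdict, trials_used). The verdict is "INCONCLUSIVE" when the
    outcomes run out before either decision boundary is crossed.
    """
    if not 0 < p1 < p0 < 1:
        raise ValueError("require 0 < p1 < p0 < 1")
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1: regression
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0: no regression
    step_pass = math.log(p1 / p0)              # negative: a pass favours H0
    step_fail = math.log((1 - p1) / (1 - p0))  # positive: a failure favours H1
    llr = 0.0
    used = 0
    for passed in outcomes:
        used += 1
        llr += step_pass if passed else step_fail
        if llr >= upper:
            return "FAIL", used
        if llr <= lower:
            return "PASS", used
    return "INCONCLUSIVE", used
```

With `p0=0.9`, `p1=0.7`, and 5% error rates, an agent that passes every trial is accepted after 12 trials, and one that fails every trial is rejected after 3, regardless of how many trials were budgeted. Passing a lazy iterator of trial results means the remaining agent runs are never executed once a boundary is crossed.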
The implementation comprises ~20,000 lines of Python with 751 tests and adapters for 10 agent
frameworks (LangGraph, CrewAI, AutoGen, OpenAI, smolagents, Semantic Kernel, Bedrock, MCP, Vertex AI,
and generic).
Technical Report. 52 pages, 5 figures, 9 theorems, 42 formal definitions.
Files
| Name | Size |
|---|---|
| main.pdf (md5:7415fbeb3d0e2682015434df01b3fec8) | 469.4 kB |
Additional details
Related works
- Cites
- Preprint: arXiv:2602.22302 (arXiv)
References
- References: arXiv:2602.22302 (Agent Behavioral Contracts — prior work by same author)