Published March 3, 2026 | Version v1
Preprint Open

AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows

Authors/Creators

Contributors

Description

AgentAssay is the first token-efficient framework for regression testing non-deterministic AI agent
workflows. Autonomous AI agents are deployed at unprecedented scale, yet no principled methodology
has existed for verifying that an agent has not regressed after changes to its prompts, tools, models,
or orchestration logic. AgentAssay introduces stochastic three-valued verdicts (PASS/FAIL/INCONCLUSIVE)
grounded in statistical hypothesis testing, five-dimensional agent coverage metrics, agent-specific
mutation testing operators, and a token-efficient testing pipeline that achieves 78-100% cost
reduction while maintaining rigorous statistical guarantees.
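
To illustrate what a three-valued verdict grounded in hypothesis testing can look like, the sketch below compares a Wilson score confidence interval on the observed pass rate against a required threshold. This is not taken from the AgentAssay implementation: the function names, the 0.8 threshold, and the choice of the Wilson interval are all assumptions made for illustration.

```python
import math


def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if trials == 0:
        return 0.0, 1.0  # no evidence yet: the rate could be anything
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def verdict(passes, trials, threshold=0.8, z=1.96):
    """Hypothetical three-valued verdict: decide only when the interval clears the threshold."""
    low, high = wilson_interval(passes, trials, z)
    if low >= threshold:
        return "PASS"          # even the pessimistic estimate meets the bar
    if high < threshold:
        return "FAIL"          # even the optimistic estimate misses the bar
    return "INCONCLUSIVE"      # interval straddles the threshold: more trials needed


print(verdict(98, 100))  # high pass rate over many trials
print(verdict(50, 100))  # clearly below the threshold
print(verdict(8, 10))    # too few trials to decide
```

The INCONCLUSIVE outcome is what separates this from a binary assertion: with few trials the interval is wide, so the test declines to decide instead of reporting a flaky pass or fail.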
 
Key results from experiments across 5 models (GPT-5.2, Claude Sonnet 4.6, Mistral-Large-3,
Llama-4-Maverick, Phi-4), 3 scenarios, and 6,500 trials ($59.64 total cost):
- The sequential probability ratio test (SPRT) achieves 78% trial savings across all scenarios
- Behavioral fingerprinting achieves 79% detection power where binary pass/fail testing has 0%
- The full token-efficient pipeline achieves 100% cost savings through trace-first offline analysis
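
The trial savings attributed to SPRT come from stopping as soon as the evidence is strong enough, instead of always running a fixed number of trials. A minimal sketch of Wald's SPRT on pass/fail outcomes follows. The hypotheses (a regressed pass rate `p0` versus a healthy pass rate `p1`), the default error rates, and the function name are assumptions for illustration and are not taken from the AgentAssay code.

```python
import math


def sprt_verdict(outcomes, p0=0.70, p1=0.90, alpha=0.05, beta=0.05):
    """Wald SPRT for H0: pass rate = p0 (regressed) vs H1: pass rate = p1 (healthy).

    Returns (verdict, trials_used). Stops at the first trial where the
    cumulative log-likelihood ratio crosses a decision boundary.
    """
    upper = math.log((1 - beta) / alpha)   # crossing this accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing this accepts H0
    llr = 0.0
    n = 0
    for n, passed in enumerate(outcomes, start=1):
        # Each trial adds its log-likelihood ratio under H1 versus H0.
        if passed:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "PASS", n
        if llr <= lower:
            return "FAIL", n
    # Trial budget exhausted without crossing either boundary.
    return "INCONCLUSIVE", n


print(sprt_verdict([True] * 50))    # stops early on a run of passes
print(sprt_verdict([False] * 50))   # stops even earlier on a run of failures
print(sprt_verdict([True, False]))  # not enough evidence either way
```

With these assumed parameters a run of consecutive passes is accepted after 12 trials and a run of consecutive failures is rejected after 3, even though 50 trials were budgeted. Early stopping of this kind is how a sequential test can save trials.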
 
The implementation comprises ~20,000 lines of Python with 751 tests and adapters for 10 agent
frameworks (LangGraph, CrewAI, AutoGen, OpenAI, smolagents, Semantic Kernel, Bedrock, MCP, Vertex AI,
and generic).

Technical Report. 52 pages, 5 figures, 9 theorems, 42 formal definitions.

Files

main.pdf (469.4 kB)
md5:7415fbeb3d0e2682015434df01b3fec8

Additional details

Related works

Cites
Preprint: arXiv:2602.22302 (arXiv)

References

  • arXiv:2602.22302 (Agent Behavioral Contracts, prior work by the same author)