Mechanism Before Metric: How Process-Outcome Decoupling Exposes the Hidden Failure Modes of Agentic AI Evaluation
Description
A recurring structural problem haunts the evaluation of agentic AI systems: the metrics used to declare success measure outcomes while remaining largely blind to the processes that produce them. This paper synthesizes findings from six to eight recent preprints spanning clinical decision-making agents, long-horizon data analysis, multi-hop retrieval, continual learning, behavioral trajectory tracking, and multimodal oncology modeling to argue that **process-outcome decoupling**—the systematic divergence between what an agent achieves and how it achieves it—is not a measurement inconvenience but a fundamental signal about where and why agentic systems fail. Our thesis is that outcome-only evaluation is not merely incomplete; it is actively misleading, because it masks failure modes that only become visible when the acquisition and reasoning process is observed directly. We draw on specific empirical findings: the strongest clinical agent in ClinEnv achieves 0.31 decision F1 while exhibiting a 3:1 ratio between diagnosis recovery and management action quality [corpus:arxiv:2606.02568]; the best long-horizon data analysis agent reaches only 48.45% average accuracy with performance dropping 47 points from early to late turns, and additional agent steps do not improve this [corpus:arxiv:2605.30434]; multimodal oncology models can achieve accurate predictions while DECAT reveals they are learning confounders rather than shared biology across modalities [corpus:arxiv:2605.31504]; and behavioral configuration files can be tracked for trait drift with 91.2% sign classification accuracy, making previously invisible behavioral evolution measurable [corpus:arxiv:2606.02536]. The falsification path for our central thesis is direct: if process-aware metrics and outcome metrics were found to be highly correlated across a large, diverse set of agentic tasks, the added cost of process evaluation would be unjustified. Current evidence suggests the opposite. We also draw on findings from continual learning evaluation [corpus:arxiv:2606.02461] and CAPTCHA-boundary agent testing [corpus:arxiv:2606.02449] to characterize the structural conditions under which decoupling is most severe. ---
Authorship: Saluca Agentic AI Research Team (Saluca LLC). AI-drafted from arXiv preprint corpus on the date in the filename.
Cited arXiv preprints: 2605.30434, 2605.31504, 2606.02449, 2606.02461, 2606.02536, 2606.02568
Notes
Files
20260602_amazo_process-outcome-decoupling-agentic-ai-evaluation.pdf
Files
(58.5 kB)
| Name | Size | Download all |
|---|---|---|
|
md5:0f8345bb9cab40bb3cbcd15aeb4dd1bd
|
58.5 kB | Preview Download |