Mechanism Before Metric: How Process-Outcome Decoupling Exposes the Hidden Failure Modes of Agentic AI Evaluation
Description
Version 2 — revised in response to an external structural review and an automated critique pass. See "Response to Review" appendix in the PDF for the change log.
A recurring structural problem haunts the evaluation of agentic AI systems: the metrics used to declare success measure outcomes while remaining largely blind to the processes that produce them. This paper offers a **heuristic reading**, not a formal derivation, of findings from six recent preprints spanning clinical decision-making agents, long-horizon data analysis, multi-hop retrieval, continual learning, behavioral trajectory tracking, and multimodal oncology modeling. The reading argues that **process-outcome decoupling**, the systematic divergence between what an agent achieves and how it achieves it, is not a measurement inconvenience but a signal worth attending to: it is consistent with failure modes that become visible only when the acquisition and reasoning process is observed directly. We draw on four specific empirical findings. First, the strongest clinical agent in ClinEnv achieves 0.31 decision F1 while exhibiting a 3:1 ratio between diagnosis recovery and management action quality [corpus:arxiv:2606.02568]. Second, the best long-horizon data analysis agent reaches only 48.45% average accuracy, with performance dropping 47 points from early to late turns, and the authors report that additional agent steps do not improve this [corpus:arxiv:2605.30434]. Third, multimodal oncology models can achieve accurate predictions while DECAT's rule-based diagnostic procedure suggests they may be learning confounders rather than shared biology across modalities, with the caveat that the diagnostic procedure itself may have failure modes [corpus:arxiv:2605.31504]. Fourth, behavioral configuration files can be tracked for trait drift with 91.2% sign classification accuracy on a small evaluation set of 68 labeled skill diff pairs for a single trait dimension, which makes previously invisible behavioral evolution measurable within that narrow scope [corpus:arxiv:2606.02536].
The falsification path for our central thesis is direct: if process-aware metrics and outcome metrics were found to be highly correlated across a large, diverse set of agentic tasks, the added cost of process evaluation would be unjustified. Current evidence is consistent with the opposite, though it is drawn from a small and potentially unrepresentative corpus. We also draw on findings from continual learning evaluation [corpus:arxiv:2606.02461] to characterize structural conditions under which decoupling may be most severe; a related benchmark on CAPTCHA-boundary agent testing [corpus:arxiv:2606.02449] is discussed separately in an addendum as a more loosely connected case.
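The falsification test described above can be made concrete with a rank correlation between per-task outcome scores and per-task process scores. The following is a minimal sketch under stated assumptions, not the procedure used in the paper: the choice of Spearman's rho, the function names, and the illustrative score lists are all hypothetical, and the numbers are invented for demonstration rather than taken from any cited preprint.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-task scores (invented for illustration only).
outcome_scores = [0.82, 0.31, 0.48, 0.91, 0.67]
process_scores = [0.40, 0.35, 0.72, 0.55, 0.20]

rho = spearman(outcome_scores, process_scores)
print(f"Spearman rho between outcome and process scores: {rho:.2f}")
```

Under this framing, a rho close to 1 across a large and diverse task set would count against the thesis, while a low or unstable rho would be consistent with decoupling. What counts as "highly correlated" is a threshold the sketch deliberately leaves open.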
Authorship: Saluca Agentic AI Research Team (Saluca LLC). AI-drafted from arXiv preprint corpus on the date in the filename.
Cited arXiv preprints: 2605.30434, 2605.31504, 2606.02449, 2606.02461, 2606.02536, 2606.02568
Notes
Files
| Name | Size |
|---|---|
| 20260602_amazo_process-outcome-decoupling-agentic-ai-evaluation_v2.pdf (md5:2ae0f65df663d1f3e8ccd971746cf95d) | 72.7 kB |