Mechanism Before Metric: How Process-Outcome Decoupling Exposes the Hidden Failure Modes of Agentic AI Evaluation

Saluca Agentic AI Research Team

doi:10.5281/zenodo.20519736

Published June 3, 2026 | Version v1

Working paper | Open access

Saluca Agentic AI Research Team¹

1. Saluca LLC

A recurring structural problem affects the evaluation of agentic AI systems: the metrics used to declare success measure outcomes while remaining largely blind to the processes that produce them. This paper synthesizes findings from six recent preprints spanning clinical decision-making agents, long-horizon data analysis, multi-hop retrieval, continual learning, behavioral trajectory tracking, and multimodal oncology modeling. It argues that **process-outcome decoupling**, the systematic divergence between what an agent achieves and how it achieves it, is not a measurement inconvenience but a fundamental signal about where and why agentic systems fail. Our thesis is that outcome-only evaluation is not merely incomplete but actively misleading, because it masks failure modes that become visible only when the acquisition and reasoning process is observed directly.

We draw on four specific empirical findings. First, the strongest clinical agent in ClinEnv achieves 0.31 decision F1 while exhibiting a 3:1 ratio between diagnosis recovery and management action quality [corpus:arxiv:2606.02568]. Second, the best long-horizon data analysis agent reaches only 48.45% average accuracy, its performance drops 47 points from early to late turns, and additional agent steps do not close that gap [corpus:arxiv:2605.30434]. Third, multimodal oncology models can achieve accurate predictions while DECAT reveals that they are learning confounders rather than biology shared across modalities [corpus:arxiv:2605.31504]. Fourth, behavioral configuration files can be tracked for trait drift with 91.2% sign classification accuracy, which makes previously invisible behavioral evolution measurable [corpus:arxiv:2606.02536]. We also draw on findings from continual learning evaluation [corpus:arxiv:2606.02461] and CAPTCHA-boundary agent testing [corpus:arxiv:2606.02449] to characterize the structural conditions under which decoupling is most severe.

The falsification path for our central thesis is direct: if process-aware metrics and outcome metrics were found to be highly correlated across a large, diverse set of agentic tasks, the added cost of process evaluation would be unjustified. Current evidence suggests the opposite.
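The falsification path described in the abstract can be sketched as a rank-correlation check between a process-aware score and an outcome score computed per task. The following is a minimal illustration, not the paper's method: the function names, the 0.9 threshold, and the sample scores are all hypothetical assumptions chosen for the example, and the abstract does not specify which correlation statistic or cutoff would count as "highly correlated".

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


def process_evaluation_adds_signal(process_scores, outcome_scores, threshold=0.9):
    """True when the two metrics are NOT highly correlated.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return spearman(process_scores, outcome_scores) < threshold


if __name__ == "__main__":
    # Hypothetical per-task scores, invented purely for illustration.
    process = [0.20, 0.90, 0.40, 0.70, 0.10]
    outcome = [0.80, 0.70, 0.90, 0.60, 0.75]
    print("Spearman rho:", round(spearman(process, outcome), 3))
    print("Process evaluation adds signal:",
          process_evaluation_adds_signal(process, outcome))
```

Rank correlation is used here only because process and outcome scores need not share a scale; a real test of the thesis would also need a large, diverse task set, as the abstract notes.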

Authorship: Saluca Agentic AI Research Team (Saluca LLC). AI-drafted from arXiv preprint corpus on the date in the filename.

Cited arXiv preprints: 2605.30434, 2605.31504, 2606.02449, 2606.02461, 2606.02536, 2606.02568

Notes

This paper was AI-drafted by an internal multi-persona research agent over a curated arXiv corpus. It is not peer-reviewed. All cited works are listed by arXiv ID; readers should follow those links to verify claims against the primary preprints.

Files

Files (58.5 kB)

Name	md5	Size
20260602_amazo_process-outcome-decoupling-agentic-ai-evaluation.pdf	0f8345bb9cab40bb3cbcd15aeb4dd1bd	58.5 kB

	All versions	This version
Views	5	1
Downloads	5	1
Data volume	422.1 kB	58.5 kB
