Published April 10, 2026 | Version v1
Preprint | Open access

The Trigger Is the Product: Why Knowing When to Intervene Matters More Than How

  • 1. CAUM Systems

Description

In a prior study [1], we demonstrated that a single interpretive sentence added to an observer advisory made the difference between a 0/13 and a 7/7 rescue rate on stuck LLM coding agents. This paper extends that work in three directions, with results that substantially revise our original conclusions. First, we show that two BigCodeBench tasks previously classified as "capability floors" of gpt-4o-mini (BCB/17 and BCB/26), which we claimed no intervention could rescue, are in fact hidden semantic blind spots that a cross-vendor open-weight LLM advisor (Gemma 4 31B, Google) running locally can rescue. Over n=8 controlled reruns per task, the baseline rescues 0/16 runs while the advisor rescues 15/16 (93.75%), refuting the capability-floor category from our original paper. Second, we demonstrate empirically that indiscriminate deployment of the same advisor produces a net-negative effect: over 200 BigCodeBench tasks, the advisor rescues 7 tasks but regresses 11, a net loss of 4 tasks (-2.0 percentage points vs. baseline). Third, we show that a smart trigger, which raises the intervention threshold from 3 to 5 consecutive failures and requires stderr similarity ≥60% before firing, eliminates 8 of the 11 regressions while preserving all 7 rescues, inverting the net effect from -4 to +4 tasks. The central finding is that the value of an LLM-based rescue system lies not in the advisor's diagnostic capability but in the precision of the trigger that decides when to invoke it. An advisor without a precise trigger is worse than no advisor. An advisor with a precise trigger rescues tasks that were previously thought unrescuable. We argue this has direct implications for the design of production agent observability systems: structural loop detection is the necessary complement to semantic intervention, and neither alone achieves optimal results.
All experiments use gpt-4o-mini as the main coding agent and Gemma 4 31B (open-weight, running locally on a single A100 GPU) as the cross-vendor advisor, with CAUM as the structural detection layer.
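
The smart trigger described above (at least 5 consecutive failures, with stderr similarity of at least 60%) can be illustrated with a minimal Python sketch. This is not the CAUM implementation. The function names `stderr_similarity` and `should_intervene` are hypothetical, the use of `difflib.SequenceMatcher` as the similarity measure is an assumption, and so is the choice to compare adjacent attempts within the failure streak. The description does not specify how similarity is computed or which attempts are compared.

```python
from difflib import SequenceMatcher
from typing import Sequence


def stderr_similarity(a: str, b: str) -> float:
    """Similarity of two stderr outputs in [0, 1].

    ASSUMPTION: difflib's ratio stands in for whatever similarity
    measure the paper actually uses.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def should_intervene(
    stderr_streak: Sequence[str],
    min_failures: int = 5,         # threshold stated in the description
    min_similarity: float = 0.60,  # threshold stated in the description
) -> bool:
    """Decide whether to invoke the advisor.

    stderr_streak holds the stderr of each attempt in the CURRENT run of
    consecutive failures, oldest first (hypothetical input format).
    """
    # Condition 1: enough consecutive failures.
    if len(stderr_streak) < min_failures:
        return False
    # Condition 2 (ASSUMPTION): every adjacent pair among the most recent
    # min_failures attempts must look alike, i.e. the agent is stuck on
    # the same error rather than making progress through different ones.
    recent = stderr_streak[-min_failures:]
    return all(
        stderr_similarity(prev, curr) >= min_similarity
        for prev, curr in zip(recent, recent[1:])
    )


if __name__ == "__main__":
    stuck = ["KeyError: 'col'"] * 5
    print(should_intervene(stuck))      # same error five times in a row
    print(should_intervene(stuck[:4]))  # streak too short
```

Under these assumptions, the advisor stays silent while the agent is still cycling through different errors and fires only on a repeated, near-identical failure, which is the behavior the description credits with eliminating most regressions.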

Notes

This work uses CAUM (https://doi.org/10.5281/zenodo.18927886) for loop detection and orthogonal validation.

Files

PAPER_ZENODO_v2.md (28.2 kB)
md5:244267b908ff6e8bde490e41d7e3dec0

Additional details

Related works

Is new version of
10.5281/zenodo.19463134 (DOI)

References