Published June 4, 2026 | Version v1
Other Open

The Lever Is Late: Causal Control of Long-Horizon Agent Termination Lives in a Task-Matched, Late Action-Commitment Block

Authors/Creators

  • 1. OpenInterpretability

Description

This is the first positive result of the five-part WANDERING arc on long-horizon coding-agent failure, in which agents keep acting but never emit the terminating 'finish' tool call. The arc established that the agent's 'task-done' verdict is linearly decodable (AUROC 0.81-0.91) yet causally inert: no residual injection rescues termination, and clamping the exact, named SAE 'done' feature changes the probability of finishing by only -0.001.
This paper localizes where termination control actually lives. On 99 Qwen3.6-27B SWE-bench Pro trajectories, reconstructed faithfully at the decision point and gated for behavioral fidelity (P(finish): SUCCESS 0.59 >> WANDERING 0.07 >> LOCKED 0.005), a layer-resolved logit lens shows that the finish decision is invisible through layer 31 and emerges only in the last ~12 of 64 layers (L51-L63), ~30 layers downstream of the mid-layer verdict (L23). Activation patching confirms the asymmetry causally: injecting the SUCCESS late-block state into WANDERING raises P(finish) (+0.13 at L55, +0.15 at L59). The effect is donor-specific (the LOCKED donor moves P(finish) the other way), while every mid-layer and verdict-feature intervention is null.
Critically, the effect survives real generation: patching the late block at the decision point alone makes the agent emit a well-formed 'finish' tool call at 42% of WANDERING decision points (5/12; exact one-sided McNemar p=0.031 versus a 0/12 baseline and a 0/12 LOCKED-donor null), but only when the donor is task-matched; a coarse class-mean donor is not significant (25%, p=0.125). This is the first internal causal lever of the arc, and it reframes the knowledge-action gap in agents as a layer gap: the termination decision is known mid-stream but writable only late. The jump from verdict null to late lever is a controlled, same-experiment contrast; a separate behavioral interruption gives a comparable lift (from 30% to 70%).
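
The two reported p-values can be checked by hand. The sketch below is not the authors' analysis code; it is a minimal exact one-sided McNemar test (a binomial tail over discordant pairs), with the function name `mcnemar_exact_one_sided` chosen here for illustration. Because the baseline finishes at 0 of 12 decision points, no pair can flip away from 'finish', so the count of discordant pairs in the opposite direction is 0. The count of 3 for the coarse donor is inferred from 25% of 12.

```python
from math import comb


def mcnemar_exact_one_sided(b: int, c: int) -> float:
    """Exact one-sided McNemar p-value.

    b: pairs that flip toward the outcome (baseline no, patched yes).
    c: pairs that flip away from it (baseline yes, patched no).
    Returns P(X >= b) for X ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n


# Task-matched donor: 5 of 12 decision points flip to 'finish', none flip away.
p_matched = mcnemar_exact_one_sided(5, 0)
# Coarse class-mean donor: 3 of 12 (25%) flip to 'finish', none flip away.
p_coarse = mcnemar_exact_one_sided(3, 0)

print(round(p_matched, 3), round(p_coarse, 3))  # prints: 0.031 0.125
```

With only 12 paired decision points and a zero baseline, the test reduces to 0.5 raised to the number of flips, which is why the move from 5 flips to 3 is enough to cross the 0.05 line.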
Released with a model-agnostic 'decision-locator' tool that finds and steers the commitment layer for any tool-calling decision on any open-weight model. Honest scope: a single model, a single task family, and n=12 decision points for the generation result; the positive headline depends on the task-matched donor (the coarse class-mean donor is not significant). The pre-registration, figure code, notebooks, the tool, the pre-mint eval, and per-experiment results are in the GitHub repository under paper/breakthrough/ and tools/decision_locator/.
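
To make the idea of locating a commitment layer concrete, here is a toy sketch in pure Python. It does not reproduce the released tool or its interface: `lens_prob`, `commitment_layer`, the two-token vocabulary, and the hidden states are all hypothetical and chosen for illustration. It also simplifies the logit lens by skipping the final normalization that a real model applies before the unembedding.

```python
import math


def lens_prob(hidden, unembed, token_id):
    """Simplified logit lens: project a hidden state through the unembedding
    matrix (one row per token), softmax, and read one token's probability."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return exps[token_id] / sum(exps)


def commitment_layer(probs, threshold):
    """First layer from which the lens probability stays at or above
    `threshold` through the final layer; None if it never settles."""
    onset = None
    for layer, p in enumerate(probs):
        if p >= threshold:
            if onset is None:
                onset = layer
        else:
            onset = None  # dipped below: any earlier onset did not persist
    return onset


# Hypothetical toy setup: token 0 = "other", token 1 = "finish".
unembed = [[1.0, 0.0], [0.0, 1.0]]
# One made-up hidden state per layer; "finish" only becomes readable late.
hidden_by_layer = [[2.0, 0.0], [2.0, 0.0], [1.0, 1.0], [0.0, 2.0], [0.0, 3.0]]

profile = [lens_prob(h, unembed, token_id=1) for h in hidden_by_layer]
print(commitment_layer(profile, threshold=0.6))  # prints: 3
```

Requiring the probability to stay above the threshold through the last layer, rather than taking the first crossing, avoids reporting a transient early spike as the commitment point.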

Files

verdict_lever_paper.pdf

Files (287.4 kB)

md5:6415773cedee7997b6bac90f4aacaa9b

Additional details