Published June 1, 2026 | Version v1.0.1
Preprint Open

Counterfactual Ablation for Memory-Utility Evaluation: A Pre-Registered Case Study in Specialist Re-ranking

  • 1. Kill The Dragon

Description

Context allocation across time, not context length, is the central memory problem for retrieval-augmented language-model agents. The paper's methodological contribution is counterfactual ablation as a per-memory utility signal: remove each retrieved memory in turn and label it by the resulting change in answerer correctness. The construction is non-circular by three structural arguments, with Spearman correlations from $-0.024$ to $+0.161$ across four large-scale runs: three within-pipeline on MemoryAgentBench and LoCoMo, and one substrate-independent on LoCoMo Multi-Hop whose CI spans zero and which we treat as binding. We exercise the signal on one operationalization of the hypothesis that context allocation requires per-memory utility distinct from cosine similarity, and report a documented dissolution as the case study.
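The labeling rule described above (remove each retrieved memory in turn and record the change in answerer correctness) can be illustrated with a minimal sketch. This is not the released pipeline: the names `ablation_labels`, `substring_exact_match`, and the `answer_fn` callback are hypothetical, and the correctness metric is assumed here to be a case-insensitive substring match.

```python
from typing import Callable, List, Sequence


def substring_exact_match(prediction: str, gold: str) -> int:
    """Return 1 if the gold answer occurs (case-insensitively) in the prediction."""
    return int(gold.strip().lower() in prediction.lower())


def ablation_labels(
    query: str,
    memories: Sequence[str],
    gold: str,
    answer_fn: Callable[[str, List[str]], str],
) -> List[int]:
    """Label each memory by the drop in correctness when it alone is removed.

    +1: removing the memory breaks a correct answer (the memory helped).
     0: removing it changes nothing.
    -1: removing it fixes a wrong answer (the memory hurt).
    """
    # Correctness with the full retrieved set (hypothetical answerer callback).
    base = substring_exact_match(answer_fn(query, list(memories)), gold)
    labels = []
    for i in range(len(memories)):
        # Leave-one-out context: every memory except the i-th.
        ablated = [m for j, m in enumerate(memories) if j != i]
        score = substring_exact_match(answer_fn(query, ablated), gold)
        labels.append(base - score)
    return labels


def toy_answerer(query: str, context: List[str]) -> str:
    """Stand-in answerer for illustration only: echoes its context."""
    return " ".join(context)


labels = ablation_labels(
    "What is the capital of France?",
    ["Paris is the capital of France.", "Bananas are yellow."],
    "Paris",
    toy_answerer,
)
```

In this toy run only the first memory carries the answer, so it alone receives a nonzero label. Because the label depends on the answerer's outcome rather than on any similarity score, it can in principle disagree with cosine ranking.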

A 1.5B-parameter LoRA specialist trained on these labels produced point-estimate gains of $+8/+7/+4/+5$ in substring exact match over vanilla retrieval at $K=5$ on MemoryAgentBench (MAB). Five rigor layers tighten this result. Paired-bootstrap $95\%$ CIs leave two strictly significant cells. $K$-normalization to the published comparator depth leaves one of four datasets within $\pm 2$ pp, on partial data. BM25 sparse retrieval beats the specialist by $+13$ to $+22$ pp on three of four datasets, reframing the $K=5$ gains as "less suboptimal than BGE cosine alone" rather than competitive. Cross-substrate transfer to LoCoMo Multi-Hop returns F1 $17.0\%$ against a published $45.85\%$ (Xu et al., 2025), but a prompt control shows that the specialist contributes $+13$ pp over vanilla cosine on the same prompt, so the residual gap is attributable to the pipeline rather than to total cross-substrate failure. Learning-pattern probes score memory-equals-query at $100\%$ above zero and fail label discrimination on a held-out validation sample. What survives: counterfactual ablation as a non-circular outcome signal, and the rigor-dissolution discipline with pre-registered ADRs anchored to public git history. The broader hypothesis remains untested under operationalizations we did not run.
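For readers unfamiliar with the paired bootstrap mentioned above, the following is a minimal sketch of a percentile paired-bootstrap confidence interval for the mean per-question difference between two systems. It assumes per-question 0/1 correctness scores on the same questions; the function name, resample count, and seed are illustrative choices, not values taken from the paper.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for A minus B.

    Questions are resampled with replacement, keeping each A/B pair together,
    so the interval reflects variation across questions only.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    n = len(scores_a)
    if n == 0:
        raise ValueError("need at least one paired observation")
    # Per-question differences preserve the pairing.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample questions with replacement
        means.append(sum(sample) / n)
    means.sort()
    # Percentile interval: cut alpha/2 from each tail of the sorted means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Toy scores for illustration only (not data from the paper).
specialist = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
vanilla = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
point, lower, upper = paired_bootstrap_ci(specialist, vanilla)
```

Under this convention a gain would count as strictly significant at the $95\%$ level when the lower bound is above zero. Whether the paper uses the percentile variant or another bootstrap interval is not stated in this description.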

Files

ktdmax/supabrain-v1.0.1.zip

Files (97.0 MB)

md5:3489a84dc81f0a8e13782cd76cc1787c (97.0 MB)

Additional details

Related works

Is supplement to
Software: https://github.com/ktdmax/supabrain/tree/v1.0.1 (URL)

Software