Counterfactual Ablation for Memory-Utility Evaluation: A Pre-Registered Case Study in Specialist Re-ranking

Jürschik, Max

doi:10.5281/zenodo.20483675

Published June 1, 2026 | Version v1.0.1

Preprint Open

Jürschik, Max (Project leader)¹

1. Kill The Dragon

Context allocation across time, not context length, is the central memory problem for retrieval-augmented language-model agents. The paper's methodological contribution is counterfactual ablation as a per-memory utility signal: remove each retrieved memory in turn and label it by the resulting change in answerer correctness. The construction is non-circular by three structural arguments, with Spearman correlations from $-0.024$ to $+0.161$ across four large-scale runs: three within-pipeline on MemoryAgentBench (MAB) and LoCoMo, and one substrate-independent on LoCoMo Multi-Hop, whose CI spans zero and which we treat as binding. We exercise the signal on one operationalization of the hypothesis that context allocation requires a per-memory utility distinct from cosine similarity, and report a documented dissolution as the case study.
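The labeling step described above can be sketched as follows. This is a minimal illustration, not the released pipeline: the function names (`substring_exact_match`, `ablation_labels`, `toy_answerer`), the case-insensitive matching, and the sign convention (positive label means removing the memory hurt correctness) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def substring_exact_match(prediction: str, gold: str) -> int:
    """Return 1 if the gold answer occurs in the prediction, else 0.

    Case-insensitive matching is an assumption of this sketch.
    """
    return int(gold.strip().lower() in prediction.lower())


def ablation_labels(
    query: str,
    memories: Sequence[str],
    gold: str,
    answer: Callable[[str, List[str]], str],
) -> List[int]:
    """Label each memory by the change in correctness when it is removed.

    label = correct(all memories) - correct(all memories except i)
      +1: removing the memory breaks a correct answer (memory was useful)
       0: removing the memory changes nothing
      -1: removing the memory fixes a wrong answer (memory was harmful)
    """
    memories = list(memories)
    full = substring_exact_match(answer(query, memories), gold)
    labels = []
    for i in range(len(memories)):
        reduced = memories[:i] + memories[i + 1:]
        ablated = substring_exact_match(answer(query, reduced), gold)
        labels.append(full - ablated)
    return labels


def toy_answerer(query: str, memories: List[str]) -> str:
    """Hypothetical stand-in for the answerer: echoes the retrieved context."""
    return " ".join(memories)


example_memories = [
    "Paris is the capital of France.",
    "Berlin is in Germany.",
    "The Seine flows through Paris.",
]
example_labels = ablation_labels(
    "What is Paris?", example_memories, "capital of France", toy_answerer
)
```

With the toy answerer, only the first memory carries the gold string, so `example_labels` is `[1, 0, 0]`. The label depends on the answerer's output and never on the similarity score that retrieved the memory. A real answerer would be a language-model call, which costs one extra call per retrieved memory.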

A 1.5B-parameter LoRA specialist trained on these labels produced point-estimate gains of $+8/+7/+4/+5$ substring-exact-match over vanilla retrieval at $K=5$ on MAB. Five rigor layers tighten this result. Paired-bootstrap $95\%$ CIs leave two strictly significant cells. $K$-normalization to the published comparator depth leaves one of four datasets within $\pm 2$pp, on partial data. BM25 sparse retrieval beats the specialist by $+13$ to $+22$pp on three of four datasets, reframing the $K=5$ gains as "less suboptimal than BGE cosine alone" rather than competitive. Cross-substrate transfer to LoCoMo Multi-Hop returns F1 $17.0\%$ against a published $45.85\%$ (Xu et al., 2025), but a prompt control shows the specialist contributes $+13$pp over vanilla cosine on the same prompt, so the residual gap is pipeline-attributable rather than a total cross-substrate failure. Learning-pattern probes score memory-equals-query at $100\%$ above zero and fail label discrimination on a held-out validation sample. What survives: counterfactual ablation as a non-circular outcome signal, and the rigor-dissolution discipline with pre-registered ADRs anchored to public git history. The broader hypothesis remains untested under operationalizations we did not run.
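For readers unfamiliar with the paired-bootstrap check mentioned above, the sketch below shows one common percentile variant. It is an assumption-laden illustration rather than the paper's script: the function name `paired_bootstrap_ci`, the resample count, the fixed seed, and the reading of "strictly significant" as "the interval excludes zero" are all choices made for this example.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    system: List[int],
    baseline: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the mean per-example score difference.

    `system` and `baseline` hold per-example correctness (0 or 1) on the
    SAME examples in the same order; pairing is what makes the test paired.
    Returns (point_estimate, lower, upper).
    """
    if len(system) != len(baseline) or not system:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(system)
    diffs = [s - b for s, b in zip(system, baseline)]

    means = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


def excludes_zero(lower: float, upper: float) -> bool:
    """Assumed reading of 'strictly significant': the CI does not contain 0."""
    return lower > 0 or upper < 0
```

Resampling the per-example differences, rather than each system's scores independently, removes the variance shared by both systems on easy or hard examples, which is why the paired form is preferred when both systems answer the same questions.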

Files

ktdmax/supabrain-v1.0.1.zip (97.0 MB), md5:3489a84dc81f0a8e13782cd76cc1787c

Additional details

Is supplement to (software): https://github.com/ktdmax/supabrain/tree/v1.0.1

Repository URL: https://github.com/ktdmax/supabrain
