Published February 16, 2026 | Version v1
Preprint | Open Access

CoDA-GQA-L: Bounded-Memory Differential Attention with Value-Routed Landmark Banks

  • Anthony Maio (Making Minds AI)
Contributors

  • Making Minds AI

Description

We present CoDA-GQA-L, an attention mechanism that provably bounds per-layer KV cache memory to O(W + Me + Ms) independent of sequence length while retaining selective long-range context through dual memory banks. The architecture combines three innovations: (1) Constrained Orthogonal Differential Attention (CoDA), which sharpens attention by subtracting a gated inhibitory stream produced via a learnable orthogonal rotation of the signal query, eliminating the second query projection required by prior differential attention; (2) a dual-bank bounded memory comprising an exact landmark bank for high-fidelity token retention and an EMA summary bank for thematic compression; and (3) value-routed semantic matching that ensures position-invariant memory updates despite RoPE-at-write key storage. A two-phase training protocol first teaches differential attention with full context, then adapts the model to bounded memory. Benchmarks across three model scales (Eve-2, 7B, and 70B parameters) on NVIDIA H200 demonstrate up to 37× per-layer memory compression, scale-invariant bounded prefill throughput of 150K tokens/second regardless of model dimension, and measured compression ratios exceeding 1,100× at 70B scale with 128K context. The bounded state is a fixed-size serializable artifact, enabling a new paradigm of Stateful Neural Databases for agentic retrieval-augmented generation.
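To make the CoDA component of the abstract concrete, below is a minimal PyTorch sketch of a single attention head in which the inhibitory query is obtained by a learnable orthogonal rotation of the signal query and its attention map is subtracted under a learned gate. The class and parameter names (CoDAAttentionHead, rot_param, gate), the skew-symmetric matrix-exponential parameterization of the rotation, and the sigmoid gating are illustrative assumptions, not the released implementation; grouped-query attention, causal masking, and the bounded landmark/summary banks are omitted. See the repository linked under Additional details for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoDAAttentionHead(nn.Module):
    """Sketch of one differential-attention head in the CoDA style (assumed API)."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head, bias=False)
        self.k_proj = nn.Linear(d_model, d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        # Skew-symmetric parameterization: matrix_exp(A - A^T) is orthogonal,
        # giving a learnable rotation of the signal query (no second q_proj).
        self.rot_param = nn.Parameter(torch.zeros(d_head, d_head))
        # Scalar gate on the inhibitory stream; sigmoid keeps it in [0, 1].
        self.gate = nn.Parameter(torch.tensor(0.0))
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); causal masking omitted for brevity.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        rot = torch.matrix_exp(self.rot_param - self.rot_param.T)  # orthogonal matrix
        q_inhib = q @ rot  # inhibitory query: rotated copy of the signal query
        attn_signal = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        attn_inhib = F.softmax(q_inhib @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Differential attention: subtract the gated inhibitory map from the signal map.
        attn = attn_signal - torch.sigmoid(self.gate) * attn_inhib
        return attn @ v
```

Because the inhibitory query is only a rotation of the signal query, the head adds a single d_head × d_head rotation parameter per head rather than a full second query projection, which is the parameter saving the abstract attributes to CoDA relative to prior differential attention.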

Files (1.2 MB)

  • 00_CoDA-GQA-L_Bounded-Memory_Differential_Attention_Maio(2026).pdf, 1.1 MB (md5:5807652e2f4108189f3e5748942b2b05)
  • (unnamed file), 60.0 kB (md5:52a1f22419c59d1628bb2fdbb0ecd5c6)
  • (unnamed file), 14.0 kB (md5:9b532d789bef3dfb0d622ff5c57d6453)

Additional details

Software

Repository URL
https://github.com/anthony-maio/CoDA-GQA-L
Development Status
Active