CoDA-GQA-L: Bounded-Memory Differential Attention with Value-Routed Landmark Banks
Description
We present CoDA-GQA-L, an attention mechanism that provably bounds per-layer KV cache memory to O(W + Me + Ms) independent of sequence length while retaining selective long-range context through dual memory banks. The architecture combines three innovations: (1) Constrained Orthogonal Differential Attention (CoDA), which sharpens attention by subtracting a gated inhibitory stream produced via a learnable orthogonal rotation of the signal query, eliminating the second query projection required by prior differential attention; (2) a dual-bank bounded memory comprising an exact landmark bank for high-fidelity token retention and an EMA summary bank for thematic compression; and (3) value-routed semantic matching that ensures position-invariant memory updates despite RoPE-at-write key storage. A two-phase training protocol first teaches differential attention with full context, then adapts the model to bounded memory. Benchmarks across three model scales (Eve-2, 7B, 70B parameters) on NVIDIA H200 demonstrate up to 37× per-layer memory compression, scale-invariant bounded prefill throughput of ∼150K tokens/second regardless of model dimension, and measured compression ratios exceeding 1,100× at 70B scale with 128K context. The bounded state is a fixed-size serializable artifact, enabling a new paradigm of Stateful Neural Databases for agentic retrieval-augmented generation.
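To make the CoDA idea above concrete, the following is a minimal single-head sketch, not the paper's implementation: it assumes a PyTorch setting, uses torch's built-in orthogonal parametrization to stand in for the paper's orthogonality constraint, and the names `CoDAHeadSketch`, `rotate`, and `gate`, as well as the exact gating and renormalization, are illustrative assumptions. The point it shows is that the inhibitory stream reuses the signal query through a learnable orthogonal rotation, so no second query projection is needed.

```python
# Illustrative sketch of a differential-attention head with an orthogonally
# rotated inhibitory query (assumed formulation; see the paper for the exact one).
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal


class CoDAHeadSketch(nn.Module):
    def __init__(self, d_head: int):
        super().__init__()
        # Learnable rotation constrained to the orthogonal group; the signal
        # query is reused, so no separate inhibitory query projection exists.
        self.rotate = orthogonal(nn.Linear(d_head, d_head, bias=False))
        # Scalar gate on the inhibitory stream (assumed form).
        self.gate = nn.Parameter(torch.tensor(0.5))
        self.scale = d_head ** -0.5

    def forward(self, q, k, v):
        # q, k, v: (batch, seq, d_head)
        q_inh = self.rotate(q)  # inhibitory query via orthogonal rotation of the signal query
        attn_sig = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        attn_inh = torch.softmax(q_inh @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Differential attention: subtract the gated inhibitory map, then renormalize
        # (clamping and renormalization are assumptions for a well-formed distribution).
        attn = (attn_sig - torch.sigmoid(self.gate) * attn_inh).clamp(min=0.0)
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return attn @ v


# Example usage on random tensors:
# head = CoDAHeadSketch(d_head=64)
# q = k = v = torch.randn(2, 128, 64)
# out = head(q, k, v)  # (2, 128, 64)
```

The bounded-memory banks, value-routed matching, and two-phase training described in the abstract are not sketched here; the repository linked below contains the full implementation.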
Files (1.2 MB)

| Name | Size | MD5 |
|---|---|---|
| 00_CoDA-GQA-L_Bounded-Memory_Differential_Attention_Maio(2026).pdf | 1.1 MB | 5807652e2f4108189f3e5748942b2b05 |
| | 60.0 kB | 52a1f22419c59d1628bb2fdbb0ecd5c6 |
| | 14.0 kB | 9b532d789bef3dfb0d622ff5c57d6453 |
Additional details
Software
- Repository URL: https://github.com/anthony-maio/CoDA-GQA-L
- Development Status: Active