CrystalCache: Cross-Domain Transfer from Cognitive Memory Crystallization to KV Cache Eviction in Long-Context LLMs
Abstract
The Key–Value (KV) cache of long-context Large Language Models (LLMs) grows linearly with context length and is now the dominant memory bottleneck of long-context inference; at 128K tokens a single batch of bf16 KV for Llama-3-8B
already exceeds the model weights themselves. Existing eviction methods fall into two generations. The first
generation (H2O, SnapKV, StreamingLLM, Scissorhands) summarises each token by a single scalar and evicts at token
granularity, producing "coverage holes" over semantically coherent passages. The second generation (ChunkKV, EpiCache,
CAOTE, DefensiveKV, PyramidKV) advances along a single axis each — fixed-size grouping, signal fusion, or robust
aggregation of repeated observations — but none simultaneously satisfies the four structural requirements of dynamic
semantic boundaries, two independent scoring dimensions, an explicit rarity signal, and progressive (rather than
binary) retention.
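To make the opening memory claim concrete, here is a back-of-envelope computation using the published Llama-3-8B configuration (32 transformer layers, 8 grouped-query KV heads, head dimension 128) with bf16 at 2 bytes per element; the exact parameter count is rounded to 8B:

```python
# Back-of-envelope KV-cache footprint for Llama-3-8B at 128K context, batch size 1.
# Published Llama-3-8B config: 32 layers, 8 grouped-query KV heads, head dim 128.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                                # bf16

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K plane + V plane
seq_len = 128 * 1024

kv_gib = per_token * seq_len / 2**30              # 131072 B/token * 131072 tokens
weights_gib = 8.0e9 * bytes_per_elem / 2**30      # ~8B params in bf16

print(f"KV cache: {kv_gib:.1f} GiB")              # 16.0 GiB
print(f"weights:  {weights_gib:.1f} GiB")         # ~14.9 GiB -> the cache already
                                                  # exceeds the model weights
```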
We propose CrystalCache, a KV-cache eviction algorithm derived from the structural predictions of the Crystallization
Memory Framework: that any system serving a memory function should describe each item along at least two independent
axes (analogous to a crystal's structural extent and formation strength) and should organise items as a multi-branch
trunk rather than a single block. CrystalCache instantiates these predictions in four concurrent design moves: (1) it
builds trunks — semantic units bounded by sentence punctuation and refined by co-attention — rather than fixed-size
chunks or utterance clusters; (2) it scores each trunk along two independently computed dimensions, an associative
crystallization term D (structural centrality in the trunk graph) and an encoding impact term M_i (attention salience
plus a Von Restorff rarity term), and composes them as Score = max(D, α · normalize(log(1 + M_i))), providing two
independent survival paths; (3) it injects an explicit token-frequency rarity signal U_i = 1 / (1 + log(1 + c_i))
directly into the score, a signal absent from all four contemporaneous works; and (4) it replaces binary retention
with a two-stage branch dissolution procedure that performs proportional retention between trunks and M_i-ranked
retention within trunks.
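The scoring and dissolution machinery composes as in the short NumPy sketch below. This is a minimal illustration of the formulas stated above, not the released implementation: the additive fusion inside M_i, the min-max normalization, the per-trunk aggregation of M, and the floor-based quota split are our assumptions for the sake of a runnable example.

```python
import numpy as np

def rarity(counts):
    # Von Restorff rarity term from the abstract: U_i = 1 / (1 + log(1 + c_i)),
    # where c_i is the frequency of token i in the context.
    return 1.0 / (1.0 + np.log1p(counts))

def encoding_impact(attn_salience, counts):
    # M_i = attention salience plus the rarity term (additive fusion assumed).
    return attn_salience + rarity(counts)

def trunk_score(D, M_trunk, alpha=0.5):
    # Score = max(D, alpha * normalize(log(1 + M_i))); D is structural
    # centrality in the trunk graph, M_trunk a per-trunk aggregate of M_i
    # (e.g. the mean). Min-max normalization assumed.
    m = np.log1p(M_trunk)
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)
    return np.maximum(D, alpha * m)

def dissolve(trunks, scores, M_token, budget):
    # Two-stage branch dissolution under a total token budget:
    # stage 1 allots tokens to trunks in proportion to their scores;
    # stage 2 keeps the top-M_i tokens inside each trunk.
    quotas = np.floor(budget * scores / scores.sum()).astype(int)
    kept = []
    for tok_ids, q, m in zip(trunks, quotas, M_token):
        q = min(q, len(tok_ids))
        top = np.argsort(m)[::-1][:q]                 # highest-M_i tokens survive
        kept.extend(tok_ids[i] for i in sorted(top))  # preserve original order
    return kept
```

The max composition gives each trunk two independent survival paths: a structurally central trunk survives through D even with low salience, and a rare, salient trunk survives through M_i even if it is peripheral in the trunk graph.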
On Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, and Qwen3-8B, across Needle-in-a-Haystack and a Delayed
Association diagnostic at retention budgets β ∈ {0.3, 0.5}, CrystalCache wins all 3 × 2 × 2 = 12 retrieval comparisons
against H2O, SnapKV, ChunkKV, StreamingLLM, and PyramidKV; on Qwen3-8B Needle (β = 0.5) it doubles the best baseline
(0.333 vs. 0.167) and quadruples the weakest (vs. 0.083). Ablations identify the Von Restorff rarity term as the
single most impactful component (−0.383 when removed), show that trunk-level eviction outperforms token-level eviction (−0.317 when T_max = 1), and confirm that the dual-dimension max composition strictly beats either dimension alone. On
the broader-coverage LongBench suite, CrystalCache is competitive but not leading, a trade-off we attribute to the
spatial-coverage cost of trunk-level retention and discuss honestly as a limitation. The end-to-end system delivers
50–70% steady-state decode memory savings; the prefill overhead (54–64% at 16K–32K) stems entirely from a CPU-NumPy O(n²) co-attention edge-extraction step and is an engineering limitation, not an algorithmic one.
Beyond the empirical result, the consistency of the 12/12 cross-model, cross-task, cross-budget gains constitutes a
computational corroboration of the structural predictions of the Crystallization Memory Framework: when a system
serves a memory function, structural principles derived from biological memory transfer non-trivially to its design.
Files (2.4 MB)
| Name | Size |
|---|---|
| CRYSTALCACHE__CROSS_DOMAIN_TRANSFER_FROM_COGNITIVE_MEMORY_CRYSTALLIZATION_TO_KV_CACHE_EVICTION_IN_LONG_CONTEXT_LLMS.pdf (md5:ca7e6a64ddc90c6d36a6454f8976469b) | 2.4 MB |