Impact of Layer-wise KV Cache Reconstruction on Artificially Inflated Needle-in-a-Haystack Scores in Ultra-Long Context Tasks
Description
Large Language Models (LLMs) require significant GPU memory when processing long texts, with the key-value (KV) cache consuming up to 70% of total memory during inference. Although existing compression methods reduce memory by evaluating the importance of individual tokens, they overlook critical semantic relationships between tokens, resulting in fragmented context and degraded performance. We introduce ChunkKV, which fundamentally reimagines KV cache compression by treating semantic chunks, rather than isolated tokens, as basic compression units. This approach preserves complete linguisti
Research goal: To what extent does layer-wise KV cache reconstruction in methods like ReST-KV artificially inflate needle-in-a-haystack scores relative to standard eviction policies on ultra-long context tasks?
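The contrast the description draws between token-level and chunk-level compression can be illustrated with a minimal sketch. It is not the ChunkKV or ReST-KV implementation: the function names, the fixed-size chunking (standing in for semantic chunks), the summed-score ranking, and the toy importance scores are all assumptions made for illustration.

```python
def token_level_keep(scores, budget):
    """Keep the `budget` highest-scoring token positions, ignoring adjacency."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def chunk_level_keep(scores, budget, chunk_size):
    """Keep whole fixed-size chunks, ranked by summed score, within the budget.

    Assumption: fixed-size chunks stand in for semantic chunks here.
    """
    chunks = [
        list(range(start, min(start + chunk_size, len(scores))))
        for start in range(0, len(scores), chunk_size)
    ]
    ranked = sorted(chunks, key=lambda c: sum(scores[i] for i in c), reverse=True)
    kept = []
    for chunk in ranked:
        # Only add a chunk if it fits entirely in the remaining budget.
        if len(kept) + len(chunk) <= budget:
            kept.extend(chunk)
    return sorted(kept)


# Toy importance scores for 8 cached tokens (illustrative values only).
scores = [9, 2, 1, 8, 7, 6, 1, 5]
token_kept = token_level_keep(scores, budget=4)                 # [0, 3, 4, 5]
chunk_kept = chunk_level_keep(scores, budget=4, chunk_size=2)   # [0, 1, 4, 5]
```

With the same budget, the token-level policy retains position 3 without its neighbor at position 2, while the chunk-level policy retains only contiguous pairs. This is the kind of fragmentation the description attributes to per-token eviction.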
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 8.6/10.
Files
paper.pdf (82.5 kB)
md5:c28e2ec867573a15517330123d699a15