ReCompress: Query-Aware Rewriting and Tiered Memory for Efficient LLM Context Compression
Authors/Creators
Description
Large language models face two compounding token inefficiencies: single-turn contexts contain
irrelevant passages that consume budget without contributing to answers, and multi-turn conversations
resend full history every call, causing cumulative cost to grow quadratically with conversation length.
Deletion-based compression approaches are query-independent and cannot drop entire irrelevant
passages; multi-turn memory systems lack explicit protection for the bridging facts that multi-hop
reasoning depends on. We present ReCompress, a two-component system addressing both regimes. A
query-aware rewriting compressor, distilled into a 1.5B student (Qwen2.5-1.5B + LoRA), outperforms
bear-1.1 by +0.252 F1 on HotpotQA while emitting roughly 8.5× fewer tokens (48 vs. 409 at a ratio-
0.3 compression instruction). The gain is significant on multi-hop question answering with distractors
(HotpotQA, and the near-in-distribution 2WikiMultiHop, +0.180 F1) and positive-but-not-significant
on more dissimilar tasks (MuSiQue, SQuAD) at n = 50; we make the narrower claim the data
supports. We further audit our own result: the gap survives an independent solver, and a
mask-the-answer probe shows a substantial share of the margin comes from reliably retaining the
answer-bearing span at a 3.5% budget where deletion truncates it. A tiered multi-turn framework,
RbD-Compress, holds the context sent to the solver flat through protected trauma memory, a
versioned checkpoint stack with rollback, and Echidna, an intelligent trigger that reads trauma
memory before compression decisions, at no measurable loss in answer quality. We scope this flatness
result carefully against per-turn compression overhead and KV-caching assumptions. Our results
show that query-aware rewriting and deletion-based compression serve complementary operating
regimes.
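The quadratic-growth claim for multi-turn conversations can be checked with a little arithmetic. The sketch below is illustrative only: the function names, the 200-token turn size, and the 800-token flat context are hypothetical values chosen for the example, not figures from the paper. It assumes every turn adds the same number of tokens and ignores KV caching and per-turn compression overhead, the same caveats the description raises. If call t resends all t turns so far, the cumulative cost over T turns is k·T(T+1)/2 tokens, whereas a context held flat costs a fixed amount per call.

```python
def cumulative_full_history(turns: int, tokens_per_turn: int) -> int:
    """Total tokens sent when call t resends all t turns so far.

    Equals tokens_per_turn * turns * (turns + 1) / 2, i.e. quadratic in turns.
    """
    return sum(t * tokens_per_turn for t in range(1, turns + 1))


def cumulative_flat_context(turns: int, context_tokens: int) -> int:
    """Total tokens sent when every call sends a fixed-size context (linear in turns)."""
    return turns * context_tokens


if __name__ == "__main__":
    # Hypothetical sizes: 200 tokens per turn, 800-token flat context.
    for turns in (10, 20, 40):
        full = cumulative_full_history(turns, tokens_per_turn=200)
        flat = cumulative_flat_context(turns, context_tokens=800)
        print(f"turns={turns:3d}  full-history={full:7d}  flat={flat:6d}")
```

Under these assumptions, doubling the conversation length roughly quadruples the full-history total while only doubling the flat-context total, which is the gap a flat solver context is meant to close.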
Files
| Name | Size |
|---|---|
| ReCompress__Query_Aware_Rewriting_and_Tiered_Memory_for_Efficient_LLM__Context_Compression.pdf (md5:c7210a97f7fecb0f6c474cdc176ea264) | 1.1 MB |
Additional details
Software
- Repository URL: https://github.com/Kart-ing/ReCompress
- Programming language: Python
- Development Status: Active