ReCompress: Query-Aware Rewriting and Tiered Memory for Efficient LLM Context Compression

Kshirsagar, Parth Sanjay; Pandey, Kartikey

doi:10.5281/zenodo.20786357

Published June 21, 2026 | Version v1

Preprint Open

Large language models face two compounding token inefficiencies: single-turn contexts contain
irrelevant passages that consume budget without contributing to answers, and multi-turn conversations
resend full history every call, causing cumulative cost to grow quadratically with conversation length.
Deletion-based compression approaches are query-independent and cannot drop entire irrelevant
passages; multi-turn memory systems lack explicit protection for the bridging facts that multi-hop
reasoning depends on. We present ReCompress, a two-component system addressing both regimes. A
query-aware rewriting compressor, distilled into a 1.5B student (Qwen2.5-1.5B + LoRA), outperforms
bear-1.1 by +0.252 F1 on HotpotQA while emitting roughly 8.5× fewer tokens (48 vs. 409 at a
ratio-0.3 compression instruction). The gain is significant on multi-hop question answering with distractors
(HotpotQA, and the near-in-distribution 2WikiMultiHop, +0.180 F1) and positive-but-not-significant
on more dissimilar tasks (MuSiQue, SQuAD) at n = 50; we make the narrower claim the data
supports. We further audit the result against ourselves: the gap survives an independent solver, and a
mask-the-answer probe shows a substantial share of the margin comes from reliably retaining the
answer-bearing span at a 3.5% budget where deletion truncates it. A tiered multi-turn framework,
RbD-Compress, holds the context sent to the solver flat through protected trauma memory, a
versioned checkpoint stack with rollback, and Echidna, an intelligent trigger that reads trauma
memory before compression decisions, at no measurable loss in answer quality — a flatness result
we scope carefully against per-turn compression overhead and KV-caching assumptions. Our results
show that query-aware rewriting and deletion-based compression serve complementary operating
regimes.
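
The quadratic-growth claim above can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the ReCompress repository: it assumes every turn adds a fixed number of tokens (`tokens_per_turn`, a hypothetical parameter) and compares resending the full history against a context held flat at a fixed cap (`flat_cap`, also hypothetical). It ignores per-turn compression overhead and KV caching, both of which the abstract notes as scoping caveats.

```python
def cumulative_full_history(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when turn t resends all t turns seen so far.

    Equals tokens_per_turn * turns * (turns + 1) / 2, i.e. quadratic in turns.
    """
    return sum(t * tokens_per_turn for t in range(1, turns + 1))


def cumulative_flat(turns: int, tokens_per_turn: int, flat_cap: int) -> int:
    """Total input tokens when the context sent each turn is capped at flat_cap.

    Assumption: a memory system keeps the per-turn context at or below flat_cap,
    so the cumulative total grows linearly once the cap is reached.
    """
    return sum(min(t * tokens_per_turn, flat_cap) for t in range(1, turns + 1))


if __name__ == "__main__":
    # Hypothetical numbers: 200 tokens added per turn, context capped at 1000.
    for n in (10, 20, 40):
        full = cumulative_full_history(n, 200)
        flat = cumulative_flat(n, 200, 1000)
        print(f"turns={n:>2}  full_history={full:>7}  flat={flat:>6}")
```

Under these assumed numbers, doubling the conversation from 20 to 40 turns raises the full-history total from 42,000 to 164,000 tokens (roughly 4×), while the capped total rises from 18,000 to 38,000 (roughly 2×).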

Files

Name	Size
ReCompress__Query_Aware_Rewriting_and_Tiered_Memory_for_Efficient_LLM__Context_Compression.pdf (md5:c7210a97f7fecb0f6c474cdc176ea264)	1.1 MB

Additional details

Repository URL: https://github.com/Kart-ing/ReCompress
Programming language: Python
Development Status: Active
