Published April 10, 2026 | Version v4
Preprint · Open Access

Think Less, Store Smarter: A Theoretical Framework for Type-Aware KV Cache Quantization in Large Reasoning Models

Description

This paper introduces the Think-Answer Quantization Gap (TAQG), a theoretical framework showing that uniform KV cache quantization is provably suboptimal for large reasoning models whenever think-phase and answer-phase tokens differ in pairwise cosine redundancy. The framework is direction-agnostic: it prescribes fewer bits for whichever phase exhibits higher redundancy. Empirical validation on DeepSeek-R1-Distill-Qwen-1.5B reveals a surprising model-size-dependent redundancy reversal: answer-phase tokens exhibit higher redundancy than think-phase tokens, the opposite of findings on the full 671B model. Code and experimental data are included.
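The description's core recipe — measure pairwise cosine redundancy per phase, then assign fewer bits to the more redundant phase — can be sketched as follows. This is an illustrative reconstruction, not the paper's released code; the function names, the 2-bit/4-bit budget, and the synthetic key vectors are all assumptions for demonstration.

```python
import numpy as np

def mean_pairwise_cosine(vectors: np.ndarray) -> float:
    """Mean pairwise cosine similarity among row vectors (a redundancy proxy)."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(vectors)
    # Average over off-diagonal entries only (exclude each vector's self-similarity).
    return float((sims.sum() - n) / (n * (n - 1)))

def assign_bits(think_keys: np.ndarray, answer_keys: np.ndarray,
                low_bits: int = 2, high_bits: int = 4) -> dict:
    """Direction-agnostic allocation: fewer bits to the higher-redundancy phase."""
    r_think = mean_pairwise_cosine(think_keys)
    r_answer = mean_pairwise_cosine(answer_keys)
    if r_think > r_answer:
        return {"think": low_bits, "answer": high_bits}
    return {"think": high_bits, "answer": low_bits}

# Synthetic example mirroring the 1.5B finding: answer-phase keys are
# more redundant (small perturbations of one direction) than think-phase keys.
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 64))
answer = base + 0.05 * rng.normal(size=(40, 64))  # highly redundant
think = rng.normal(size=(40, 64))                 # near-orthogonal, low redundancy
print(assign_bits(think, answer))  # → {'think': 4, 'answer': 2}
```

Under these synthetic inputs the answer phase wins the redundancy comparison, so it receives the low bit budget, matching the reversal reported for the distilled 1.5B model.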

Files (556.5 kB)

taqg_paper.pdf — 289.7 kB — md5:dacadcd1a809bb823395225539b63c34
(second file, name not captured) — 266.8 kB — md5:4f85b7343c747ba1533cbbd2e9fcf140