Think Less, Store Smarter: A Theoretical Framework for Type-Aware KV Cache Quantization in Large Reasoning Models
Authors/Creators
Description
This paper introduces the Think-Answer Quantization Gap (TAQG), a theoretical framework proving that uniform KV cache quantization is provably suboptimal for large reasoning models whenever think-phase and answer-phase tokens differ in pairwise cosine redundancy. The framework is direction-agnostic: it prescribes fewer bits for whichever phase exhibits higher redundancy. Empirical validation on DeepSeek-R1-Distill-Qwen-1.5B reveals a surprising model-size-dependent redundancy reversal, in which answer-phase tokens exhibit higher redundancy than think-phase tokens, the opposite of findings on the full 671B model. Code and experimental data are included.
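The direction-agnostic rule described above can be sketched in a few lines: measure each phase's redundancy as the mean pairwise cosine similarity of its cached vectors, then assign fewer bits to the more redundant phase. This is an illustrative sketch only, not the paper's released code; the function names, the 2-bit/4-bit budgets, and the toy data are all assumptions made for the example.

```python
import numpy as np

def pairwise_cosine_redundancy(vectors: np.ndarray) -> float:
    """Mean pairwise cosine similarity among one phase's cached vectors
    (hypothetical proxy for the paper's redundancy measure)."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(vectors)
    # Average over off-diagonal pairs only (exclude each vector with itself).
    return float((sims.sum() - n) / (n * (n - 1)))

def allocate_bits(think_vecs, answer_vecs, low_bits=2, high_bits=4):
    """Direction-agnostic rule: the higher-redundancy phase gets fewer bits."""
    r_think = pairwise_cosine_redundancy(think_vecs)
    r_answer = pairwise_cosine_redundancy(answer_vecs)
    if r_think > r_answer:
        return {"think": low_bits, "answer": high_bits}
    return {"think": high_bits, "answer": low_bits}

# Toy demo: spread-out think vectors vs. tightly clustered answer vectors,
# mimicking the redundancy reversal reported for the 1.5B distilled model.
rng = np.random.default_rng(0)
think = rng.normal(size=(32, 64))                                    # near-isotropic
answer = rng.normal(size=(1, 64)) + 0.1 * rng.normal(size=(32, 64))  # clustered
print(allocate_bits(think, answer))
```

On the toy data the clustered answer-phase vectors score higher redundancy, so they receive the low bit budget, matching the direction-agnostic prescription.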
Files
(556.5 kB)

| Name | Size | MD5 |
|---|---|---|
| taqg_paper.pdf | 289.7 kB | dacadcd1a809bb823395225539b63c34 |
| | 266.8 kB | 4f85b7343c747ba1533cbbd2e9fcf140 |