Binary Token Memory: A Scalable Compression Framework for Efficient LLM Inference
Description
This work presents the fully binary evolution of our previously published Visual Token Memory concept. Where the original study explored visual encoding via PNG containers, this paper formalizes an efficient, Zstandard-compressed binary container format for storing and reusing tokenized representations of LLM conversation histories in scalable LLM applications.
Unlike prior approaches such as image-based Visual Token Memory (VTM v1), the new Binary Token Memory (VTM v2) leverages Zstandard-compressed raw token arrays with minimal headers.
The result is a scalable, GPU-compatible, and persistent memory format that eliminates re-tokenization latency while preserving exact model input structure.
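The container described above can be sketched as a minimal header (magic bytes, format version, token count) followed by a compressed array of raw token IDs. The sketch below is illustrative, not the published specification: the names `MAGIC`, `pack_tokens`, and `unpack_tokens` are assumptions, and `zlib` stands in for Zstandard so the example stays self-contained.

```python
# Minimal sketch of a binary token memory container.
# Assumed layout (not the published spec): 4-byte magic, 1-byte version,
# 4-byte token count, then a compressed array of uint32 token IDs.
# zlib is used here as a stand-in for Zstandard.
import struct
import zlib

MAGIC = b"VTM2"   # hypothetical magic bytes
VERSION = 1
HEADER = "<4sBI"  # magic, version, token count (little-endian, unpadded)

def pack_tokens(tokens):
    """Serialize a list of token IDs into the container format."""
    raw = struct.pack(f"<{len(tokens)}I", *tokens)  # uint32 little-endian
    payload = zlib.compress(raw)                    # Zstd in the real format
    return struct.pack(HEADER, MAGIC, VERSION, len(tokens)) + payload

def unpack_tokens(blob):
    """Validate the header and recover the exact token IDs."""
    magic, version, count = struct.unpack_from(HEADER, blob)
    if magic != MAGIC or version != VERSION:
        raise ValueError("not a recognized token memory container")
    raw = zlib.decompress(blob[struct.calcsize(HEADER):])
    return list(struct.unpack(f"<{count}I", raw))

# Round trip: the decoded IDs are byte-exact, so no re-tokenization is needed.
tokens = [101, 2023, 2003, 1037, 3231, 102]
assert unpack_tokens(pack_tokens(tokens)) == tokens
```

Because the payload is the token array itself rather than text, loading a stored conversation skips the tokenizer entirely while reproducing the exact model input.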
Real-world benchmarks conducted on a modest Intel i5 system demonstrate significant speed and storage improvements over traditional JSON or PNG formats.
The system supports batch encoding/decoding, precise validation, and future extensions for encrypted or multimodal memory.
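One way batch encoding and precise validation could fit together is to store several token lists in one file, each record carrying a digest of its compressed payload so corruption is caught before decompression. This is a hedged sketch under assumed record layout and function names (`encode_batch`, `decode_batch`, SHA-256 digests, `zlib` standing in for Zstandard), not the published implementation.

```python
# Sketch of a batch store: [record count][len | sha256 digest | payload]...
# Each payload is a compressed uint32 token array; the digest allows
# validating a record without decompressing it. zlib stands in for Zstd.
import hashlib
import struct
import zlib

def encode_batch(conversations):
    """Pack several token lists into one validated store."""
    out = [struct.pack("<I", len(conversations))]
    for tokens in conversations:
        payload = zlib.compress(struct.pack(f"<{len(tokens)}I", *tokens))
        digest = hashlib.sha256(payload).digest()
        out.append(struct.pack("<I", len(payload)) + digest + payload)
    return b"".join(out)

def decode_batch(blob):
    """Verify every record's digest, then recover each token list exactly."""
    (count,) = struct.unpack_from("<I", blob, 0)
    off, result = 4, []
    for _ in range(count):
        (size,) = struct.unpack_from("<I", blob, off)
        digest = blob[off + 4:off + 36]            # 32-byte SHA-256
        payload = blob[off + 36:off + 36 + size]
        if hashlib.sha256(payload).digest() != digest:
            raise ValueError("corrupted record")
        raw = zlib.decompress(payload)
        result.append(list(struct.unpack(f"<{len(raw) // 4}I", raw)))
        off += 36 + size
    return result
```

A per-record digest also leaves room for the extensions mentioned above: an encrypted variant could authenticate ciphertext the same way, and multimodal records could reuse the length-prefixed framing.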
Aimed at improving production LLM throughput, this method is defensively published to establish prior art and is shared under a CC BY-NC 4.0 license.
Files (746.3 kB total)
Binary_Token_Memory_Zenodo.pdf
MD5 | Size
---|---
f5ead4dd5f23c0c6a649ed1de6e78173 | 2.9 kB
09be41b376fc33b620590cc050387c29 | 697.1 kB
cb1cdb18d8493e6f16a3135228b31b53 | 770 bytes
9882c630601f707eaf0e57629b66c26a | 4.3 kB
c527472c39bfd0d4a1972508b09bceec | 3.8 kB
2a9eb6da332ee0d3e3c70d1b4b14c4b5 | 1.9 kB
e395668d0134ef4407d6cdd92261e718 | 35.5 kB
Additional details
Related works
- Is previous version of: Publication 10.5281/zenodo.15291754 (DOI)