Published April 29, 2025 | Version 1.0
Publication · Open access

Binary Token Memory: A Scalable Compression Framework for Efficient LLM Inference

Creators

  • Hekim Net

Description

This work presents the updated and fully binary-based evolution of our previously published Visual Token Memory concept. While the original study explored visual encoding via PNG containers, this paper formalizes an efficient Zstd-compressed binary token memory format designed for scalable LLM applications.

Concretely, the paper specifies a lightweight and efficient binary container format for storing and reusing tokenized representations of LLM conversation histories.
Unlike prior approaches such as image-based Visual Token Memory (VTM v1), the new Binary Token Memory (VTM v2) leverages Zstandard-compressed raw token arrays with minimal headers.
The result is a scalable, GPU-compatible, and persistent memory format that eliminates re-tokenization latency while preserving exact model input structure.
Real-world benchmarks conducted on a modest Intel i5 system demonstrate significant speed and storage improvements over traditional JSON or PNG formats.
The system supports batch encoding/decoding, precise validation, and future extensions for encrypted or multimodal memory.
Aimed at improving production LLM throughput, this method is defensively published to establish prior art and is shared under a CC BY-NC 4.0 license.
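The record page does not disclose the exact container layout; only the published PDF does. Purely as an illustration of the idea described above (a minimal header followed by a compressed raw token array), a sketch in Python might look like the following. The magic bytes, header fields, and the use of zlib (a stand-in for Zstandard, chosen only to keep the sketch standard-library-only) are all assumptions, not the published format.

```python
import struct
import zlib  # stand-in for Zstandard; the actual format uses zstd compression


MAGIC = b"VTM2"  # hypothetical magic bytes, not the published header


def encode_tokens(tokens: list[int]) -> bytes:
    """Pack token IDs as little-endian uint32, compress, prepend a minimal header."""
    raw = struct.pack(f"<{len(tokens)}I", *tokens)
    payload = zlib.compress(raw, level=9)
    # Header: magic bytes + token count, so decoding can validate the input.
    header = MAGIC + struct.pack("<I", len(tokens))
    return header + payload


def decode_tokens(blob: bytes) -> list[int]:
    """Validate the header, decompress, and restore the exact token sequence."""
    if blob[:4] != MAGIC:
        raise ValueError("not a recognized token-memory container")
    (count,) = struct.unpack("<I", blob[4:8])
    raw = zlib.decompress(blob[8:])
    return list(struct.unpack(f"<{count}I", raw))
```

Because the tokens round-trip byte-exactly, the model input structure is preserved and no re-tokenization is needed when a stored history is reloaded.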

Files

Binary_Token_Memory_Zenodo.pdf

Files (746.3 kB)

    2.9 kB   md5:f5ead4dd5f23c0c6a649ed1de6e78173
  697.1 kB   md5:09be41b376fc33b620590cc050387c29
  770 Bytes  md5:cb1cdb18d8493e6f16a3135228b31b53
    4.3 kB   md5:9882c630601f707eaf0e57629b66c26a
    3.8 kB   md5:c527472c39bfd0d4a1972508b09bceec
    1.9 kB   md5:2a9eb6da332ee0d3e3c70d1b4b14c4b5
   35.5 kB   md5:e395668d0134ef4407d6cdd92261e718

Additional details

Related works

Is previous version of
Publication: 10.5281/zenodo.15291754 (DOI)

Software

Programming language
Python, C