Published January 27, 2026 | Version v1
Publication Open

BitMamba-2: Efficient Scaling of 1.58-bit State Space Models

Authors/Creators

Description

The scaling of Large Language Models (LLMs) is traditionally constrained by the quadratic complexity of Transformers and the memory bandwidth bottleneck associated with high-precision weights. While State Space Models (SSMs) like Mamba have addressed the sequence scaling limitation with linear-time complexity, the memory footprint remains a challenge for edge deployment. In this work, we introduce BitMamba-2, a hybrid architecture that integrates the 1.58-bit ternary quantization of BitNet into the Mamba-2 framework. We train two models from scratch, a 255M-parameter baseline and a scaled-up 1B-parameter model, utilizing a high-quality dataset mix comprising FineWeb-Edu, Cosmopedia, and The Stack-Dedup. Our experiments, conducted on Google Cloud TPU v6e hardware, demonstrate strong scaling laws for ternary SSMs: the 1B model achieves a 7.8% improvement in ARC-Easy accuracy (63.3%) and a dramatic reduction in perplexity (from 51.69 to 29.62) compared to the 255M baseline. Furthermore, we demonstrate that BitMamba-2 enables high-performance inference on consumer CPUs, achieving ∼53 tokens/second on an Intel i3 processor with a memory footprint of just 621 MB. Code and pre-trained checkpoints are publicly available at https://github.com/Zhayr1/BitMamba-2, https://huggingface.co/Zhayr1/BitMamba-2-1B, and https://huggingface.co/Zhayr1/BitMamba-2-0.25B.
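
To make the 1.58-bit scheme concrete, the sketch below shows BitNet-style absmean ternary quantization of a weight matrix, the quantization the description refers to. The function names and the per-tensor scaling choice are illustrative assumptions and do not come from the BitMamba-2 codebase, which may differ in details such as per-channel scales or gradient handling during training.

# Minimal sketch of BitNet-style 1.58-bit (ternary) weight quantization.
# Names and the per-tensor scale are illustrative assumptions, not taken
# from the BitMamba-2 repository.
import torch
import torch.nn.functional as F

def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight tensor to {-1, 0, +1} with a per-tensor scale.

    Follows the absmean scheme popularized by BitNet b1.58: divide by the
    mean absolute value, then round and clip to the ternary set.
    """
    scale = w.abs().mean().clamp(min=eps)    # per-tensor absmean scale
    w_q = (w / scale).round().clamp(-1, 1)   # ternary values in {-1, 0, 1}
    return w_q, scale

def ternary_linear(x: torch.Tensor, w: torch.Tensor, bias=None):
    """Linear layer forward pass using ternary weights.

    With weights restricted to {-1, 0, +1}, the matrix product reduces to
    additions and subtractions of activations, which is what makes the small
    memory footprint and CPU throughput reported above possible.
    """
    w_q, scale = absmean_ternary_quantize(w)
    return F.linear(x, w_q * scale, bias)

if __name__ == "__main__":
    x = torch.randn(2, 8)
    w = torch.randn(16, 8)
    y = ternary_linear(x, w)
    print(y.shape)  # torch.Size([2, 16])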

Files (726.0 kB)

BitMamba-2: Efficient Scaling of 1.58-bit State Space Models.pdf
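
For readers who want the released weights rather than the PDF, the following minimal sketch downloads the 1B checkpoint from the Hugging Face Hub. The repo id comes from the links in the description; how the downloaded files are loaded afterwards depends on the code in the GitHub repository and is not assumed here.

# Minimal sketch: fetch the released BitMamba-2-1B checkpoint files.
# Assumes only the public repo id from the links above; loading the weights
# is handled by the code in the GitHub repository and is not shown here.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Zhayr1/BitMamba-2-1B")
print(f"Checkpoint files downloaded to: {local_dir}")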

Additional details

Software

Repository: https://github.com/Zhayr1/BitMamba-2