Published February 1, 2026 | Version v1
Preprint
Open
The Inefficiency of BPE Tokenizers on Symbolic Languages: An Empirical Study and a Simple Fix
Description
BPE tokenizers systematically fragment compound symbols from specialized symbolic languages into 2 to 5 sub-tokens, which negates their compression gains. We measure this effect on 43 pairs using the Qwen 2.5 tokenizer (151,665 tokens). Adding only 26 domain-specific tokens, a 0.017% vocabulary increase, improves mean compression by 112.4% (from 2.65x to 5.62x). Applied to a 200K-token context window, this represents a gain of 595K effective tokens.
Also available in French: L'inefficience des tokenizers BPE sur les langages symboliques
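The headline figures in the description can be reproduced, approximately, from the rounded values it quotes. The sketch below is an illustration only: the function names are hypothetical, and it assumes that "effective tokens" means the context window multiplied by the mean compression ratio, which the description does not state explicitly. With the rounded inputs, the relative improvement comes out near 112.1% and the gain near 594K, slightly below the quoted 112.4% and 595K; the difference is presumably due to rounding of the published ratios.

```python
# Figures quoted in the description (rounded as published).
BASE_VOCAB = 151_665      # Qwen 2.5 tokenizer vocabulary size
ADDED_TOKENS = 26         # domain-specific tokens added
RATIO_BEFORE = 2.65       # mean compression ratio before the change
RATIO_AFTER = 5.62        # mean compression ratio after the change
CONTEXT_WINDOW = 200_000  # context window, in tokens


def vocab_increase_pct(base: int, added: int) -> float:
    """Vocabulary growth as a percentage of the base vocabulary."""
    return 100.0 * added / base


def relative_improvement_pct(before: float, after: float) -> float:
    """Relative change in mean compression ratio, in percent."""
    return 100.0 * (after - before) / before


def effective_token_gain(window: int, before: float, after: float) -> float:
    """Extra effective tokens, ASSUMING effective = window * ratio."""
    return window * (after - before)


if __name__ == "__main__":
    print(f"vocabulary increase:  {vocab_increase_pct(BASE_VOCAB, ADDED_TOKENS):.3f}%")
    print(f"relative improvement: {relative_improvement_pct(RATIO_BEFORE, RATIO_AFTER):.1f}%")
    print(f"effective token gain: {effective_token_gain(CONTEXT_WINDOW, RATIO_BEFORE, RATIO_AFTER):,.0f}")
```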
Files
ava-tokenizer-en.pdf
Files (150.1 kB)
| Checksum | Size |
|---|---|
| md5:793f97afd5e05de44fbd97b6155038f8 | 74.3 kB |
| md5:fd7d2b44d5bb935c995153f03bf47065 | 75.7 kB |
Additional details
Related works
- Is supplement to: https://github.com/AdrienAvalon/avalon-research (URL)