The Inefficiency of BPE Tokenizers on Symbolic Languages: An Empirical Study and a Simple Fix

Cros, Adrien; Ava

doi:10.5281/zenodo.18770879

Published February 1, 2026 | Version v1

Preprint | Open access

1. Avalon Research, Independent
2. AI Agent — Claude Opus 4.6

Byte-pair encoding (BPE) tokenizers systematically fragment compound symbols from specialized symbolic languages into 2 to 5 sub-tokens, negating their compression gains. We measure this effect on 43 pairs using the Qwen 2.5 tokenizer (vocabulary of 151,665 tokens). Adding only 26 domain-specific tokens, a 0.017% vocabulary increase, improves mean compression by 112.4% (from 2.65x to 5.62x). Applied to a 200K-token context window, this represents a gain of 595K effective tokens.
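The arithmetic behind these headline figures can be checked with a short sketch. It uses only the rounded values quoted in the abstract, and it assumes that "effective tokens" means the context window multiplied by the compression ratio, an interpretation the abstract does not spell out. The function and variable names are illustrative and do not come from the paper.

```python
# Rounded figures as quoted in the abstract.
BASE_VOCAB = 151_665       # Qwen 2.5 tokenizer vocabulary size
ADDED_TOKENS = 26          # domain-specific tokens added
RATIO_BEFORE = 2.65        # mean compression before the fix
RATIO_AFTER = 5.62         # mean compression after the fix
CONTEXT_WINDOW = 200_000   # context window in tokens


def vocab_increase_pct(base: int, added: int) -> float:
    """Relative vocabulary growth, in percent."""
    return 100.0 * added / base


def relative_improvement_pct(before: float, after: float) -> float:
    """Relative change in compression ratio, in percent."""
    return 100.0 * (after - before) / before


def effective_tokens(window: int, ratio: float) -> float:
    """Assumed definition: window size times compression ratio."""
    return window * ratio


vocab_pct = vocab_increase_pct(BASE_VOCAB, ADDED_TOKENS)
improvement_pct = relative_improvement_pct(RATIO_BEFORE, RATIO_AFTER)
gain = (effective_tokens(CONTEXT_WINDOW, RATIO_AFTER)
        - effective_tokens(CONTEXT_WINDOW, RATIO_BEFORE))

print(f"vocabulary increase: {vocab_pct:.3f}%")
print(f"compression improvement: {improvement_pct:.1f}%")
print(f"effective-token gain: {gain:,.0f}")
```

With the rounded ratios, this gives roughly 0.017%, 112.1%, and 594,000 tokens. The small differences from the reported 112.4% and 595K are presumably due to the paper computing from unrounded ratios, though the record does not state this.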

Also available in French: L'inefficience des tokenizers BPE sur les langages symboliques

Files (150.1 kB)

Name	MD5	Size
ava-tokenizer-en.pdf	793f97afd5e05de44fbd97b6155038f8	74.3 kB
ava-tokenizer-fr.pdf	fd7d2b44d5bb935c995153f03bf47065	75.7 kB

Additional details

Is supplement to: https://github.com/AdrienAvalon/avalon-research

	All versions	This version
Views	120	120
Downloads	20	20
Data volume	1.5 MB	1.5 MB


Resource type: Preprint
Publisher: Zenodo
Languages: English

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: February 25, 2026
Modified: February 25, 2026
