Published February 1, 2026 | Version v1
Preprint Open

The Inefficiency of BPE Tokenizers on Symbolic Languages: An Empirical Study and a Simple Fix

Authors/Creators

  • 1. Avalon Research, Independent
  • 2. AI Agent — Claude Opus 4.6

Description

BPE tokenizers systematically fragment compound symbols from specialized symbolic languages into 2-5 sub-tokens, negating their compression gains. We measure this on 43 pairs using the Qwen 2.5 tokenizer (151,665 tokens). Adding only 26 domain-specific tokens (a 0.017% vocabulary increase) improves mean compression by 112.4%, from 2.65x to 5.62x. Applied to a 200K-token context window, this represents a gain of 595K effective tokens.
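The headline figures in the description can be checked with a few lines of arithmetic. The sketch below is illustrative and not taken from the paper: the function names are hypothetical, and it assumes that "effective tokens" means the context window multiplied by the compression ratio. Using the rounded ratios quoted above, it gives roughly 112.1% and 594K rather than the reported 112.4% and 595K. The small gap is presumably because the reported values were computed from unrounded ratios, which is also an assumption.

```python
# Figures quoted in the description (rounded as published).
BASE_VOCAB = 151_665      # Qwen 2.5 tokenizer vocabulary size
ADDED_TOKENS = 26         # domain-specific tokens added
RATIO_BEFORE = 2.65       # mean compression before the fix
RATIO_AFTER = 5.62        # mean compression after the fix
CONTEXT_WINDOW = 200_000  # context window size in tokens


def vocab_increase_pct(base: int, added: int) -> float:
    """Vocabulary growth as a percentage of the base vocabulary."""
    return 100.0 * added / base


def relative_improvement_pct(before: float, after: float) -> float:
    """Relative change in compression ratio, in percent."""
    return 100.0 * (after - before) / before


def effective_token_gain(window: int, before: float, after: float) -> float:
    """Extra effective tokens, ASSUMING effective tokens = window * ratio."""
    return window * (after - before)


if __name__ == "__main__":
    print(f"vocabulary increase: {vocab_increase_pct(BASE_VOCAB, ADDED_TOKENS):.3f}%")
    print(f"relative improvement: {relative_improvement_pct(RATIO_BEFORE, RATIO_AFTER):.1f}%")
    print(f"effective token gain: {effective_token_gain(CONTEXT_WINDOW, RATIO_BEFORE, RATIO_AFTER):,.0f}")
```

The vocabulary increase reproduces the stated 0.017% exactly, since it depends only on the two integer counts.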

Also available in French: L'inefficience des tokenizers BPE sur les langages symboliques

Files

ava-tokenizer-en.pdf

Files (150.1 kB)

md5:793f97afd5e05de44fbd97b6155038f8 (74.3 kB)
md5:fd7d2b44d5bb935c995153f03bf47065 (75.7 kB)

Additional details

Related works