AG-BPE: Attention-Guided Byte-Pair Encoding for Semantic-Aware Tokenization
Creators
Théo CHARLET
Description
Standard subword tokenization methods such as Byte-Pair Encoding (BPE) are foundational to modern LLMs but often produce suboptimal, semantically incoherent tokens. We introduce Attention-Guided BPE (AG-BPE), a novel approach that enhances the BPE algorithm with a semantic-aware guidance mechanism: a lightweight Transformer encoder informs merge decisions through a hybrid score, favoring tokens that are both frequent and semantically coherent. Benchmarked against industry standards, AG-BPE, trained on a modest 302 MB dataset, demonstrates a state-of-the-art compression ratio (3.77x) with a vocabulary up to 12 times more compact than those of its competitors. It combines this efficiency with perfect robustness on modern, multilingual text and a decoding speed over 30x faster than these baselines. Qualitative analysis reveals its distinctive ability to identify and preserve fundamental morphological units, offering a promising direction for more efficient and interpretable language models.
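For intuition, here is a minimal sketch of how a merge decision could blend pair frequency with an attention-derived coherence term. The function names, the stub `coherence` value, and the weighting coefficient `lam` are illustrative assumptions, not the paper's actual implementation (see the repository below for that); in AG-BPE the coherence signal would come from the lightweight Transformer encoder's attention between adjacent symbols.

```python
# Sketch of one attention-guided BPE merge step (illustrative, not the
# official AG-BPE code). Hybrid score = normalized frequency + lam * coherence.
from collections import Counter

def pair_frequencies(corpus):
    """Count adjacent symbol pairs across a tokenized corpus."""
    freqs = Counter()
    for tokens in corpus:
        for a, b in zip(tokens, tokens[1:]):
            freqs[(a, b)] += 1
    return freqs

def coherence(pair):
    """Placeholder for the attention-based coherence score in [0, 1].
    A real implementation would average the encoder's attention weights
    between the two symbols over the corpus; here it is a constant stub."""
    return 0.5

def best_merge(corpus, lam=0.5):
    """Pick the pair maximizing the hybrid score."""
    freqs = pair_frequencies(corpus)
    if not freqs:
        return None
    max_freq = max(freqs.values())
    return max(freqs, key=lambda p: freqs[p] / max_freq + lam * coherence(p))

def apply_merge(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = pair[0] + pair[1]
    out = []
    for tokens in corpus:
        new, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                new.append(merged)
                i += 2
            else:
                new.append(tokens[i])
                i += 1
        out.append(new)
    return out

corpus = [list("lower"), list("lowest"), list("slower")]
pair = best_merge(corpus)
print(pair, apply_merge(corpus, pair)[0])
```

With a non-constant coherence term, two pairs of equal frequency are ranked by how semantically cohesive the encoder judges them, which is how merges can come to track morphological units rather than raw co-occurrence alone.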
Files
Attention_Guided_BPE__AG_BPE_Théo_CHARLET.pdf
Additional details
Software
- Repository URL: https://github.com/RDTvlokip/AG-BPE
- Programming language: Python
- Development Status: Active