Published July 11, 2025 | Version 2.1.0
Preprint | Open Access

AG-BPE: Attention-Guided Byte-Pair Encoding for Semantic-Aware Tokenization

Description

Standard subword tokenization methods such as Byte-Pair Encoding (BPE) are foundational to modern LLMs but often produce suboptimal, semantically incoherent tokens. We introduce Attention-Guided BPE (AG-BPE), a novel approach that enhances the BPE algorithm with a semantic-aware guidance mechanism: a lightweight Transformer encoder informs merge decisions through a hybrid score that favors tokens which are both frequent and semantically coherent. Benchmarked against industry-standard tokenizers, AG-BPE, trained on a modest 302 MB dataset, achieves a state-of-the-art compression ratio (3.77x) with a vocabulary up to 12 times more compact than its competitors'. It combines this efficiency with perfect robustness on modern, multilingual text and a decoding speed over 30x faster than the baselines. Qualitative analysis reveals its unique ability to identify and preserve fundamental morphological units, offering a promising direction for more efficient and interpretable language models.
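To make the hybrid-score idea concrete, here is a minimal Python sketch of one attention-guided merge step. It is not the repository's actual implementation: the weighting parameter `alpha`, the `attention_scores` lookup, and the `coherence` helper are all assumptions standing in for the attention signal that, per the description, comes from a lightweight Transformer encoder.

```python
from collections import Counter

def pair_frequencies(corpus):
    """Count adjacent symbol pairs across the corpus (standard BPE step)."""
    freqs = Counter()
    for word, count in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            freqs[(a, b)] += count
    return freqs

def coherence(pair, attention_scores):
    """Placeholder for the attention-derived semantic signal.

    In AG-BPE this would come from the encoder's attention over the
    pair's contexts; here it is a simple lookup with a neutral default,
    purely for illustration.
    """
    return attention_scores.get(pair, 0.0)

def hybrid_score(pair, freqs, attention_scores, alpha=0.5):
    """Blend normalized pair frequency with coherence (assumed weighting)."""
    freq_term = freqs[pair] / max(freqs.values())
    return (1 - alpha) * freq_term + alpha * coherence(pair, attention_scores)

def merge_pair(corpus, pair):
    """Apply one merge: rewrite 'a b' as 'ab' in every word.

    A naive string replace is enough for this toy example; real BPE
    implementations match symbol boundaries explicitly.
    """
    target, replacement = " ".join(pair), "".join(pair)
    return {word.replace(target, replacement): count
            for word, count in corpus.items()}

# Toy corpus: space-separated symbols, word -> frequency.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
# Hypothetical attention scores favoring the 'est' morpheme.
attention_scores = {("e", "s"): 0.9, ("s", "t"): 0.8}

freqs = pair_frequencies(corpus)
best = max(freqs, key=lambda p: hybrid_score(p, freqs, attention_scores))
corpus = merge_pair(corpus, best)
print("merged:", best, "->", corpus)
```

On this toy input, the pairs ('e', 's') and ('s', 't') are tied on raw frequency; the hypothetical attention signal breaks the tie toward ('e', 's'), the first step in isolating the suffix "est", which mirrors the description's point about preserving morphological units that frequency alone cannot distinguish.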

Files (691.6 kB)

Attention_Guided_BPE__AG_BPE_Théo_CHARLET.pdf

482.0 kB (md5:2543be5c1384ae6917159a0ee561ba3f)
187.0 kB (md5:a2dc946d6eb904c7dfd8de1047318be6)
10.2 kB (md5:cc9c6209cd4435cee0049a1a50acf047)
12.5 kB (md5:d3726180cae03fe78229b39e2352720e)

Additional details

Software

Repository URL: https://github.com/RDTvlokip/AG-BPE
Programming language: Python
Development Status: Active