
Published July 24, 2025 | Version 2.0.0
Preprint | Open Access

AG-BPE: Advanced Benchmarking and Dataset Improvements

Description

Standard subword tokenization methods like Byte-Pair Encoding (BPE) are foundational to modern large language models but operate purely on statistical frequency, ignoring the semantic coherence of the tokens they create. This can lead to suboptimal segmentation that splits meaningful morphological units. We introduce Attention-Guided BPE (AG-BPE), a novel approach that enhances the BPE algorithm by incorporating a semantic-aware guidance mechanism. Instead of relying solely on frequency, AG-BPE's merge decisions are informed by a hybrid score that combines co-occurrence statistics with contextual attention scores from a lightweight Transformer encoder. This process favors the creation of tokens that are not only frequent but also semantically coherent. Through a series of benchmarks against the standard tokenizers of GPT-2, BERT, and T5, we demonstrate that AG-BPE, despite using a more compact vocabulary, achieves superior vocabulary efficiency and perfect reconstruction fidelity. Qualitative analysis further reveals its unique ability to identify and preserve fundamental morphological units, offering a promising direction for creating more interpretable and compositionally effective vocabularies.
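To make the merge criterion concrete, the following is a minimal Python sketch of a single attention-guided merge step, written under stated assumptions: the names merge_scores, apply_merge, attention_fn, and the weight lam are invented for this illustration and are not the AG-BPE API, and attention_fn stands in as a black box for the contextual scores the paper obtains from its lightweight Transformer encoder. For the actual implementation, see the repository linked under Additional details.

    from collections import Counter

    def merge_scores(corpus_tokens, attention_fn, lam=0.5):
        """Rank candidate merges by a hybrid of pair frequency and attention.

        corpus_tokens: list of token sequences (e.g., words as character lists).
        attention_fn:  callable mapping an adjacent token pair to a contextual
                       attention score in [0, 1]; a stand-in for the encoder.
        lam:           assumed weight balancing statistics vs. semantic guidance.
        """
        pair_freq = Counter()
        for seq in corpus_tokens:
            for a, b in zip(seq, seq[1:]):
                pair_freq[(a, b)] += 1
        if not pair_freq:
            return {}
        max_freq = max(pair_freq.values())
        # Hybrid score: normalized co-occurrence frequency plus a weighted
        # contextual attention term for the same adjacent pair.
        return {
            pair: freq / max_freq + lam * attention_fn(pair)
            for pair, freq in pair_freq.items()
        }

    def apply_merge(seq, pair, merged):
        """Replace every adjacent occurrence of `pair` in `seq` with `merged`."""
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        return out

    # Usage with a stub attention function that favors the suffix pair ("e", "r"):
    words = [list("lower"), list("slower"), list("power")]
    scores = merge_scores(words, attention_fn=lambda p: 1.0 if p == ("e", "r") else 0.0)
    best = max(scores, key=scores.get)  # ("e", "r") under this stub
    words = [apply_merge(w, best, "".join(best)) for w in words]

Under plain BPE the most frequent pair always wins the merge; in this sketch, a pair that receives strong contextual attention can outrank an equally or more frequent but semantically weaker pair, which illustrates how morphological units such as common suffixes can survive as whole tokens.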

Files

Attention_Guided_BPE__AG_BPE_Théo_CHARLET.pdf (193.1 kB)


Additional details

Software

Repository URL
https://github.com/RDTvlokip/AG-BPE
Programming language
Python
Development Status
Active