AG-BPE: Advanced Benchmarking and Dataset Improvements
Creators
Théo CHARLET
Description
Standard subword tokenization methods such as Byte-Pair Encoding (BPE) are foundational to modern large language models, but their merge decisions rely on purely statistical frequency, ignoring the semantic coherence of the tokens they create. This can lead to suboptimal segmentations that split meaningful morphological units. We introduce Attention-Guided BPE (AG-BPE), a novel approach that enhances the BPE algorithm with a semantic-aware guidance mechanism. Instead of relying solely on frequency, AG-BPE's merge decisions are informed by a hybrid score that combines co-occurrence statistics with contextual attention scores from a lightweight Transformer encoder. This process favors the creation of tokens that are not only frequent but also semantically coherent. Through a series of benchmarks against the standard tokenizers used by GPT-2, BERT, and T5, we demonstrate that AG-BPE, despite using a more compact vocabulary, achieves superior vocabulary efficiency and perfect reconstruction fidelity. Qualitative analysis further reveals its unique ability to identify and preserve fundamental morphological units, offering a promising direction for more interpretable and compositionally effective vocabularies.
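As a rough illustration of the mechanism described above (not the published formulation, which is specified in the accompanying PDF and repository), the sketch below implements a single merge step of a BPE loop whose pair ranking blends co-occurrence frequency with a precomputed attention score. The `attention_scores` lookup, the blending weight `lam`, and the multiplicative combination are illustrative assumptions; in AG-BPE the attention signal would come from the lightweight Transformer encoder.

```python
import re
from collections import Counter

def pair_frequencies(corpus):
    """Count adjacent symbol pairs in a corpus of space-separated words."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def ag_bpe_merge_step(corpus, attention_scores, lam=0.5):
    """Perform one attention-guided merge step.

    Hybrid score (illustrative): (1 - lam) * freq + lam * freq * attention,
    where `attention_scores` maps symbol pairs to coherence scores in [0, 1].
    In AG-BPE these scores would come from the lightweight Transformer encoder.
    """
    pairs = pair_frequencies(corpus)
    if not pairs:
        return None, corpus

    def score(pair):
        freq = pairs[pair]
        return (1 - lam) * freq + lam * freq * attention_scores.get(pair, 0.0)

    best = max(pairs, key=score)
    # Merge the winning pair wherever it occurs as two whole adjacent symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
    merged = Counter()
    for word, freq in corpus.items():
        merged[pattern.sub("".join(best), word)] += freq
    return best, dict(merged)

# Toy usage: several pairs tie on raw frequency; the attention term breaks
# the tie toward the semantically coherent suffix fragment "es".
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
attention = {("e", "s"): 0.9, ("s", "t"): 0.8}  # stand-in encoder scores
best_pair, corpus = ag_bpe_merge_step(corpus, attention)
print(best_pair)  # ('e', 's') under these assumed scores
```

The multiplicative blend is just one plausible way to combine the two signals; the repository linked under Additional details defines the actual hybrid score used in the benchmarks.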
Files (204.9 kB)

Name | Size | Checksum
---|---|---
Attention_Guided_BPE__AG_BPE_Théo_CHARLET.pdf | 193.1 kB | md5:fcdb9bbd0f6cc46e7b938f6027cd1c4e
(unnamed file) | 11.7 kB | md5:2541f0ea9a534494baf066567e4774d4
Additional details
Software
- Repository URL: https://github.com/RDTvlokip/AG-BPE
- Programming language: Python
- Development Status: Active