AG-BPE: Attention-Guided Byte-Pair Encoding for Semantic-Aware Tokenization
Creators
Théo CHARLET
Description
Standard subword tokenization methods such as Byte-Pair Encoding (BPE) are foundational to modern LLMs but often produce suboptimal, semantically incoherent tokens. We introduce Attention-Guided BPE (AG-BPE), a novel approach that enhances the BPE algorithm with a semantic-aware guidance mechanism: a lightweight Transformer encoder informs merge decisions through a hybrid score, favoring tokens that are both frequent and semantically coherent. Benchmarked against industry standards, AG-BPE, trained on a modest 302 MB dataset, demonstrates a state-of-the-art compression ratio (3.77x) with a vocabulary up to 12 times more compact than those of its competitors. It combines this efficiency with perfect robustness on modern, multilingual text and a decoding speed over 30x faster than these baselines. Qualitative analysis reveals its distinctive ability to identify and preserve fundamental morphological units, offering a promising direction for more efficient and interpretable language models.
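For intuition, here is a minimal sketch of how a merge decision could blend pair frequency with an attention-derived coherence term. The function names, the stub `coherence` value, and the weighting coefficient `lam` are illustrative assumptions, not the paper's actual implementation (see the repository below for that); in AG-BPE the coherence signal would come from the lightweight Transformer encoder's attention between adjacent symbols.

```python
# Sketch of one attention-guided BPE merge step (illustrative, not the
# official AG-BPE code). Hybrid score = normalized frequency + lam * coherence.
from collections import Counter

def pair_frequencies(corpus):
    """Count adjacent symbol pairs across a tokenized corpus."""
    freqs = Counter()
    for tokens in corpus:
        for a, b in zip(tokens, tokens[1:]):
            freqs[(a, b)] += 1
    return freqs

def coherence(pair):
    """Placeholder for the attention-based coherence score in [0, 1].
    A real implementation would average the encoder's attention weights
    between the two symbols over the corpus; here it is a constant stub."""
    return 0.5

def best_merge(corpus, lam=0.5):
    """Pick the pair maximizing the hybrid score."""
    freqs = pair_frequencies(corpus)
    if not freqs:
        return None
    max_freq = max(freqs.values())
    return max(freqs, key=lambda p: freqs[p] / max_freq + lam * coherence(p))

def apply_merge(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = pair[0] + pair[1]
    out = []
    for tokens in corpus:
        new, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                new.append(merged)
                i += 2
            else:
                new.append(tokens[i])
                i += 1
        out.append(new)
    return out

corpus = [list("lower"), list("lowest"), list("slower")]
pair = best_merge(corpus)
print(pair, apply_merge(corpus, pair)[0])
```

With a non-constant coherence term, two pairs of equal frequency are ranked by how semantically cohesive the encoder judges them, which is how merges can come to track morphological units rather than raw co-occurrence alone.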
Files
Attention_Guided_BPE__AG_BPE_Théo_CHARLET.pdf
Additional details
Software
- Repository URL: https://github.com/RDTvlokip/AG-BPE
- Programming language: Python
- Development Status: Active