AG-BPE: Advanced Benchmarking and Dataset Improvements
Creators
Théo CHARLET
Description
Standard subword tokenization methods such as Byte-Pair Encoding (BPE) are foundational to modern large language models, but their merge decisions rely on purely statistical frequency, ignoring the semantic coherence of the tokens they create. This can lead to suboptimal segmentations that split meaningful morphological units. We introduce Attention-Guided BPE (AG-BPE), a novel approach that enhances the BPE algorithm with a semantic-aware guidance mechanism. Instead of relying solely on frequency, AG-BPE's merge decisions are informed by a hybrid score that combines co-occurrence statistics with contextual attention scores from a lightweight Transformer encoder. This process favors the creation of tokens that are not only frequent but also semantically coherent. Through a series of benchmarks against the standard tokenizers used by GPT-2, BERT, and T5, we demonstrate that AG-BPE, despite using a more compact vocabulary, achieves superior vocabulary efficiency and perfect reconstruction fidelity. Qualitative analysis further reveals its unique ability to identify and preserve fundamental morphological units, offering a promising direction for more interpretable and compositionally effective vocabularies.
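As a rough illustration of the mechanism described above (not the published formulation, which is specified in the accompanying PDF and repository), the sketch below implements a single merge step of a BPE loop whose pair ranking blends co-occurrence frequency with a precomputed attention score. The `attention_scores` lookup, the blending weight `lam`, and the multiplicative combination are illustrative assumptions; in AG-BPE the attention signal would come from the lightweight Transformer encoder.

```python
import re
from collections import Counter

def pair_frequencies(corpus):
    """Count adjacent symbol pairs in a corpus of space-separated words."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def ag_bpe_merge_step(corpus, attention_scores, lam=0.5):
    """Perform one attention-guided merge step.

    Hybrid score (illustrative): (1 - lam) * freq + lam * freq * attention,
    where `attention_scores` maps symbol pairs to coherence scores in [0, 1].
    In AG-BPE these scores would come from the lightweight Transformer encoder.
    """
    pairs = pair_frequencies(corpus)
    if not pairs:
        return None, corpus

    def score(pair):
        freq = pairs[pair]
        return (1 - lam) * freq + lam * freq * attention_scores.get(pair, 0.0)

    best = max(pairs, key=score)
    # Merge the winning pair wherever it occurs as two whole adjacent symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
    merged = Counter()
    for word, freq in corpus.items():
        merged[pattern.sub("".join(best), word)] += freq
    return best, dict(merged)

# Toy usage: several pairs tie on raw frequency; the attention term breaks
# the tie toward the semantically coherent suffix fragment "es".
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
attention = {("e", "s"): 0.9, ("s", "t"): 0.8}  # stand-in encoder scores
best_pair, corpus = ag_bpe_merge_step(corpus, attention)
print(best_pair)  # ('e', 's') under these assumed scores
```

The multiplicative blend is just one plausible way to combine the two signals; the repository linked under Additional details defines the actual hybrid score used in the benchmarks.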
Files (204.9 kB)

Name | Size | Checksum
---|---|---
Attention_Guided_BPE__AG_BPE_Théo_CHARLET.pdf | 193.1 kB | md5:fcdb9bbd0f6cc46e7b938f6027cd1c4e
(unnamed file) | 11.7 kB | md5:2541f0ea9a534494baf066567e4774d4
Additional details
Software
- Repository URL: https://github.com/RDTvlokip/AG-BPE
- Programming language: Python
- Development Status: Active