Published August 4, 2025 | Version v4
Preprint | Open Access

AG-BPE v4: Enhanced Attention-Guided Tokenization

Description

We present AG-BPE v4, an enhanced Attention-Guided Byte-Pair Encoding system that introduces weighted layer aggregation and robust, production-ready features to achieve superior tokenization quality across diverse linguistic contexts. Building on semantics-aware merge decisions, AG-BPE v4 incorporates an attention mechanism that aggregates information from multiple transformer layers with learnable weights, emphasizing the deeper, more semantically informed representations. The system also provides Unicode-based text preprocessing, checkpoint recovery mechanisms, and memory-optimization strategies for large-scale deployment. Our benchmarks show that AG-BPE v4 achieves an effectiveness-per-KB ratio of 0.0149 with 3.85x compression while maintaining exceptional decoding speed (0.03 ms) and perfect robustness on multilingual text, including complex scripts such as Korean and mathematical symbols. Qualitative analysis reveals enhanced morphological awareness and strong zero-shot cross-lingual generalization: a French-trained model correctly segments morphemes across languages and scripts. The weighted-layer approach consistently outperforms alternatives, and ablation studies confirm that deeper transformer layers provide more semantically meaningful attention patterns for guiding tokenization decisions.
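
To make the weighted layer aggregation concrete, the sketch below illustrates one plausible reading of the mechanism described above. It is not taken from the AG-BPE repository: the function names (aggregate_layer_attention, merge_scores), the head-averaging step, and the pair-scoring rule are assumptions introduced here purely for illustration. The idea shown is that per-layer attention maps are mixed with learnable weights biased toward deeper layers, and the aggregated map is then used to score adjacent-token pairs as merge candidates.

import torch

def aggregate_layer_attention(attentions, layer_logits):
    # attentions: list of per-layer attention tensors, each of shape (heads, seq, seq).
    # layer_logits: learnable tensor of shape (num_layers,); its softmax gives the
    # per-layer mixing weights, so deeper layers can be given more influence.
    weights = torch.softmax(layer_logits, dim=0)                   # normalized layer weights
    per_layer = torch.stack([a.mean(dim=0) for a in attentions])   # average heads -> (L, seq, seq)
    return (weights[:, None, None] * per_layer).sum(dim=0)         # weighted sum over layers -> (seq, seq)

def merge_scores(agg_attention, pair_positions):
    # Score each adjacent-token pair by the attention mass linking the two positions;
    # higher scores mark pairs that are stronger candidates for a BPE merge.
    return {
        (i, i + 1): float(agg_attention[i, i + 1] + agg_attention[i + 1, i])
        for i in pair_positions
    }

if __name__ == "__main__":
    # Toy usage with random attention maps from a hypothetical 4-layer, 2-head encoder.
    seq_len, num_layers, num_heads = 8, 4, 2
    attentions = [torch.rand(num_heads, seq_len, seq_len) for _ in range(num_layers)]
    # Initialize the logits so that deeper layers start with more weight.
    layer_logits = torch.nn.Parameter(torch.linspace(-1.0, 1.0, num_layers))
    agg = aggregate_layer_attention(attentions, layer_logits)
    print(merge_scores(agg, range(seq_len - 1)))

Because layer_logits is a learnable parameter, the layer mixture can in principle be tuned rather than fixed, which matches the paper's claim that deeper, more semantically aware layers end up dominating the aggregation.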

Files (1.2 MB)

AG-BPE v4 ; Enhanced Attention-Guided Tokenization ; Théo CHARLET.pdf


Additional details

Software

Repository URL: https://github.com/RDTvlokip/AG-BPE
Programming language: Python
Development status: Active