Published August 4, 2025 | Version v4
Preprint | Open Access

AG-BPE v4: Enhanced Attention-Guided Tokenization

Description

We present AG-BPE v4, an enhanced Attention-Guided Byte-Pair Encoding system that introduces weighted layer aggregation and robust, production-ready features to achieve superior tokenization quality across diverse linguistic contexts. Building on semantics-aware merge decisions, AG-BPE v4 incorporates an attention mechanism that aggregates information from multiple transformer layers with learnable weights, emphasizing the deeper, more semantically informed representations. The system also provides Unicode-based text preprocessing, checkpoint recovery mechanisms, and memory-optimization strategies for large-scale deployment. Our benchmarks show that AG-BPE v4 achieves an effectiveness-per-KB ratio of 0.0149 with 3.85x compression while maintaining exceptional decoding speed (0.03 ms) and perfect robustness on multilingual text, including complex scripts such as Korean and mathematical symbols. Qualitative analysis reveals enhanced morphological awareness and strong zero-shot cross-lingual generalization: a French-trained model correctly segments morphemes across languages and scripts. The weighted-layer approach consistently outperforms alternatives, and ablation studies confirm that deeper transformer layers provide more semantically meaningful attention patterns for guiding tokenization decisions.
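
To make the weighted layer aggregation concrete, the sketch below illustrates one plausible reading of the mechanism described above. It is not taken from the AG-BPE repository: the function names (aggregate_layer_attention, merge_scores), the head-averaging step, and the pair-scoring rule are assumptions introduced here purely for illustration. The idea shown is that per-layer attention maps are mixed with learnable weights biased toward deeper layers, and the aggregated map is then used to score adjacent-token pairs as merge candidates.

import torch

def aggregate_layer_attention(attentions, layer_logits):
    # attentions: list of per-layer attention tensors, each of shape (heads, seq, seq).
    # layer_logits: learnable tensor of shape (num_layers,); its softmax gives the
    # per-layer mixing weights, so deeper layers can be given more influence.
    weights = torch.softmax(layer_logits, dim=0)                   # normalized layer weights
    per_layer = torch.stack([a.mean(dim=0) for a in attentions])   # average heads -> (L, seq, seq)
    return (weights[:, None, None] * per_layer).sum(dim=0)         # weighted sum over layers -> (seq, seq)

def merge_scores(agg_attention, pair_positions):
    # Score each adjacent-token pair by the attention mass linking the two positions;
    # higher scores mark pairs that are stronger candidates for a BPE merge.
    return {
        (i, i + 1): float(agg_attention[i, i + 1] + agg_attention[i + 1, i])
        for i in pair_positions
    }

if __name__ == "__main__":
    # Toy usage with random attention maps from a hypothetical 4-layer, 2-head encoder.
    seq_len, num_layers, num_heads = 8, 4, 2
    attentions = [torch.rand(num_heads, seq_len, seq_len) for _ in range(num_layers)]
    # Initialize the logits so that deeper layers start with more weight.
    layer_logits = torch.nn.Parameter(torch.linspace(-1.0, 1.0, num_layers))
    agg = aggregate_layer_attention(attentions, layer_logits)
    print(merge_scores(agg, range(seq_len - 1)))

Because layer_logits is a learnable parameter, the layer mixture can in principle be tuned rather than fixed, which matches the paper's claim that deeper, more semantically aware layers end up dominating the aggregation.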

Files (1.2 MB)

AG-BPE v4 ; Enhanced Attention-Guided Tokenization ; Théo CHARLET.pdf


Additional details

Software

Repository URL: https://github.com/RDTvlokip/AG-BPE
Programming language: Python
Development status: Active