Published September 14, 2025 | Version 6.0
Preprint | Open

Intelligent Tokenizer: Attention Needs No Vocabulary

Description

The title is offered as an homage to "Attention Is All You Need" (Vaswani et al., 2017).

Current tokenization methods rely heavily on language-specific rules and predefined vocabularies, which limits their applicability to new languages and domains. We present Intelligent Tokenizer, a purely learning-based approach that processes text at the byte level without any linguistic rules or vocabulary files. Our model learns language patterns directly from raw UTF-8 bytes through a hierarchical attention mechanism, achieving vocabulary-free tokenization across 204 languages. The key innovation is the separation of tokenization from the language model, which enables more efficient resource utilization and improved generalization. With only 105M parameters, our model achieves 95% reconstruction accuracy for English while maintaining real-time processing through 256-byte chunking. This work is a step toward universal, language-agnostic AI systems that can adapt to any text format without manual configuration.
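To make the byte-level input concrete, here is a minimal sketch of how raw text might be turned into UTF-8 bytes and split into 256-byte chunks before reaching a model. It is an illustration only, not the code from the repository: the function names `to_byte_chunks` and `from_byte_chunks` are hypothetical, and the only value taken from the description above is the 256-byte chunk size.

```python
from typing import List

CHUNK_SIZE = 256  # chunk length in bytes, as stated in the description


def to_byte_chunks(text: str, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Encode text as UTF-8 and split it into fixed-size byte chunks.

    Hypothetical helper: no vocabulary file or language rule is involved,
    only the raw byte values of the input.
    """
    data = text.encode("utf-8")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def from_byte_chunks(chunks: List[bytes]) -> str:
    """Rejoin the chunks before decoding.

    A chunk boundary can fall in the middle of a multi-byte character,
    so a single chunk is not guaranteed to be valid UTF-8 on its own.
    """
    return b"".join(chunks).decode("utf-8")


# Mixed English and Korean input; Korean characters take 3 bytes each in UTF-8.
sample = "Attention needs no vocabulary. 어텐션은 어휘가 필요 없다. " * 20
chunks = to_byte_chunks(sample)

# Each chunk becomes a sequence of integer IDs in the range 0..255.
ids = [list(chunk) for chunk in chunks]

# Chunking is lossless: rejoining the chunks recovers the original text.
assert from_byte_chunks(chunks) == sample
```

Because every input symbol is a byte, the ID space is fixed at 256 values for any language. The trade-off is that sequences are longer than with subword vocabularies, which is presumably why the description pairs byte-level input with fixed-size chunking.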

Files

Intelligent Tokenizer.pdf

Files (130.1 kB)

md5:e924bdf5cc991a6f52d7b841bf412b0e

Additional details

Software

Repository URL
https://github.com/Woojiggun/intelligent-tokenizer
Programming language
Python
Development Status
Active