Intelligent Tokenizer: Attention Needs No Vocabulary
Creators
Description
Title offered as an homage to "Attention Is All You Need" (Vaswani et al., 2017)
Current tokenization methods rely heavily on language-specific rules and pre-defined vocabularies, limiting their applicability to new languages and domains. We present Intelligent Tokenizer, a pure learning-based approach that processes text at the byte level without any linguistic rules or vocabulary files. Our model learns language patterns directly from raw UTF-8 bytes through a hierarchical attention mechanism, achieving vocabulary-free tokenization across 204 languages. The key innovation lies in separating tokenization from language models, enabling efficient resource utilization and improved generalization. With only 105M parameters, our model demonstrates 95% reconstruction accuracy for English while maintaining real-time processing capabilities through 256-byte chunking. This work represents a step toward universal, language-agnostic AI systems that can adapt to any text format without manual configuration.
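As a rough illustration only (not the authors' implementation; the function name and details here are assumptions), the vocabulary-free, byte-level front end described above can be sketched as fixed 256-byte windows over raw UTF-8 bytes:

```python
def byte_chunks(text: str, chunk_size: int = 256) -> list[list[int]]:
    """Split text into fixed-size windows of raw UTF-8 byte values (0-255).

    Illustrative sketch: no vocabulary file or language rules are used;
    the model itself is assumed to learn byte patterns from these windows.
    """
    data = text.encode("utf-8")  # raw bytes, language-agnostic
    return [list(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]

# Works identically for any script, since everything is just bytes.
chunks = byte_chunks("안녕하세요, hello!")
```

Note that a fixed byte window can split a multi-byte UTF-8 character across chunks; in a pure learning-based approach, handling such boundaries is left to the model rather than to hand-written rules.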
Files
| Name | Size | md5 |
|---|---|---|
| Intelligent Tokenizer.pdf | 130.1 kB | e924bdf5cc991a6f52d7b841bf412b0e |
Additional details
Software
- Repository URL
- https://github.com/Woojiggun/intelligent-tokenizer
- Programming language
- Python
- Development Status
- Active