Intelligent Tokenizer: Attention Needs No Vocabulary
Creators
Description
Title offered as an homage to "Attention Is All You Need" (Vaswani et al., 2017)
Current tokenization methods rely heavily on language-specific rules and pre-defined vocabularies, limiting their applicability to new languages and domains. We present Intelligent Tokenizer, a pure learning-based approach that processes text at the byte level without any linguistic rules or vocabulary files. Our model learns language patterns directly from raw UTF-8 bytes through a hierarchical attention mechanism, achieving vocabulary-free tokenization across 204 languages. The key innovation lies in separating tokenization from language models, enabling efficient resource utilization and improved generalization. With only 105M parameters, our model demonstrates 95% reconstruction accuracy for English while maintaining real-time processing capabilities through 256-byte chunking. This work represents a step toward universal, language-agnostic AI systems that can adapt to any text format without manual configuration.
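As a rough illustration only (not the authors' implementation; the function name and details here are assumptions), the vocabulary-free, byte-level front end described above can be sketched as fixed 256-byte windows over raw UTF-8 bytes:

```python
def byte_chunks(text: str, chunk_size: int = 256) -> list[list[int]]:
    """Split text into fixed-size windows of raw UTF-8 byte values (0-255).

    Illustrative sketch: no vocabulary file or language rules are used;
    the model itself is assumed to learn byte patterns from these windows.
    """
    data = text.encode("utf-8")  # raw bytes, language-agnostic
    return [list(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]

# Works identically for any script, since everything is just bytes.
chunks = byte_chunks("안녕하세요, hello!")
```

Note that a fixed byte window can split a multi-byte UTF-8 character across chunks; in a pure learning-based approach, handling such boundaries is left to the model rather than to hand-written rules.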
Files
| Name | Size | md5 |
|---|---|---|
| Intelligent Tokenizer.pdf | 130.1 kB | e924bdf5cc991a6f52d7b841bf412b0e |
Additional details
Software
- Repository URL
- https://github.com/Woojiggun/intelligent-tokenizer
- Programming language
- Python
- Development Status
- Active