Published April 2, 2026 | Version v1
Preprint

Concept-as-Byte Language Model Encoding Preprint

Description

Concept-as-Byte is a novel encoding architecture for language model training that replaces subword tokenization (BPE, WordPiece, SentencePiece) with dense morphological byte encoding, in which each byte represents a complete semantic concept rather than a statistical subword fragment. The morpheme inventory is drawn from Zamenhof's Esperanto system (1887-1894), providing approximately 1,700 atomic morphemes that compose into any expressible concept through systematic combination with 41 derivational affixes and 12 grammatical endings. This preprint presents preliminary results from a 134.7M-parameter model that reached a training loss of 1.8257 after 6.5 hours of training on a single GPU. The encoding table, encoder/decoder implementation, model architectures (13.2M and 134.7M parameters), and training logs are available in the Git repository: https://github.com/Laninthalesdran/Concept-as-Byte. U.S. Patent Application No. 64/017,122, filed March 25, 2026.
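To make the composition idea concrete, the following is a minimal, hypothetical sketch of a morpheme-to-byte codec. The morpheme table, segmentation rule, and byte assignments here are illustrative assumptions, not the scheme from the repository; note also that a full ~1,700-morpheme inventory exceeds the 256 values a single byte can hold, so the real encoding table would need a paging or multi-byte strategy.

```python
# Hypothetical concept-as-byte codec sketch. The real morpheme
# inventory and encoding table live in the project repository.
MORPHEMES = ["mal", "san", "ul", "ej", "o", "j", "n", "bon", "a"]

# Assign each morpheme a byte ID. (A full ~1,700-entry inventory
# cannot fit in one byte; this toy table sidesteps that problem.)
TO_BYTE = {m: i for i, m in enumerate(MORPHEMES)}
FROM_BYTE = {i: m for m, i in TO_BYTE.items()}

def encode(word: str) -> bytes:
    """Greedy longest-match segmentation into morpheme byte IDs."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest piece first
            piece = word[i:j]
            if piece in TO_BYTE:
                out.append(TO_BYTE[piece])
                i = j
                break
        else:
            raise ValueError(f"no morpheme covers {word[i:]!r}")
    return bytes(out)

def decode(data: bytes) -> str:
    """Concatenate the morphemes named by each byte."""
    return "".join(FROM_BYTE[b] for b in data)

# Esperanto "malsanulejo" (hospital) = mal + san + ul + ej + o,
# so it encodes to five bytes, one per morphological concept.
encoded = encode("malsanulejo")
assert len(encoded) == 5
assert decode(encoded) == "malsanulejo"
```

Greedy longest-match is only one possible segmentation rule; cases where a correct parse requires backtracking would need a more careful segmenter than this sketch provides.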

Files

Concept-as-Byte_Preprint.pdf (21.6 kB)
md5:8867cd0f4d67285b8d44405f5de3947c