Published April 2, 2026 | Version v1
Preprint

Concept-as-Byte Language Model Encoding Preprint

Description

Concept-as-Byte is a novel encoding architecture for language model training that replaces subword tokenization (BPE, WordPiece, SentencePiece) with dense morphological byte encoding, in which each byte represents a complete semantic concept rather than a statistical subword fragment. The morpheme inventory is drawn from Zamenhof's Esperanto system (1887-1894), providing approximately 1,700 atomic morphemes that compose into any expressible concept through systematic combination with 41 derivational affixes and 12 grammatical endings. This preprint presents preliminary results from a 134.7M-parameter model that reached a training loss of 1.8257 after 6.5 hours of training on a single GPU. The encoding table, encoder/decoder implementation, model architectures (13.2M and 134.7M parameters), and training logs are available in the Git repository: https://github.com/Laninthalesdran/Concept-as-Byte. U.S. Patent Application No. 64/017,122, filed March 25, 2026.
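To make the composition idea concrete, the following is a minimal, hypothetical sketch of a morpheme-to-byte codec. The morpheme table, segmentation rule, and byte assignments here are illustrative assumptions, not the scheme from the repository; note also that a full ~1,700-morpheme inventory exceeds the 256 values a single byte can hold, so the real encoding table would need a paging or multi-byte strategy.

```python
# Hypothetical concept-as-byte codec sketch. The real morpheme
# inventory and encoding table live in the project repository.
MORPHEMES = ["mal", "san", "ul", "ej", "o", "j", "n", "bon", "a"]

# Assign each morpheme a byte ID. (A full ~1,700-entry inventory
# cannot fit in one byte; this toy table sidesteps that problem.)
TO_BYTE = {m: i for i, m in enumerate(MORPHEMES)}
FROM_BYTE = {i: m for m, i in TO_BYTE.items()}

def encode(word: str) -> bytes:
    """Greedy longest-match segmentation into morpheme byte IDs."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest piece first
            piece = word[i:j]
            if piece in TO_BYTE:
                out.append(TO_BYTE[piece])
                i = j
                break
        else:
            raise ValueError(f"no morpheme covers {word[i:]!r}")
    return bytes(out)

def decode(data: bytes) -> str:
    """Concatenate the morphemes named by each byte."""
    return "".join(FROM_BYTE[b] for b in data)

# Esperanto "malsanulejo" (hospital) = mal + san + ul + ej + o,
# so it encodes to five bytes, one per morphological concept.
encoded = encode("malsanulejo")
assert len(encoded) == 5
assert decode(encoded) == "malsanulejo"
```

Greedy longest-match is only one possible segmentation rule; cases where a correct parse requires backtracking would need a more careful segmenter than this sketch provides.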

Files

Concept-as-Byte_Preprint.pdf (21.6 kB)
md5:8867cd0f4d67285b8d44405f5de3947c