Published June 17, 2026 | Version v1

TOTEN: Knowledge-Based Ontological Tokenization of Physical Quantities and Technical Notation in Brazilian Portuguese

Description

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression but semantically blind to structured technical entities: it fragments physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple ⟨O, classify, {inst_τ}⟩: the ontology O gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the family of instantiators, indexed by type τ, produces a self-descriptive structured representation. Robustness derives from deterministic coupling with three consolidated external oracles: Pint (dimensional), the Unicode Character Database (typographic), and RSLP (Portuguese morphology). The intrinsic evaluation covers four properties verifiable by construction (ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction) over an internally generated and physically validated benchmark (EngQuant, N = 800) and four external corpora in Brazilian Portuguese (N = 1,771 cases eligible for numerical reconstruction). We additionally report detection recall, distinguishing coverage from conditional atomicity. Compared to eight representative state-of-the-art systems, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775 to 0.904 on external corpora, against 0.627 to 0.703 for the best baseline (Quantulum3); on the internal benchmark, it reaches 0.780 against 0.340. Differences in atomicity and reconstruction are statistically significant (McNemar test with Holm correction). The Spearman rank correlation between internal and external corpus rankings confirms the concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.
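As a rough illustration of the triple described above, the following Python sketch classifies text into typed regions and instantiates each region as a self-descriptive record. It is not the TOTEN implementation. The names `UNITS`, `Region`, `classify`, `inst_quantity`, `inst_text`, and `instantiate` are hypothetical, and the small unit table is an assumed stand-in for a dimensional oracle such as Pint. Typographic normalization uses the standard-library `unicodedata` module, and the number pattern handles only a simple decimal comma or dot (no thousands separators).

```python
import re
import unicodedata
from dataclasses import dataclass

# Hypothetical toy unit table standing in for a dimensional oracle (e.g. Pint).
# Maps a unit symbol to (dimension, factor to the SI base unit).
UNITS = {
    "mm": ("length", 1e-3),
    "cm": ("length", 1e-2),
    "m": ("length", 1.0),
    "km": ("length", 1e3),
    "g": ("mass", 1e-3),
    "kg": ("mass", 1.0),
}

@dataclass(frozen=True)
class Region:
    kind: str  # "QUANTITY" or "TEXT"
    text: str

# Longest symbols first so that "km" is tried before "m".
_UNIT_ALT = "|".join(sorted((re.escape(u) for u in UNITS), key=len, reverse=True))
# A number with an optional decimal comma or dot, optional whitespace, a unit.
QUANTITY_RE = re.compile(rf"(\d+(?:[.,]\d+)?)\s*({_UNIT_ALT})\b")

def classify(text):
    """Map raw text to a list of typed regions (each quantity stays whole)."""
    # NFKC folds compatibility characters, e.g. a no-break space to a space.
    text = unicodedata.normalize("NFKC", text)
    regions, pos = [], 0
    for m in QUANTITY_RE.finditer(text):
        if m.start() > pos:
            regions.append(Region("TEXT", text[pos:m.start()]))
        regions.append(Region("QUANTITY", m.group(0)))
        pos = m.end()
    if pos < len(text):
        regions.append(Region("TEXT", text[pos:]))
    return regions

def inst_quantity(region):
    """Instantiator for QUANTITY regions: value, unit, dimension, SI value."""
    m = QUANTITY_RE.fullmatch(region.text)
    value = float(m.group(1).replace(",", "."))
    unit = m.group(2)
    dimension, factor = UNITS[unit]
    return {"type": "QUANTITY", "value": value, "unit": unit,
            "dimension": dimension, "si_value": value * factor}

def inst_text(region):
    """Instantiator for plain TEXT regions."""
    return {"type": "TEXT", "text": region.text}

# The family of instantiators, indexed by region type.
INSTANTIATORS = {"QUANTITY": inst_quantity, "TEXT": inst_text}

def instantiate(region):
    return INSTANTIATORS[region.kind](region)

if __name__ == "__main__":
    for r in classify("A viga mede 2,5\u00a0km e pesa 3 kg."):
        print(instantiate(r))
```

In this toy setting, the four evaluated properties have simple analogues: each quantity remains a single region (atomicity), `2,5 km` and `2500 m` yield the same `si_value` (dimensional equivalence), a no-break space between number and unit does not change the result (typographic robustness), and the numeric value is recoverable from the record (numerical reconstruction).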

Files

toten_paper_preprint.pdf (1.3 MB)
md5:ac06a37213bffbff68a1c5edc88eeb1e