Published December 29, 2025 | Version v1
Preprint Open

Orthographic Structure Matters: Tokenization Failures in Closely Related Languages

  • 1. University of Colorado Boulder

Description

Multilingual evaluation often relies on language coverage or translated benchmarks, implicitly assuming that subword tokenization behaves comparably across scripts. In mixed-script settings, this assumption breaks down. We examine this effect using polarity detection as a case study, comparing Orthographic Syllable Pair Encoding (OSPE) and Byte Pair Encoding (BPE) under identical architectures, data, and training conditions on SemEval Task 9, which spans Devanagari, Perso-Arabic, and Latin scripts. OSPE is applied to Hindi, Nepali, Urdu, and Arabic, while BPE is retained for English. We find that BPE systematically underestimates performance in abugida and abjad scripts, producing fragmented representations, unstable optimization, and drops of up to 27 macro-F1 points for Nepali, while English remains largely unaffected. Script-aware segmentation preserves orthographic structure, stabilizes training, and improves cross-language comparability without additional data or model scaling, highlighting tokenization as a latent but consequential evaluation decision in multilingual benchmarks.
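
To make the fragmentation claim concrete, the following is a minimal sketch, not the paper's OSPE implementation, whose details are not given in this description. It contrasts two possible starting units for a Devanagari word: raw UTF-8 bytes (the base alphabet of byte-level BPE before any merges are learned, assumed here purely for illustration) and orthographic syllables found with a simplified regular expression. The function name `orthographic_syllables` and the character ranges are assumptions chosen for this example; they ignore rarer signs and do not handle Perso-Arabic text.

```python
import re

# Simplified Devanagari character classes (an assumption for illustration;
# rarer signs and script-specific edge cases are deliberately ignored).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"   # consonant, optional nukta
VIRAMA = "\u094D"                                   # conjunct-forming sign
VOWEL_SIGN = "[\u093E-\u094C\u0962\u0963]"          # dependent vowel signs
MODIFIER = "[\u0900-\u0903]"                        # candrabindu, anusvara, visarga
INDEPENDENT_VOWEL = "[\u0904-\u0914\u0960\u0961]"

# One orthographic syllable: an optional conjunct cluster, a base consonant,
# an optional vowel sign or final virama, then modifiers; or an independent
# vowel with modifiers. Any other character falls through as a single unit.
SYLLABLE = re.compile(
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{VIRAMA}|{VOWEL_SIGN})?{MODIFIER}*"
    f"|{INDEPENDENT_VOWEL}{MODIFIER}*"
    "|.",
    re.DOTALL,
)


def orthographic_syllables(text: str) -> list[str]:
    """Split text into (simplified) orthographic syllables."""
    return SYLLABLE.findall(text)


if __name__ == "__main__":
    # "namaste" written in Devanagari, spelled with escapes for clarity.
    word = "\u0928\u092E\u0938\u094D\u0924\u0947"
    print(len(word.encode("utf-8")), "UTF-8 bytes")
    print(len(orthographic_syllables(word)), "orthographic syllables")
```

Under these assumptions the six-code-point word occupies 18 UTF-8 bytes but only three orthographic syllables, so a merge-based tokenizer that starts from syllables begins with units that already respect conjuncts and vowel signs. Latin-script text passes through the fallback branch one character at a time, which is consistent with the description's choice to retain BPE for English.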

Files

When_Multilingual_Evaluation_Assumptions_Fail_Tokenization_Effects_Across_Scripts.pdf