Orthographic Structure Matters: Tokenization Failures in Closely Related Languages
Description
Multilingual evaluation often relies on language coverage or translated benchmarks, implicitly assuming that subword tokenization behaves comparably across scripts. In mixed-script settings, this assumption breaks down. We examine this effect using polarity detection as a case study, comparing Orthographic Syllable Pair Encoding (OSPE) and Byte Pair Encoding (BPE) under identical architectures, data, and training conditions on SemEval Task 9, which spans Devanagari, Perso-Arabic, and Latin scripts. OSPE is applied to Hindi, Nepali, Urdu, and Arabic, while BPE is retained for English. We find that BPE systematically underestimates performance in abugida and abjad scripts, producing fragmented representations, unstable optimization, and drops of up to 27 macro-F1 points for Nepali, while English remains largely unaffected. Script-aware segmentation preserves orthographic structure, stabilizes training, and improves cross-language comparability without additional data or model scaling, highlighting tokenization as a latent but consequential evaluation decision in multilingual benchmarks.
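To make the contrast concrete, the sketch below is a simplified, illustrative orthographic-syllable (akshara) splitter for Devanagari; it is not the paper's OSPE tokenizer, and the regex ranges are assumptions covering only common consonants, independent vowels, and vowel signs. It shows how a word stays in a few script-coherent units, whereas a byte-level view of the same word yields many more fragments.

```python
import re

# Illustrative akshara splitter for Devanagari (NOT the paper's OSPE implementation).
# An orthographic syllable groups consonant clusters joined by the virama (U+094D)
# together with a following vowel sign, keeping one written unit as one token.
AKSHARA = re.compile(
    r"(?:[\u0915-\u0939\u0958-\u095F]\u094D)*"  # consonant + virama clusters
    r"[\u0915-\u0939\u0958-\u095F]"             # base consonant
    r"[\u093E-\u094C\u0901-\u0903]*"            # dependent vowel signs, nasalization
    r"|[\u0904-\u0914]"                         # independent vowels
    r"|."                                       # fallback: any other character
)

def orthographic_syllables(text: str) -> list[str]:
    """Split text into orthographic syllables (aksharas)."""
    return AKSHARA.findall(text)

word = "नमस्ते"  # "namaste": 6 code points, 3 written units
print(orthographic_syllables(word))        # 3 script-coherent tokens
print(len(word.encode("utf-8")))           # 18 bytes for a byte-level tokenizer to merge
```

Under this toy segmentation, the conjunct स्ते survives as a single unit; a byte-level BPE must instead learn to reassemble it from 9 UTF-8 bytes, which is the kind of fragmentation the abstract attributes to degraded performance in abugida scripts.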
Files
When_Multilingual_Evaluation_Assumptions_Fail_Tokenization_Effects_Across_Scripts.pdf