Published August 12, 2020 | Version v1
Journal article Open

ROBUST EXTENDED TOKENIZATION FRAMEWORK FOR ROMANIAN BY SEMANTIC PARALLEL TEXTS PROCESSING

Description

Tokenization is considered a solved problem when it is reduced to identifying word boundaries and handling punctuation and white space. Obtaining a high-quality outcome from this process is essential for the downstream stages of an NLP pipeline (POS tagging, WSD). In this paper we claim that achieving this quality requires the tokenization disambiguation process to use whatever linguistic, morphosyntactic, and semantic-level word-related information is necessary. We also claim that semantic disambiguation performs much better in a bilingual context than in a monolingual one. We then prove that, for disambiguation purposes, bilingual text provided by high-profile online machine translation services performs almost at the same level as human-originated parallel texts (the gold standard). Finally, we claim that the tokenization algorithm incorporated in TORO can be used as a criterion for the comparative quality assessment of online machine translation services, and we provide a setup for this purpose.

Files

1.pdf (385.6 kB)
md5:22139aecda15186d5b48676a50a7e1fc