ROBUST EXTENDED TOKENIZATION FRAMEWORK FOR ROMANIAN BY SEMANTIC PARALLEL TEXTS PROCESSING

Eng. Marius Zubac and Prof. PhD Eng. Vasile Dădârlat

doi:10.5281/zenodo.3980607

Published August 12, 2020 | Version v1

Journal article Open

Tokenization is considered a solved problem when it is reduced to identifying word boundaries and handling punctuation and white space. Yet a high-quality tokenization is essential for the downstream stages of an NLP pipeline, such as part-of-speech (POS) tagging and word sense disambiguation (WSD). In this paper we claim that achieving this quality requires the tokenization disambiguation process to draw on whatever word-level linguistic, morphosyntactic, and semantic information is necessary. We also claim that semantic disambiguation performs much better in a bilingual context than in a monolingual one. We then show that, for disambiguation purposes, bilingual text produced by high-profile online machine translation services performs almost as well as human-produced parallel texts (the gold standard). Finally, we claim that the tokenization algorithm incorporated in TORO can serve as a criterion for the comparative quality assessment of online machine translation services, and we provide a setup for this purpose.
