Philotis: NLP Informed Language Documentation of Under-Resourced Languages - The Case of Pomak
Creators
- Athena-Research and Innovation Center in Information, Communication and Knowledge Technologies
Description
The Philotis Project develops a platform for the multimodal documentation of living languages; the documentation materials are processed with state-of-the-art NLP technology that renders them suitable for downstream applications. Philotis supports language documentation practitioners by automating the required technical processes so that they can focus on the linguistic aspects of resource collection and annotation. The platform supports several scenarios, depending on the available documentation materials and the required processing pipelines. Pomak, an endangered oral Slavic language of the Balkans, is used as a case study. Documentation of Pomak is an ongoing project by an interdisciplinary team working closely with the Pomak community of Greece. We have developed and published a gold morphologically annotated corpus of Pomak; morphological annotation has profited greatly from "Rodopsky", the electronic lexicon of Pomak, which contains about 3.5 × 10^6 morphologically annotated forms. Rodopsky was converted into the CoNLL-U format and its morphological annotation was mapped to the Universal Dependencies (UD) scheme. We evaluated several openly available state-of-the-art NLP tools with respect to how well they can be integrated into different NLP pipelines suitable for under-resourced languages. We identified two problems: (i) languages not already registered in the UD repository were difficult to process with spaCy and Stanza; (ii) both spaCy and Stanza required syntactic information as a prerequisite for the morphological task, although a language may offer valuable information for only one of these levels and it is widely accepted, also within UD, that syntax builds on morphology. Recently, Stanza has adopted the option of processing the two levels separately, based on our experience with Pomak.
We are now experimenting with active annotation for the (semi-)automatic assignment of dependency relations to the Pomak corpus, aiming at a considerable reduction of annotation costs. Our first results indicate that the quality of the morphological analysis directly affects the accuracy of the syntactic dependencies.
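To illustrate the kind of conversion described above, the following is a minimal sketch of how a morphologically annotated lexicon entry could be written as a CoNLL-U token line with the syntactic columns left unspecified. It is not the project's converter: the function names, the placeholder word form and lemma, and the feature values are all hypothetical, and the internal format of Rodopsky is not assumed.

```python
# CoNLL-U token lines have ten tab-separated columns:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
# Unspecified values are written as an underscore.

def format_feats(feats):
    """Serialize a feature dict as Name=Value pairs joined by '|',
    sorted by feature name (case-insensitively)."""
    if not feats:
        return "_"
    pairs = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)

def to_conllu_line(token_id, form, lemma, upos, feats=None):
    """Build a morphology-only token line: HEAD, DEPREL, DEPS and MISC
    stay '_' because no syntactic annotation is available yet."""
    columns = [
        str(token_id),        # ID
        form,                 # FORM
        lemma,                # LEMMA
        upos,                 # UPOS
        "_",                  # XPOS (not used in this sketch)
        format_feats(feats),  # FEATS
        "_",                  # HEAD
        "_",                  # DEPREL
        "_",                  # DEPS
        "_",                  # MISC
    ]
    return "\t".join(columns)

# Hypothetical entry: "FORM" and "LEMMA" are placeholders, not Pomak data.
line = to_conllu_line(
    1, "FORM", "LEMMA", "NOUN",
    {"Number": "Sing", "Case": "Nom", "Gender": "Masc"},
)
print(line)
```

Leaving the HEAD and DEPREL columns as underscores keeps the morphological layer usable on its own, which matches the separate-processing scenario discussed above.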
Files
poster.pdf (2.6 MB)
md5:3cb1c28fc5367f445bb9404f3db498f5