Published May 3, 2023 | Version v1
Poster Open

On treebank development for under-resourced languages with active annotation

  • 1. Athena-Research and Innovation Center in Information, Communication and Knowledge Technologies


About half of the living languages and dialects, including the Greek ones, are endangered, and language loss is accelerating because of globalization and neocolonialism. In addition, only a few languages are endowed with the resources that ensure their survival in the AI era. Saving and revitalizing the linguistic heritage of humanity has become important for maintaining global cultural diversity. Natural language processing (NLP) can support the preservation and documentation of endangered and under-resourced languages; documentation is a particular challenge because endangered languages typically lack written resources. Language technologies enable the development of digital archives and linguistic resources by automating technical processes, which allows language documentation practitioners to focus on the linguistic aspects. NLP can also provide insights into the unique linguistic features of endangered languages. In line with these considerations, the project Philotis has developed a workflow and platform to support the multimodal documentation of living languages. Advanced NLP technology is used to develop spoken and textual corpora (up to the level of a treebank) from raw documentation materials. The Philotis workflow accommodates several documentation scenarios and was tested on Pomak, an endangered oral Slavic language of the Balkans. Several state-of-the-art NLP tools that can be integrated into different NLP pipelines suitable for under-resourced languages were evaluated under this framework.

Developing treebanks is a difficult and extremely time-consuming process. Active learning approaches have been proposed to automate part of the process and reduce the total annotation duration and cost. Crucially, under-resourced languages typically offer (severely) limited amounts of data and few, if any, experts for data development and annotation. A practical active annotation strategy for such a case can be implemented as an online learning approach, in which randomly selected sentences pass through a loop of annotation prediction, manual correction, and model retraining. To implement this approach in a realistic scenario, we used 300 annotated sentences from the Pomak corpus published by Philotis on the Universal Dependencies repository. Using a simple weighted sum of four potential annotation errors (lemma, part of speech, dependency pair, and dependency label), we ran several online annotation experiments, which revealed an underlying optimal strategy. Compared to manual annotation, this strategy reduced the total annotation duration by 54% and the total cost by 63%.
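
The online annotation loop and the weighted error sum described above might be sketched as follows. This is a minimal illustration and not the project's implementation: the weight values, the reading of "dependency pair" as the head attachment of a token, and the names `Token`, `correction_cost`, `online_annotation`, `toy_predict`, and `toy_retrain` are all assumptions introduced here. The gold annotation stands in for the annotator's corrected output, and a word-memorizing stub stands in for a real parser.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    lemma: str
    upos: str    # part of speech
    head: int    # index of the governing token ("dependency pair", assumed)
    deprel: str  # dependency label


# Hypothetical weights; the abstract does not state the values used.
WEIGHTS = {"lemma": 1.0, "upos": 1.0, "head": 1.0, "deprel": 1.0}


def correction_cost(predicted, gold, weights=WEIGHTS):
    """Weighted sum of the four error types over one sentence."""
    cost = 0.0
    for p, g in zip(predicted, gold):
        cost += weights["lemma"] * (p.lemma != g.lemma)
        cost += weights["upos"] * (p.upos != g.upos)
        cost += weights["head"] * (p.head != g.head)
        cost += weights["deprel"] * (p.deprel != g.deprel)
    return cost


def online_annotation(sentences, gold, predict, retrain, batch_size, seed=0):
    """Simulate predict -> correct -> retrain over randomly ordered sentences.

    Returns the correction cost accumulated in each batch.
    """
    order = list(range(len(sentences)))
    random.Random(seed).shuffle(order)  # random sentence selection
    annotated, costs, model = [], [], None
    for start in range(0, len(order), batch_size):
        batch_cost = 0.0
        for i in order[start:start + batch_size]:
            predicted = predict(model, sentences[i])
            batch_cost += correction_cost(predicted, gold[i])
            # Manual correction is simulated by adopting the gold annotation.
            annotated.append((sentences[i], gold[i]))
        costs.append(batch_cost)
        model = retrain(annotated)  # retrain on everything corrected so far
    return costs


def toy_predict(model, words):
    """Stub parser: recall memorized tokens, otherwise guess a default."""
    model = model or {}
    return [model.get(w, Token(w, "X", 0, "dep")) for w in words]


def toy_retrain(annotated):
    """Stub trainer: memorize the corrected token for each word form."""
    return {w: t for words, tokens in annotated for w, t in zip(words, tokens)}
```

In such a simulation, the per-batch costs returned by `online_annotation` can be compared across batch sizes or weightings, which is one way to look for a cheaper retraining schedule.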


