Published December 28, 2024 | Version v1
Journal article · Open Access

Insights into Low-Resource Language Modelling: Improving Model Performances for South African Languages

  • Stellenbosch University, Stellenbosch, South Africa

Description

To address the gap in natural language processing for Southern African languages, our paper presents an in-depth analysis of language model development under resource-constrained conditions. We investigate the interplay between model size, pretraining objective, and multilingual dataset composition in the context of low-resource languages such as Zulu and Xhosa. In our approach, we first pretrain language models from scratch on individual low-resource languages using a variety of model configurations, then incrementally add related languages to explore their effect on model performance. We demonstrate that smaller data volumes can be leveraged effectively, and that the choice of pretraining objective and multilingual dataset composition significantly influences model performance. Our monolingual and multilingual models exhibit competitive, and in some cases superior, performance compared to established multilingual models such as XLM-R-base and AfroXLM-R-base.
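The incremental procedure the abstract outlines — pretrain on one low-resource language, then grow the training mix with related languages one at a time — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the language codes (ISO 639-3) and the ordering of related Nguni languages are assumptions for the example.

```python
# Hypothetical sketch of the incremental multilingual-composition protocol:
# start from a single base language, then add related languages one at a time,
# producing one training-set composition per pretraining run.

def incremental_compositions(base, related):
    """Yield successive training-set compositions: the base language alone,
    then the base plus one, two, ... related languages."""
    for k in range(len(related) + 1):
        yield [base] + related[:k]

# Example: start from Zulu ("zul") and grow the corpus with related Nguni
# languages -- Xhosa ("xho"), Swati ("ssw"), Ndebele ("nbl"). This ordering
# is illustrative, not the paper's exact setup.
runs = list(incremental_compositions("zul", ["xho", "ssw", "nbl"]))
# runs[0]  -> ["zul"]                        (monolingual baseline)
# runs[-1] -> ["zul", "xho", "ssw", "nbl"]   (full multilingual mix)
```

Each composition in `runs` would correspond to one from-scratch pretraining run, letting the effect of each added language be measured against the monolingual baseline.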

Files

jucs_article_118889.pdf (1.1 MB)
md5:bfffee6cd59996423dedece172d17fdc
