Published December 28, 2024 | Version v1
Journal article · Open Access

Insights into Low-Resource Language Modelling: Improving Model Performances for South African Languages

  • Stellenbosch University, Stellenbosch, South Africa

Description

To address the gap in natural language processing for Southern African languages, our paper presents an in-depth analysis of language model development under resource-constrained conditions. We investigate the interplay between model size, pretraining objective, and multilingual dataset composition in the context of low-resource languages such as Zulu and Xhosa. In our approach, we first pretrain language models from scratch on individual low-resource languages using a variety of model configurations, then incrementally add related languages to explore their effect on model performance. We demonstrate that smaller data volumes can be leveraged effectively, and that the choice of pretraining objective and multilingual dataset composition significantly influences model performance. Our monolingual and multilingual models exhibit competitive, and in some cases superior, performance compared to established multilingual models such as XLM-R-base and AfroXLM-R-base.
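The incremental procedure the abstract outlines — pretrain on one low-resource language, then grow the training mix with related languages one at a time — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the language codes (ISO 639-3) and the ordering of related Nguni languages are assumptions for the example.

```python
# Hypothetical sketch of the incremental multilingual-composition protocol:
# start from a single base language, then add related languages one at a time,
# producing one training-set composition per pretraining run.

def incremental_compositions(base, related):
    """Yield successive training-set compositions: the base language alone,
    then the base plus one, two, ... related languages."""
    for k in range(len(related) + 1):
        yield [base] + related[:k]

# Example: start from Zulu ("zul") and grow the corpus with related Nguni
# languages -- Xhosa ("xho"), Swati ("ssw"), Ndebele ("nbl"). This ordering
# is illustrative, not the paper's exact setup.
runs = list(incremental_compositions("zul", ["xho", "ssw", "nbl"]))
# runs[0]  -> ["zul"]                        (monolingual baseline)
# runs[-1] -> ["zul", "xho", "ssw", "nbl"]   (full multilingual mix)
```

Each composition in `runs` would correspond to one from-scratch pretraining run, letting the effect of each added language be measured against the monolingual baseline.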

Files

jucs_article_118889.pdf (1.1 MB)
md5:bfffee6cd59996423dedece172d17fdc
