Published October 31, 2025 | Version 0.5
Dataset Open

Mafoko Companion Dataset for Mafoko Open Multilingual Terminologies Paper

Description

The Mafoko project systematically aggregates, digitises, and standardises fragmented multilingual terminological resources for South Africa’s official languages. Sourced from government and academic repositories, these terminologies have historically been locked in non-machine-readable formats and inaccessible structures, limiting their use for linguistic research and NLP development.

This release provides the foundational Mafoko dataset, curated under the equitable Africa-centered NOODL licensing framework. Data is released in open, machine-readable formats (CSV/JSON) with provenance metadata and ISO language identifiers.
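As a minimal sketch of how a CSV release with ISO language identifiers might be consumed, the snippet below groups terms by language code. The column names (`term_id`, `language`, `term`, `source`) and the sample rows are assumptions made for illustration and are not taken from the Mafoko files; check the actual headers in the downloaded data before adapting it.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; the column names and values are assumptions,
# not the actual Mafoko schema or data.
SAMPLE_CSV = """term_id,language,term,source
1,eng,water,example-source
1,zul,amanzi,example-source
1,afr,water,example-source
"""


def group_by_language(csv_text, language_column="language", term_column="term"):
    """Return a dict mapping each language code to its list of terms."""
    terms = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        terms[row[language_column]].append(row[term_column])
    return dict(terms)


grouped = group_by_language(SAMPLE_CSV)
```

Passing the column names as parameters keeps the helper usable once the real headers are known.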

Author list

  • Vukosi Marivate (University of Pretoria; AfriDSAI; Lelapa AI)
  • Isheanesu Dzingirai (University of Pretoria)
  • Fiskani Banda (University of Pretoria)
  • Richard Lastrucci (University of Pretoria)
  • Thapelo Sindane (University of Pretoria)
  • Keabetswe Madumo (University of Pretoria)
  • Kayode Olaleye (University of Pretoria)
  • Abiodun Modupe (University of Pretoria)
  • Unarine Netshifhefhe (University of Pretoria)
  • Herkulaas Combrink (University of the Free State)
  • Mohlatlego Nakeng (University of Pretoria)
  • Matome Ledwaba (University of Pretoria)

Corresponding author: vukosi.marivate@cs.up.ac.za

Files

NOODL _Plain‑Language Explainer [V4].pdf

Files (5.2 MB)

Size      Checksum
225.2 kB  md5:6f523ac9b7a72cac768f4a57f210b702
145.2 kB  md5:4dfad7a329c1e1ec25307a273d15cb88
4.4 MB    md5:c1134e8243448570861b561606f79196
432.8 kB  md5:95da4c15478500b3c25df973199c333b
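The MD5 checksums listed for each file can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local file path is whatever name the downloaded file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum such as 'md5:6f52...' or a bare hex string."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(local_path, "md5:6f523ac9b7a72cac768f4a57f210b702")` should return `True` if `local_path` points to an intact copy of the first listed file.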

Additional details

Related works

Is supplement to: Preprint, arXiv:2508.03529

Software

Repository URL: https://github.com/dsfsi/za-mafoko
Programming language: Python
Development status: Active