Published February 3, 2023 | Version 0.5
Dataset Open

The Vuk'uzenzele South African Multilingual Corpus

  • 1. University of Pretoria
  • 2. University of the Witwatersrand

Description

# The Vuk'uzenzele South African Multilingual Corpus
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7598539.svg)](https://doi.org/10.5281/zenodo.7598539)

Github: https://github.com/dsfsi/vukuzenzele-nlp

## About dataset
The dataset contains editions from the South African government magazine Vuk'uzenzele. Data was scraped from PDFs that have been placed in the [data/raw](data/raw/) folder.
The PDFS were obtained from the [Vuk'uzenzele website](https://www.vukuzenzele.gov.za/).

The datasets contain government magazine editions in 11 languages, namely:

|  Language  | Code |  Language  | Code |
|------------|-------|------------|-------|
| English    | (eng) | Sepedi     | (sep) |
| Afrikaans  | (afr) | Setswana   | (tsn) |
| isiNdebele | (nbl) | Siswati    | (ssw) |
| isiXhosa   | (xho) | Tshivenda  | (ven) |
| isiZulu    | (zul) | Xitstonga  | (tso) |
| Sesotho    | (nso) |

### Number of Aligned Pairs with Cosine Similarity Score >= 0.65

| src_lang | trg_lang | num_aligned_pairs |
|----------|----------|-------------------|
| ven      | zul      | 186               |
| ssw      | xho      | 1965              |
| sep      | xho      | 279               |
| nbl      | zul      | 227               |
| nso      | tsn      | 1279              |
| nso      | tso      | 1491              |
| tsn      | zul      | 1346              |
| afr      | eng      | 1369              |
| eng      | ssw      | 1601              |
| afr      | ssw      | 1496              |
| nbl      | ssw      | 264               |
| tso      | zul      | 1758              |
| afr      | zul      | 1384              |
| eng      | zul      | 1888              |
| ssw      | tsn      | 1263              |
| sep      | tsn      | 302               |
| nso      | xho      | 1248              |
| sep      | tso      | 324               |
| ssw      | tso      | 1657              |
| tsn      | ven      | 235               |
| eng      | nbl      | 153               |
| nso      | sep      | 349               |
| afr      | nbl      | 359               |
| nbl      | ven      | 657               |
| eng      | ven      | 243               |
| afr      | ven      | 281               |
| tso      | ven      | 256               |
| ven      | xho      | 215               |
| eng      | tsn      | 1380              |
| afr      | tsn      | 1076              |
| nso      | ssw      | 1132              |
| eng      | tso      | 2016              |
| afr      | tso      | 1139              |
| xho      | zul      | 1895              |
| tsn      | xho      | 1209              |
| sep      | zul      | 223               |
| nbl      | xho      | 204               |
| ssw      | zul      | 2161              |
| afr      | xho      | 1363              |
| eng      | xho      | 1354              |
| tso      | xho      | 1485              |
| sep      | ssw      | 219               |
| nbl      | tso      | 215               |
| tsn      | tso      | 1570              |
| nso      | zul      | 1247              |
| nbl      | tsn      | 140               |
| eng      | sep      | 276               |
| afr      | sep      | 394               |
| ssw      | ven      | 217               |
| sep      | ven      | 1140              |
| afr      | nso      | 962               |
| eng      | nso      | 1721              |
| nbl      | nso      | 151               |
| nbl      | sep      | 843               |
| nso      | ven      | 262               |


The dataset is present in several forms on the repo. 
Generally the dataset is split by edition, eg. `2020-01-ed1`  
The data directory is broken down as follows
```
./data
├── external                # Data external to this repo
├── interim                 # I am not really sure - looks like interim in regards to processed.
├── processed               # The data from scraping the raw pdfs
├── raw                     # The raw pdfs of the Vuk'uzenzele magazine
├── sentence_align_output   # The output (csv) of the sentence alignment with LASER language encoders
└── simple_align_output     # The output (csv) of a simple one to one sentence alignment
```
The dataset is split by edition in the [data/processed](data/processed/) folder.

Authors
-------
- Vukosi Marivate - [@vukosi](https://twitter.com/vukosi)
- Andani Madodonga
- Daniel Njini
- Richard Lastrucci

Citation
--------
Vukosi Marivate, Andani Madodonga, Daniel Njini, Richard Lastrucci, Isheanesu Dzingirai
. **The Vuk'uzenzele South African Multilingual Corpus**, 2023

> @inproceedings{lastrucci-etal-2023-preparing,
    title = "Preparing the Vuk{'}uzenzele and {ZA}-gov-multilingual {S}outh {A}frican multilingual corpora",
    author = "Richard Lastrucci and Isheanesu Dzingirai and Jenalea Rajab and Andani Madodonga and Matimba Shingange and Daniel Njini and Vukosi Marivate",
    booktitle = "Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.rail-1.3",
    pages = "18--25"
}

> @dataset{marivate_vukosi_2023_7598540,
  author       = {Marivate, Vukosi and
                  Njini, Daniel and
                  Madodonga, Andani and
                  Lastrucci, Richard and
                  Dzingirai, Isheanesu},
  title        = {The Vuk'uzenzele South African Multilingual Corpus},
  month        = feb,
  year         = 2023,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.7598539},
  url          = {https://doi.org/10.5281/zenodo.7598539}
}

Licences
-------
* License for Data - [CC 4.0 BY SA](LICENSE.data.md)
* Licence for Code - [MIT License](LICENSE.md)

Files

DATASHEET.md

Files (27.6 MB)

Name Size Download all
md5:819439d19419f1827d24c84a98e5f32c
21.9 kB Preview Download
md5:38666c7a676cc7332fab6a2d676b4c7a
3.0 MB Preview Download
md5:8873598f6dc62aaf8c359ea6ce54d1d1
2.0 kB Preview Download
md5:4051b59f69919950d185347997550ac7
24.6 MB Preview Download

Additional details

Related works

Is published in
Preprint: 10.48550/arXiv.2303.03750 (DOI)
Conference paper: https://aclanthology.org/2023.rail-1.3/ (URL)