The South African Next Voices Multilingual Speech Dataset [Compressed]

Marivate, Vukosi; Olaleye, Kayode; Mundia, Sitwala; Bakainga, Andinda; Netshifhefhe, Unarine Leo; Milanzie, Mahmooda; Mogale, Hope; SINDANE, THAPELO; Abdulrasaq, Zainab; Mokgosi, Kesego; Okorie, Chijioke; van Wyk, Nia Zion; Morrissey, Graham; Dunbar, Dale; Smit, Francois; Chidi, Tsosheletso; Mabuya, Rooweither; Bukula, Andiswa; MLAMBO, RESPECT; Macucwa, Solomon Tebogo; Abdulmumin, Idris; Rananga, Seani

doi:10.5281/zenodo.17776290

Published November 30, 2025 | Version 1.0

Dataset Restricted

1. University of Pretoria
2. Technological University Dublin
3. Penguide Advisory
4. South African Centre for Digital Language Resources
5. Pennsylvania State University
6. North-West University

Swivuriso: ZA-African Next Voices-Compressed

Swivuriso is a large-scale multilingual speech dataset targeting over 3000 hours of audio across seven South African languages. It was developed to support Automatic Speech Recognition (ASR) and inclusive speech technologies for low-resource African languages. It combines scripted and unscripted speech, collected through ethical, community-centered processes.

Dataset Paper: arXiv (work in progress)

This is a compressed version of the original dataset: https://huggingface.co/datasets/dsfsi-anv/za-african-next-voices/

⚠️ IMPORTANT: Visit the original dataset for full details

Language Coverage

Language	Target Hours	Released
isiZulu	500	▇▇▇▇▇▇▇▇▇▇ 100%
isiXhosa	500	▇▇▇▇▇▇▇▇▇▇ 100%
Sesotho	500	▇▇▇▇▇▇▇▇▇▇ 100%
Setswana	500	▇▇▇▇▇▇▇▇▇▇ 100%
Xitsonga	500	▇▇▇▇▇▇▇▇▇▇ 100%
isiNdebele	250	▇▇▇▇▇▇▇▇▇▇ 100%
Tshivenda	250	▇▇▇▇▇▇▇▇▇▇ 100%
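
As a quick sanity check on the table above, the sketch below tallies the per-language target hours and each language's share of the total. The values are copied from the table; the names TARGET_HOURS, total_target_hours, and share_by_language are illustrative only and are not part of any tooling shipped with the dataset.

```python
# Target hours per language, copied from the Language Coverage table.
TARGET_HOURS = {
    "isiZulu": 500,
    "isiXhosa": 500,
    "Sesotho": 500,
    "Setswana": 500,
    "Xitsonga": 500,
    "isiNdebele": 250,
    "Tshivenda": 250,
}


def total_target_hours(hours):
    """Return the sum of target hours across all languages."""
    return sum(hours.values())


def share_by_language(hours):
    """Return each language's fraction of the total target hours."""
    total = total_target_hours(hours)
    return {lang: h / total for lang, h in hours.items()}


if __name__ == "__main__":
    total = total_target_hours(TARGET_HOURS)
    print(f"{len(TARGET_HOURS)} languages, {total} target hours")
    for lang, share in share_by_language(TARGET_HOURS).items():
        print(f"{lang:<12}{share:6.1%}")
```

The seven targets sum to 3000 hours, which matches the headline figure in the description.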

Use Restriction:

The persons whose voices are included in this dataset, and the creators and owners of this dataset* do not give consent in any manner or form to, and strictly prohibit any use of this dataset for any form of text-to-speech (TTS), voice cloning, voice synthesis, or any technology or activity intended to replicate, mimic or generate human voices or any technology or activity resulting in the replication, mimicry or generation of human voices.

This dataset includes scripted and unscripted speech across various domains such as agriculture, health, finance, sports, transport, culture, society, and general topics. It is primarily designed for use in automatic speech recognition (ASR) tasks.

Use of this dataset for any form of text-to-speech (TTS), voice cloning, voice synthesis, or any technology intended to replicate or generate human voices is strictly prohibited.

These restrictions are in place until further notice.

Citations

If you use Swivuriso in your work, please cite both of the following:

Dataset

@dataset{za-african-next-voices-2025,
  title     = {The South African Next Voices Multilingual Speech Dataset},
  author    = {Marivate, Vukosi and
               Olaleye, Kayode and
               Mundia, Sitwala and
               Bakainga, Andinda and
               Netshifhefhe, Unarine Leo and
               Milanzie, Mahmooda and
               Mogale, Hope and
               SINDANE, THAPELO and
               Abdulrasaq, Zainab and
               Mokgosi, Kesego and
               Okorie, Chijioke and
               van Wyk, Nia Zion and
               Morrissey, Graham and
               Dunbar, Dale and
               Smit, Francois and
               Chidi, Tsosheletso and
               Mabuya, Rooweither and
               Bukula, Andiswa and
               MLAMBO, RESPECT and
               Macucwa, Solomon Tebogo and
               Abdulmumin, Idris and
               Rananga, Seani},
  url2      = {https://github.com/dsfsi/za-african-next-voices},
  url3      = {https://www.dsfsi.co.za/za-african-next-voices/},
  year      = {2025},
  type      = {dataset},
  publisher = {Zenodo},
  version   = {1.0},
  doi       = {10.5281/zenodo.17776289},
  url       = {https://doi.org/10.5281/zenodo.17776289},
}

Research Paper

Will be available soon.

Files

Restricted

The record is publicly accessible, but files are restricted.

Additional details

Repository URL: https://github.com/dsfsi/za-african-next-voices/

