Published February 16, 2023 | Version 1.2
Dataset Restricted

SautiDB: Nigerian Accent Dataset Collection


The SautiDB dataset collection project is an ongoing effort to collect speech datasets covering various Nigerian accents. The data was collected in an uncontrolled manner: users who visit our web app can record their voice and contribute to the dataset. The web app uses the browser's audio web API to collect voice samples. We hope this dataset will be useful to people interested in developing voice technology in Nigeria. We will continue to collect data and publish updated versions as they become available. This work grew out of our project, Improving Online Experience using Accent Transfer.

The filename is of the form nativeLanguage_fluentLanguage_speakerID_gender_sentenceID.wav, where

  • nativeLanguage: the native (mother) language of the speaker, i.e. the language spoken by the speaker's tribe.
  • fluentLanguage: the language that the speaker feels best describes their accent.
  • speakerID: the ID assigned to the speaker. A single speaker may have multiple IDs, because we did not authenticate users; we simply cached their browser sessions.
  • gender: the gender of the speaker. We did not explicitly collect this information from users; we hand-labeled it.
  • sentenceID: the ID of the sentence that was read. We used the CMU Arctic sentences.
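
As an illustration of the naming scheme, the following is a minimal Python sketch that splits a filename into its five fields. It assumes the version 1.1 and later convention, in which language names contain no underscores. The names `SautiFilename` and `parse_filename` are hypothetical helpers introduced here and are not part of the dataset or its repository.

```python
from pathlib import Path
from typing import NamedTuple


class SautiFilename(NamedTuple):
    """The five fields encoded in a SautiDB filename (hypothetical helper type)."""
    native_language: str
    fluent_language: str
    speaker_id: str
    gender: str
    sentence_id: str


def parse_filename(filename: str) -> SautiFilename:
    """Split 'native_fluent_speaker_gender_sentence.wav' into its fields.

    Assumes version 1.1+ names, where language names have no underscores.
    """
    stem = Path(filename).stem          # drop the directory and the '.wav' extension
    parts = stem.split("_")
    if len(parts) != 5:
        # Version 1.0 names such as 'EFIK_IBIBIO_EFIK_IBIBIO_0014_M_A0138.wav'
        # have extra underscores and cannot be split unambiguously this way.
        raise ValueError(f"expected 5 underscore-separated fields, got {len(parts)}: {filename}")
    return SautiFilename(*parts)


info = parse_filename("EDO_YORUBA_0053_M_B0389.wav")
print(info.native_language, info.fluent_language, info.speaker_id, info.gender, info.sentence_id)
```

The speaker ID is kept as a string so that leading zeros (for example '0053') are preserved.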

Before Postprocessing
Number of Samples: 1615
Size Webm: 59MB
Size Wav: 847MB
Sampling Rate: 48000Hz
Total Time: 2hrs 30min 21sec

After Postprocessing
Number of Samples: 919
Size Wav: 336MB
Sampling Rate: 48000Hz
Total Time: 0hrs 59min 08sec

Version 1.1
This version has two updates:

1. In version 1.0, each language name was uppercased and any spaces in it were replaced with underscores, e.g., "Efik Ibibio" -> "EFIK_IBIBIO". We have now removed those underscores, so "EFIK_IBIBIO" -> "EFIKIBIBIO". For example, the file previously named 'EFIK_IBIBIO_EFIK_IBIBIO_0014_M_A0138.wav' is now named 'EFIKIBIBIO_EFIKIBIBIO_0014_M_A0138.wav'. This change applies only to language names that contain spaces. All other filenames remain unchanged, e.g., 'EDO_YORUBA_0053_M_B0389.wav' is still 'EDO_YORUBA_0053_M_B0389.wav'.

2. We include an audio_metadata.csv file containing the columns 'filename', 'nativeLanguage', 'fluentLanguage', 'speakerID', 'gender', 'sentenceID', 'sentence', and 'duration'. We hope this will make it easier for users to use our dataset in their work. The duration was calculated using the function 'librosa.get_duration()'.
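
As a rough illustration of how the metadata file might be used, the sketch below sums the 'duration' column per native language using only the Python standard library. It assumes that the CSV is comma-separated, that its header uses exactly the column names listed above, and that durations are in seconds (which is what 'librosa.get_duration()' returns). The two sample rows reuse filenames quoted in this description, but their sentence text and durations are placeholders, not real values from the dataset. The function names are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO

# Placeholder rows for illustration only; sentence text and durations are NOT real dataset values.
SAMPLE_CSV = """filename,nativeLanguage,fluentLanguage,speakerID,gender,sentenceID,sentence,duration
EFIKIBIBIO_EFIKIBIBIO_0014_M_A0138.wav,EFIKIBIBIO,EFIKIBIBIO,0014,M,A0138,placeholder sentence,3.5
EDO_YORUBA_0053_M_B0389.wav,EDO,YORUBA,0053,M,B0389,placeholder sentence,4.0
"""


def duration_by_native_language(csv_file: TextIO) -> Dict[str, float]:
    """Return the total duration in seconds for each nativeLanguage value."""
    totals: Dict[str, float] = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["nativeLanguage"]] += float(row["duration"])
    return dict(totals)


def format_hms(seconds: float) -> str:
    """Format seconds in the 'Xhrs YYmin ZZsec' style used in this description."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}hrs {minutes:02d}min {secs:02d}sec"


totals = duration_by_native_language(io.StringIO(SAMPLE_CSV))
for language, seconds in totals.items():
    print(language, format_hms(seconds))
```

With the real file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an opened file handle, for example `open("audio_metadata.csv", newline="", encoding="utf-8")`.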

Version 1.2
This version adds the Hausa language.

After Postprocessing
Number of Samples: 1137
Size Wav: 426MB
Sampling Rate: 48000Hz
Total Time: 1hrs 15min 24sec
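
For a quick sense of scale, the reported totals can be turned into an average clip length per sample. The sketch below only does arithmetic on the sample counts and total times listed in this description; the helper name `to_seconds` and the dictionary labels are introduced here for illustration.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an hours/minutes/seconds triple to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# (number of samples, total time in seconds), copied from the figures in this description
REPORTED = {
    "before postprocessing": (1615, to_seconds(2, 30, 21)),
    "after postprocessing": (919, to_seconds(0, 59, 8)),
    "version 1.2": (1137, to_seconds(1, 15, 24)),
}

# Average clip length = total time / number of samples
averages = {label: total / count for label, (count, total) in REPORTED.items()}

for label, avg in averages.items():
    print(f"{label}: {avg:.2f} s per sample on average")
```

By this calculation the average clip is roughly 5.6 seconds before postprocessing and roughly 4 seconds afterwards.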

The associated GitHub repository used for post-processing is also linked from this record. We are grateful for funding from AI4D-IndabaX under IDRC Grant Number 109187-002.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.



The record is publicly accessible, but files are restricted to users with access.
