Published March 29, 2024 | Version 1
Dataset Open

Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal

  • 1. Orange (France)
  • 2. Jokalante SARL
  • 3. École Polytechnique de Thiès
  • 1. Ziguinchor University
  • 2. Cheikh Anta Diop University

Description

This dataset contains transcribed speech in Wolof, Pulaar and Sereer.

The recordings are about agriculture. The recorded speakers include farmers, agricultural advisers, and agri-food business managers. The recording types comprise interactive radio programmes, focus groups, voice messages, push messages and interviews, so spontaneous speech predominates. Audio quality may vary depending on the type of programme.

Content description:

  • speech_dataset_wol.tar.gz: the Wolof (ISO 639-3 code: wol) speech dataset contains 55 hours of transcribed speech, including almost 13 hours whose content was checked and validated by an expert. It also contains an X-SAMPA lexicon (49,132 phonetised entries) and a text corpus (1,140,508 words).
  • speech_dataset_fuc.tar.gz: the Pulaar (ISO 639-3 code: fuc) speech dataset contains nearly 32 hours of transcribed speech, including around 11 hours whose content was checked and validated by an expert. It also contains a text corpus (742,024 words).
  • speech_dataset_srr.tar.gz: the Sereer (ISO 639-3 code: srr) speech dataset contains 38 hours of transcribed speech, including nearly 11 hours whose content was checked and validated by an expert.

In total, these resources provide 125 hours of transcribed speech in the 3 most widely spoken languages in Senegal, including 35 hours of checked transcriptions.
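The internal layout of the three archives is not described on this page, so a reasonable first step after downloading is to list what an archive contains before extracting it. The following is a minimal Python sketch using only the standard library. The function name `list_archive_members` and the `limit` parameter are illustrative assumptions, not part of the dataset or its repository.

```python
import tarfile
from itertools import islice


def list_archive_members(archive_path, limit=20):
    """Return the names of the first `limit` members of a .tar.gz archive.

    Hypothetical helper: the archive's internal layout is not documented
    on the record page, so this only reads member names and extracts nothing.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # Iterating a TarFile yields TarInfo objects lazily, so only the
        # first `limit` headers are read rather than the whole archive.
        return [member.name for member in islice(archive, limit)]
```

For example, `list_archive_members("speech_dataset_wol.tar.gz")` would return up to 20 member names from the Wolof archive, assuming it has been downloaded to the current directory.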

This work is a result of the Kallaama project, funded by Lacuna Fund for one year, in 2023.

See the GitHub repository for more details about the dataset.

Files (12.6 GB)

Checksum Size
md5:87895c981fa9593e232380e07c331c59 3.1 GB
md5:5e11b5c8140680e74009c13c43366cc6 4.0 GB
md5:984d0804f628a3257f3d063ff4a4be31 5.4 GB
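The MD5 digests listed above can be used to check that a downloaded archive is intact. The sketch below is one way to do this in Python. Because this page does not show which digest belongs to which file, the check only tests whether a file's digest is one of the three published values. The names `PUBLISHED_MD5`, `md5_of_file` and `matches_published_checksum` are illustrative assumptions.

```python
import hashlib

# MD5 digests copied from the file listing on this record page.
# Assumption: the file-to-digest mapping is not stated here, so the
# digests are kept as an unordered set.
PUBLISHED_MD5 = {
    "87895c981fa9593e232380e07c331c59",
    "5e11b5c8140680e74009c13c43366cc6",
    "984d0804f628a3257f3d063ff4a4be31",
}


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """Return True if the file's MD5 digest is one of the published values."""
    return md5_of_file(path) in PUBLISHED_MD5
```

A call such as `matches_published_checksum("speech_dataset_srr.tar.gz")` returning `False` would suggest an incomplete or corrupted download, assuming the file was saved under that name.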

Additional details

Additional titles

Alternative title (Wolof)
Kallaama Wolof speech dataset
Alternative title (Pulaar)
Kallaama Pulaar speech dataset
Alternative title (Serer)
Kallaama Sereer speech dataset

Related works

Is published in
Conference paper: arXiv:2404.01991 (arXiv)

Dates

Collected
2023
Data collection