waleghwa/low-resource-language-data: Parallel Corpora for Kiswahili and Kidaw'ida, Kalenjin and Dholuo
Description
Description: The dataset consists of three parallel corpora: Kidaw'ida-Kiswahili, Kalenjin-Kiswahili, and Dholuo-Kiswahili. On average, each corpus has thirty thousand sentence pairs. The dataset is also available on GitHub, where it will continue to grow and improve in quality. Future releases will be uploaded here on Zenodo as new versions.
Purpose of the dataset: The dataset was created for training machine translation models. The aim is to enable translation from Kiswahili, the national language of Kenya, into indigenous languages. Three indigenous Kenyan languages were targeted: Kidaw'ida, Kalenjin, and Dholuo.
Principal Investigator: Audrey Mbogho, United States International University - Africa
Co-Investigators:
- Andrew Kipkebut, Kabarak University
- Quin Awuor, United States International University - Africa
- Rose Lugano, University of Florida
- Lilian Wanzare, Maseno University
- Vivian Oloo, Maseno University
Funding: This dataset was collected with funding from Lacuna Fund.
Files
Name | Size | Checksum |
---|---|---|
waleghwa/low-resource-language-data-v1.0.0.zip | 3.3 MB | md5:6d5ee1290557218da861b667a2434635 |
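The MD5 checksum listed for the archive can be used to confirm that a download is intact. The following is a minimal sketch, not part of the dataset itself: the local file name `low-resource-language-data-v1.0.0.zip` and the helper names `md5_of_file` and `verify_archive` are assumptions made for illustration, and only the checksum value comes from this record.

```python
import hashlib
from pathlib import Path

# Checksum as listed for the archive on this record.
EXPECTED_MD5 = "6d5ee1290557218da861b667a2434635"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local file name; adjust to wherever the archive was saved.
    archive = Path("low-resource-language-data-v1.0.0.zip")
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually means the download was truncated or corrupted, in which case downloading the archive again is the simplest remedy.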
Additional details
Related works
- Is supplement to
- Software: https://github.com/waleghwa/low-resource-language-data/tree/v1.0.0 (URL)