MENYO-20k: A Multi-domain English - Yorùbá Corpus for Machine Translation

David Ifeoluwa Adelani; Jesujoba O. Alabi; Damilola Adebonojo; Adesina Ayeni; Mofe Adeyemi; Ayodele Awokoya

doi:10.5281/zenodo.4297448

Published November 30, 2020 | Version 1.0

Dataset Open

1. Saarland University

Contributors

Data curator (7):

MENYO-20k is a multi-domain parallel dataset with texts obtained from news articles, TED talks, movie transcripts, radio transcripts, science and technology texts, and other short articles curated from the web and professional translators. The dataset has 20,100 parallel sentences split into 10,070 training sentences, 3,397 development sentences, and 6,633 test sentences (3,419 multi-domain, 1,714 news domain, and 1,500 TED talk speech transcript domain).
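As an illustration of how the parallel sentences might be consumed, the following is a minimal sketch for reading a split such as train.tsv. It assumes one English/Yorùbá pair per line, two tab-separated columns, and a header row; none of this layout is stated on this record, so check it against readme.txt before relying on it. The function name read_parallel_tsv is hypothetical.

```python
import csv
import os
from typing import Iterable, List, Tuple


def read_parallel_tsv(lines: Iterable[str], has_header: bool = True) -> List[Tuple[str, str]]:
    """Parse tab-separated (source, target) sentence pairs.

    Assumption: two columns per row, optionally preceded by a header row.
    Rows that do not have exactly two columns are skipped.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for index, row in enumerate(reader):
        if has_header and index == 0:
            continue  # skip the assumed header row
        if len(row) != 2:
            continue  # skip malformed rows
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


if __name__ == "__main__":
    # Only runs if train.tsv has been downloaded next to this script.
    if os.path.isfile("train.tsv"):
        with open("train.tsv", encoding="utf-8", newline="") as handle:
            sentence_pairs = read_parallel_tsv(handle)
        print(len(sentence_pairs), "sentence pairs read")
```

If the assumed layout holds, the number of pairs read from train.tsv should match the 10,070 training sentences quoted in the description.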

The dataset is open for non-commercial use only, because some of the data sources, such as TED talks and JW news, require permission for commercial use.

Acknowledgement: This project was supported by the AI4D language dataset fellowship through K4All and Zindi Africa.

Files

Files (2.5 MB)

Name	MD5	Size
readme.txt	c845d69bec68ac19a2f57338a763bbbb	2.5 kB
train.tsv	06e1851230484547e03c8a1036d76bc7	2.5 MB
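The MD5 checksums listed for the files can be used to confirm that a download is intact. A minimal sketch, assuming the two files have been saved under their listed names in a local directory; the helper names md5_of_file and verify_downloads are hypothetical, and the expected digests are copied from the file listing on this record.

```python
import hashlib
from pathlib import Path

# Digests copied from the file listing on this record.
EXPECTED_MD5 = {
    "readme.txt": "c845d69bec68ac19a2f57338a763bbbb",
    "train.tsv": "06e1851230484547e03c8a1036d76bc7",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(name, "OK" if ok else "missing or checksum mismatch")
```

A mismatch usually means a truncated or altered download, in which case fetching the file again is the simplest remedy.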

Additional details

David I. Adelani et al. MENYO-20k: A Multi-domain English - Yorùbá Corpus for Machine Translation. https://github.com/dadelani/menyo-20k_MT

	All versions	This version
Views	1,103	1,098
Downloads	351	349
Data volume	574.1 MB	571.6 MB
