Published December 30, 2019 | Version v1
Journal article Open

Siminchik: A Speech Corpus for Preservation of Southern Quechua

Description

Languages are disappearing at an alarming rate, and the linguistic rights of speakers of most of the world's 7000 languages are at risk. ICT plays a key role in the preservation of endangered languages. As the ultimate application of ICT, natural language processing deserves particular attention, since in this century the lack of such support hampers literacy acquisition and prevents the use of the Internet and other electronic means.
The first step is to build resources for language processing. We therefore introduce Siminchik, the first speech corpus of Southern Quechua, suitable for training and evaluating speech recognition systems. The corpus consists of 97 hours of spontaneous conversations recorded from radio programs in the southern regions of Peru. Native speakers from those regions carried out the annotation using the unified written convention. We present initial experiments on speech recognition and language modeling and explain the challenges inherent to the nature and current status of this ancestral language.

Notes

Cardenas, R., Zevallos, R., Baquerizo, R., & Camacho, L. (2018). Siminchik: A Speech Corpus for Preservation of Southern Quechua. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC'18).

Files

paper02siminchik-speech-corpusPaperJapon.pdf (197.2 kB)
md5:b4280a17076c042908937511ab47b1e1