CL-MASR
- 1. Concordia University
- 2. Télécom Paris
- 3. Université Laval
Description
CL-MASR Dataset
This is the dataset used in the Continual Learning for Multilingual ASR (CL-MASR) benchmark. It consists of speech recordings in 20 languages selected from the Common Voice 13 dataset. For each language, it includes up to 10 hours of training data, 1 hour of development data, and 1 hour of test data.
The CL-MASR benchmark platform is available in the SpeechBrain toolkit (see recipes/CommonVoice):
https://github.com/speechbrain/speechbrain
The original Common Voice 13 data are available at:
https://commonvoice.mozilla.org/en/datasets
List of Languages
- English (en)
- Chinese (zh-CN)
- German (de)
- Spanish (es)
- Russian (ru)
- French (fr)
- Portuguese (pt)
- Japanese (ja)
- Turkish (tr)
- Polish (pl)
- Kinyarwanda (rw)
- Esperanto (eo)
- Kabyle (kab)
- Luganda (lg)
- Meadow Mari (mhr)
- Central Kurdish (ckb)
- Abkhaz (ab)
- Kurmanji Kurdish (kmr)
- Frisian (fy-NL)
- Interlingua (ia)
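For scripting over the dataset, the language list and the per-language split limits stated in the description can be written down as plain data. The following is a minimal sketch, not part of the official benchmark code: the names `LANGUAGES` and `MAX_HOURS_PER_SPLIT` are illustrative, and the computed total is only an upper bound, since each split holds "up to" the stated number of hours.

```python
# Language name -> locale code, in the order listed above.
# The variable names are illustrative, not taken from the benchmark code.
LANGUAGES = {
    "English": "en",
    "Chinese": "zh-CN",
    "German": "de",
    "Spanish": "es",
    "Russian": "ru",
    "French": "fr",
    "Portuguese": "pt",
    "Japanese": "ja",
    "Turkish": "tr",
    "Polish": "pl",
    "Kinyarwanda": "rw",
    "Esperanto": "eo",
    "Kabyle": "kab",
    "Luganda": "lg",
    "Meadow Mari": "mhr",
    "Central Kurdish": "ckb",
    "Abkhaz": "ab",
    "Kurmanji Kurdish": "kmr",
    "Frisian": "fy-NL",
    "Interlingua": "ia",
}

# Upper limits per language, in hours, as stated in the description.
MAX_HOURS_PER_SPLIT = {"train": 10, "dev": 1, "test": 1}

# Upper bound on the total audio duration across all languages and splits.
max_total_hours = len(LANGUAGES) * sum(MAX_HOURS_PER_SPLIT.values())
print(f"{len(LANGUAGES)} languages, at most {max_total_hours} hours of audio")
```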
Files
Size | Checksum |
---|---|
5.9 GB | md5:810aa5d24ddacd3bf4c870b2c577f9c4 |
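The MD5 checksum listed for the archive can be used to confirm that a download completed without corruption. The sketch below uses only Python's standard `hashlib` module and reads the file in chunks so that a 5.9 GB archive does not have to fit in memory. The function names and the example path are hypothetical; the archive's file name is not given in this record.

```python
import hashlib

# Checksum as listed in this record, including its "md5:" prefix.
EXPECTED_MD5 = "md5:810aa5d24ddacd3bf4c870b2c577f9c4"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with a value given as '<hex>' or 'md5:<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("path/to/downloaded_archive")` returns `True` only if the local file's digest equals the listed one; the path here is a placeholder for wherever the archive was saved.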
Additional details
References
- M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio. SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624, 2021.
- R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber. Common Voice: A massively-multilingual speech corpus. In Twelfth Language Resources and Evaluation Conference, pages 4218–4222, 2020.