CL-MASR

Luca Della Libera; Pooneh Mousavi; Salah Zaiem; Cem Subakan; Mirco Ravanelli

doi:10.5281/zenodo.8065754

Published June 22, 2023 | Version 0.0.1

Dataset Open

1. Concordia University
2. Télécom Paris
3. Université Laval

CL-MASR Dataset

This is the dataset used in the continual learning for multilingual ASR (CL-MASR) benchmark. It comprises speech recordings in 20 languages selected from the Common Voice 13 dataset. For each language, it includes up to 10 hours of training data, 1 hour of development data, and 1 hour of test data.

The CL-MASR benchmark platform is available in the SpeechBrain toolkit (see recipes/CommonVoice):
https://github.com/speechbrain/speechbrain

The original Common Voice 13 data are available at:
https://commonvoice.mozilla.org/en/datasets

List of Languages

- English (en)
- Chinese (zh-CN)
- German (de)
- Spanish (es)
- Russian (ru)
- French (fr)
- Portuguese (pt)
- Japanese (ja)
- Turkish (tr)
- Polish (pl)
- Kinyarwanda (rw)
- Esperanto (eo)
- Kabyle (kab)
- Luganda (lg)
- Meadow Mari (mhr)
- Central Kurdish (ckb)
- Abkhaz (ab)
- Kurmanji Kurdish (kmr)
- Frisian (fy-NL)
- Interlingua (ia)
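
As a convenience, the language list can be written as a lookup table. The sketch below is only an illustration: the names `LANGUAGES`, `MAX_HOURS_PER_SPLIT`, and `max_total_hours` are hypothetical and are not part of the dataset or of SpeechBrain. Because each language has *up to* 10/1/1 hours, the computed figure is an upper bound on the total audio duration, not the actual size.

```python
# Hypothetical lookup table built from the language list above
# (Common Voice locale code -> language name).
LANGUAGES = {
    "en": "English",
    "zh-CN": "Chinese",
    "de": "German",
    "es": "Spanish",
    "ru": "Russian",
    "fr": "French",
    "pt": "Portuguese",
    "ja": "Japanese",
    "tr": "Turkish",
    "pl": "Polish",
    "rw": "Kinyarwanda",
    "eo": "Esperanto",
    "kab": "Kabyle",
    "lg": "Luganda",
    "mhr": "Meadow Mari",
    "ckb": "Central Kurdish",
    "ab": "Abkhaz",
    "kmr": "Kurmanji Kurdish",
    "fy-NL": "Frisian",
    "ia": "Interlingua",
}

# Per-language caps stated in the description (hours).
MAX_HOURS_PER_SPLIT = {"train": 10, "dev": 1, "test": 1}


def max_total_hours(languages=LANGUAGES, per_split=MAX_HOURS_PER_SPLIT):
    """Upper bound on total audio hours across all languages and splits."""
    return len(languages) * sum(per_split.values())
```

With 20 languages and a cap of 12 hours per language, the dataset holds at most 240 hours of audio.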

Files (5.9 GB)

Name	Size
CL-MASR.tar.gz (md5:810aa5d24ddacd3bf4c870b2c577f9c4)	5.9 GB
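
The MD5 checksum listed for the archive can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_archive` are hypothetical, and the sketch assumes the archive has already been downloaded to a local path.

```python
import hashlib

# Checksum published alongside CL-MASR.tar.gz on this record.
EXPECTED_MD5 = "810aa5d24ddacd3bf4c870b2c577f9c4"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_archive("CL-MASR.tar.gz")` should return `True` for an intact copy stored in the current directory. Reading in 1 MiB chunks avoids loading the full 5.9 GB archive into memory.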

Additional details

References

M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. De Mori, and Y. Bengio. SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624, 2021.
R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber. Common Voice: A massively-multilingual speech corpus. In Twelfth Language Resources and Evaluation Conference, pages 4218–4222, 2020.
