Published September 22, 2023 | Version 0.1
Dataset Open

TunSwitch: Code-Switched Tunisian Arabic Speech Dataset

  • 1. LTCI, Télécom Paris, Institut Polytechnique de Paris, France
  • 2. Tunis Business School, Tunisia

Description

This folder contains the data used to develop and test the Tunisian Arabic Automatic Speech Recognition model presented in the following paper:

A. A. Ben Abdallah*, A. Kabboudi, A. Kanoun, and S. Zaiem*, “Leveraging data collection and unsupervised learning for code-switched Tunisian Arabic automatic speech recognition”, submitted to ICASSP 2024, 2023. (* These two authors contributed equally.)


It contains 4 zipped folders of audio data:
- TunSwitchCS.zip: annotated code-switched data.
- TunSwitchTO.zip: annotated Tunisian-only data.
- weakly_labeled_tn.zip: weakly labeled (or unlabeled) audio data. The audio may contain code-switching, but the current weak labels do not.
- test_wavs.zip: annotated test data, split into a code-switched part and a Tunisian-only part.


It also contains textual data used for language modelling, in TextData.zip. Finally, it contains a language-detailed annotation of TunSwitchCS in the language_annotation.zip file.
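As a convenience, the archives listed above can be unpacked with a short script. This is a minimal sketch using only the Python standard library; the function name `extract_archives` and the choice to extract each archive into a sub-folder named after it are assumptions made for illustration, not part of the dataset.

```python
import zipfile
from pathlib import Path

# Archive names as listed in the description above.
ARCHIVES = [
    "TunSwitchCS.zip",
    "TunSwitchTO.zip",
    "weakly_labeled_tn.zip",
    "test_wavs.zip",
    "TextData.zip",
    "language_annotation.zip",
]


def extract_archives(download_dir, output_dir, names=ARCHIVES):
    """Extract every listed archive found in download_dir.

    Each archive goes to output_dir/<archive name without .zip>
    (an illustrative layout, not one prescribed by the dataset).
    Archives that are not present are skipped.
    Returns the names of the archives that were extracted.
    """
    download_dir = Path(download_dir)
    output_dir = Path(output_dir)
    extracted = []
    for name in names:
        archive = download_dir / name
        if not archive.is_file():
            continue  # not downloaded yet, skip it
        target = output_dir / archive.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(name)
    return extracted
```

For example, `extract_archives("downloads", "data")` would unpack whichever of the six archives are present in a hypothetical `downloads` folder.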

More details about the data are available in the paper. The current tables are in a SpeechBrain-friendly format; the values in the path column are not meaningful as provided and have to be changed according to your local setup. Please use the provided train-dev-test splits if you work with this dataset.
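One possible way to update the path column is sketched below. It assumes the tables are CSV files with a header row and a column literally named `path`, and that each audio file can be located by its file name alone inside a single local folder; if the extracted audio is spread over sub-folders, the mapping would need to be adapted. The function name `rewrite_paths` is hypothetical.

```python
import csv
import os


def rewrite_paths(csv_in, csv_out, audio_root, column="path"):
    """Point every entry of `column` at audio_root/<original file name>.

    Assumption: the table is a CSV with a header containing `column`.
    All other columns are copied unchanged.
    Returns the number of rows written.
    """
    with open(csv_in, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        if fieldnames is None or column not in fieldnames:
            raise ValueError(f"no '{column}' column in {csv_in}")
        rows = list(reader)

    for row in rows:
        # Keep only the file name, whatever directory the table pointed to.
        filename = os.path.basename(row[column].replace("\\", "/"))
        row[column] = os.path.join(audio_root, filename)

    with open(csv_out, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Writing to a separate output file keeps the provided tables, and therefore the provided train-dev-test splits, untouched.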

Models trained and tested on this dataset, as well as Space demos, are also available.

If you use or refer to this dataset, please cite : 

```
@misc{abdallah2023leveraging,
      title={Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition},
      author={Ahmed Amine Ben Abdallah and Ata Kabboudi and Amir Kanoun and Salah Zaiem},
      year={2023},
      eprint={2309.11327},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```


 

Files

Files (18.0 GB)

Checksum (MD5), Size
md5:8db46856ae8ac1489e8752dc84a70bf4, 312.1 kB
md5:d02f41b1a1a20e50ab91e9974675e444, 180.4 kB
md5:dccdd3dacfbdf501dcd0bad5cd95aeb1, 1.4 kB
md5:7cdd5b100f2f1f4429b74a920ca338fa, 284.1 MB
md5:6bbb3c6014819d28c6cb08692d27a31e, 1.6 GB
md5:dce80d1ee335c602ae38637828b87e9e, 92.3 MB
md5:231674424efccfd4d3f19b2706bafd52, 16.0 GB
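
The MD5 checksums listed above can be used to check that a download completed without corruption. The sketch below uses Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_md5` are hypothetical. The file is read in chunks so that the largest archive does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 with an expected value.

    `expected` may be given with or without the leading "md5:" prefix.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call would look like `verify_md5("TunSwitchCS.zip", "md5:...")`, with the expected value copied from the record entry for that file; the file names are not shown next to the checksums in this listing, so the pairing has to be taken from the original record.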