TunSwitch: Code-Switched Tunisian Arabic Speech Dataset

Zaiem, Salah; Ben Abdallah, Ahmed Amine

doi:10.5281/zenodo.8370566

Published September 22, 2023 | Version 0.1

Dataset Open

1. LTCI, Télécom Paris, Institut Polytechnique de Paris, France
2. Tunis Business School, Tunisia

This folder contains the data used to develop and test the Tunisian Arabic Automatic Speech Recognition model presented in the following paper:

A. A. Ben Abdallah*, A. Kabboudi, A. Kanoun, and S. Zaiem*, “Leveraging data collection and unsupervised learning for code-switched Tunisian Arabic automatic speech recognition”, submitted to ICASSP 2024, 2023.
* These two authors contributed equally.

It contains 4 zipped folders of audio data:
- TunSwitchCS.zip: annotated code-switched data.
- TunSwitchTO.zip: annotated Tunisian-only data.
- weakly_labeled_tn.zip: weakly labeled (or unlabeled) audio data. The audio may contain code-switching, but the current weak labels do not.
- test_wavs.zip: annotated test data, split into a code-switched part and a Tunisian-only part.

It also contains textual data used for language modelling, in TextData.zip. Finally, it contains a language-detailed annotation of TunSwitchCS in language_annotation.zip.

More details about the data are available in the paper. The current tables are in a SpeechBrain-friendly format; the values in the path column are specific to the original setup and have to be changed to match your local setting. Please use the provided train-dev-test splits if you work with this dataset.
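As an illustration of adapting the path column, here is a minimal sketch that rewrites it in a CSV table so every entry points to a local folder. It assumes the column is literally named `path` and that the audio files sit directly in one local directory; `rewrite_paths` and its arguments are hypothetical names, not part of the dataset or of SpeechBrain. If your extracted archives use subfolders, adapt the file-name logic accordingly.

```python
import csv
import os


def rewrite_paths(csv_in, csv_out, local_root, path_column="path"):
    """Point every entry of `path_column` at `local_root`, keeping the file name.

    Assumption: the column name and the flat folder layout are illustrative.
    Returns the number of rows written.
    """
    with open(csv_in, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)

    if not fieldnames or path_column not in fieldnames:
        raise KeyError(f"column {path_column!r} not found in {csv_in}")

    for row in rows:
        # Keep only the file name; handle both "/" and "\" separators.
        file_name = row[path_column].replace("\\", "/").rsplit("/", 1)[-1]
        row[path_column] = os.path.join(local_root, file_name)

    # All other columns are written back unchanged.
    with open(csv_out, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Writing to a separate output file leaves the provided tables, and therefore the provided splits, untouched.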

Models trained and tested on this dataset are available, as are Space demos.

If you use or refer to this dataset, please cite:

```
@misc{abdallah2023leveraging,
  title={Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition},
  author={Ahmed Amine Ben Abdallah and Ata Kabboudi and Amir Kanoun and Salah Zaiem},
  year={2023},
  eprint={2309.11327},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```

Files (18.0 GB)

Name	MD5	Size
language_annotation.zip	8db46856ae8ac1489e8752dc84a70bf4	312.1 kB
paper.pdf	d02f41b1a1a20e50ab91e9974675e444	180.4 kB
readme.txt	dccdd3dacfbdf501dcd0bad5cd95aeb1	1.4 kB
test_wavs.zip	7cdd5b100f2f1f4429b74a920ca338fa	284.1 MB
TunSwitchCS.zip	6bbb3c6014819d28c6cb08692d27a31e	1.6 GB
TunSwitchTO.zip	dce80d1ee335c602ae38637828b87e9e	92.3 MB
weakly_labeled_tn.zip	231674424efccfd4d3f19b2706bafd52	16.0 GB
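The MD5 checksums listed for each file can be used to check that a download completed correctly. The following is a minimal sketch, assuming all files were saved under their listed names in a single directory; `md5_of_file` and `find_mismatches` are hypothetical helper names.

```python
import hashlib
import os

# Checksums copied from the file listing.
EXPECTED_MD5 = {
    "language_annotation.zip": "8db46856ae8ac1489e8752dc84a70bf4",
    "paper.pdf": "d02f41b1a1a20e50ab91e9974675e444",
    "readme.txt": "dccdd3dacfbdf501dcd0bad5cd95aeb1",
    "test_wavs.zip": "7cdd5b100f2f1f4429b74a920ca338fa",
    "TunSwitchCS.zip": "6bbb3c6014819d28c6cb08692d27a31e",
    "TunSwitchTO.zip": "dce80d1ee335c602ae38637828b87e9e",
    "weakly_labeled_tn.zip": "231674424efccfd4d3f19b2706bafd52",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory, expected=EXPECTED_MD5):
    """Return the names of files that are missing or whose MD5 differs."""
    bad = []
    for name, checksum in expected.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or md5_of_file(path) != checksum:
            bad.append(name)
    return bad
```

Calling `find_mismatches("downloads")` on a complete and intact download folder should return an empty list; any name it returns is a candidate for re-downloading.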

