TunSwitch: Code-Switched Tunisian Arabic Speech Dataset

Zaiem, Salah; Ben Abdallah, Ahmed Amine

doi:10.5281/zenodo.8342762

Published September 14, 2023 | Version v1

Dataset Open

1. Telecom Paris
2. Tunis Business School

We developed a tool for collecting Tunisian dialect data that prompts users to record themselves reading provided phrases. The sentences were sourced from Tunisiya and were consequently removed from the LM training corpus. In total, 89 people participated, yielding 2,631 distinct phrases. This set is called TunSwitch TO, "TO" standing for Tunisian Only, since these sentences contain no non-Tunisian words.

To address the limited availability of paired text-speech Tunisian datasets with code-switching, we built a corpus through meticulous manual annotation. French and English words are enclosed within "<>" tags wherever they occur, while Tunisian words are left untagged. Although these tags were not used in the proposed models, they make it possible to compute language-usage statistics and may be useful for future approaches to handling code-switching. The resulting set is released as TunSwitch CS, "CS" standing for Code-Switched.
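As an illustration of the language-usage statistics these tags make possible, here is a minimal sketch. It assumes each foreign span appears inline as `<...>` in a plain-text transcript. The exact transcript layout in TunSwitch CS is not described here, and the function name `code_switch_stats` is hypothetical.

```python
import re

# Assumption: French/English spans are written inline as <...>;
# everything outside the angle brackets is treated as Tunisian.
TAG_RE = re.compile(r"<([^<>]*)>")


def code_switch_stats(transcript: str) -> dict:
    """Count tagged (foreign) and untagged (Tunisian) words in one transcript."""
    foreign_words = []
    for span in TAG_RE.findall(transcript):
        # A tagged span may hold several whitespace-separated words.
        foreign_words.extend(span.split())

    # Remove the tagged spans, then count what is left as Tunisian words.
    tunisian_words = TAG_RE.sub(" ", transcript).split()

    total = len(foreign_words) + len(tunisian_words)
    return {
        "tunisian": len(tunisian_words),
        "foreign": len(foreign_words),
        "foreign_ratio": len(foreign_words) / total if total else 0.0,
    }


if __name__ == "__main__":
    # Transliterated toy example, not taken from the dataset.
    print(code_switch_stats("nheb <le cinema> barcha"))
```

Summing these counts over all transcripts would give corpus-level proportions. Whitespace tokenization is a simplification and may need adjusting for the actual transcripts.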

The TunSwitch CS samples come from radio shows and podcasts covering diverse topics and a large number of unique speakers. The audio is first segmented into chunks, using the WebRTC-VAD algorithm for silence detection so that word integrity is preserved. We then use a Pyannote overlap detection model to remove overlapping speech sections. Finally, a music detection model eliminates chunks containing music, which could degrade ASR model accuracy.
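The segmentation step can be pictured with the following sketch. It is not the authors' implementation. It assumes per-frame speech/non-speech decisions are already available (for example from a VAD such as WebRTC-VAD) and only shows how cutting at sufficiently long silences keeps words intact. The function name, the 30 ms frame length, and the silence threshold are illustrative assumptions.

```python
from typing import List, Tuple


def chunk_on_silence(
    speech_flags: List[bool],
    frame_ms: int = 30,            # assumed frame duration
    min_silence_frames: int = 10,  # assumed silence length that ends a chunk
) -> List[Tuple[int, int]]:
    """Group speech frames into (start_ms, end_ms) chunks.

    A chunk is closed only after `min_silence_frames` consecutive
    non-speech frames, so short pauses inside a phrase do not split it.
    """
    chunks = []
    start = None        # index of the first speech frame of the open chunk
    last_speech = None  # index of the most recent speech frame

    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= min_silence_frames:
            # Long enough silence: close the chunk at the last speech frame.
            chunks.append((start * frame_ms, (last_speech + 1) * frame_ms))
            start = None

    if start is not None:
        # Audio ended while a chunk was still open.
        chunks.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return chunks
```

In a full pipeline, the resulting chunks would then pass through the overlap and music filters described above. Those steps rely on trained models and are not sketched here.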

Files (2.0 GB)

Name	MD5	Size
paper.pdf	d02f41b1a1a20e50ab91e9974675e444	180.4 kB
test_wavs.zip	7cdd5b100f2f1f4429b74a920ca338fa	284.1 MB
TextData.zip	eb23d02b843b3b607670b5bba1887add	25.0 MB
TunSwitchCS.zip	6bbb3c6014819d28c6cb08692d27a31e	1.6 GB
TunSwitchTO.zip	dce80d1ee335c602ae38637828b87e9e	92.3 MB
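
The MD5 checksums listed above can be used to verify downloads. The sketch below uses only the Python standard library. The helper names `md5_of_file` and `verify_download` are hypothetical, and the files are assumed to keep their original names on disk.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "paper.pdf": "d02f41b1a1a20e50ab91e9974675e444",
    "test_wavs.zip": "7cdd5b100f2f1f4429b74a920ca338fa",
    "TextData.zip": "eb23d02b843b3b607670b5bba1887add",
    "TunSwitchCS.zip": "6bbb3c6014819d28c6cb08692d27a31e",
    "TunSwitchTO.zip": "dce80d1ee335c602ae38637828b87e9e",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 matches the listed value for its name."""
    name = Path(path).name
    if name not in expected:
        raise KeyError(f"No checksum listed for {name!r}")
    return md5_of_file(path) == expected[name]
```

For example, `verify_download("TunSwitchCS.zip")` returns `True` only if the archive in the current directory matches the listed checksum. Reading in blocks avoids loading the 1.6 GB archive into memory at once.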

	All versions	This version
Views	1,747	1,727
Downloads	1,587	1,583
Data volume	387.0 GB	385.2 GB
