Published September 14, 2023 | Version v1
Dataset Open

TunSwitch: Code-Switched Tunisian Arabic Speech Dataset

  • 1. Telecom Paris
  • 2. Tunis Business School

Description

We developed a tool for collecting Tunisian dialect data that prompts users to record themselves reading provided phrases. The sentences were sourced from Tunisiya and were therefore removed from the language model (LM) training corpus. In total, 89 people participated, yielding recordings of 2631 distinct phrases. This set is called TunSwitch TO ("TO" standing for Tunisian Only), since these sentences contain no non-Tunisian words.

In response to the limited availability of paired text-speech Tunisian datasets with code-switching, we built a corpus through meticulous manual annotation. French and English words are enclosed within "<>" tags wherever they occur, while Tunisian words are left untagged. Although these tags were not used in the proposed models, they make it possible to compute language-usage statistics and may be useful for future approaches to handling code-switching. The resulting set is released as TunSwitch CS, "CS" standing for Code-Switched.
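As an illustration of the language-usage statistics the "<>" tags make possible, the following is a minimal sketch in Python. It assumes a transcript is a plain string in which each tag may enclose one or more whitespace-separated words; the function name `code_switch_stats` and the sample transcript are hypothetical and not part of the dataset.

```python
import re

# Assumption: French/English spans appear as "<...>"; all other words are Tunisian.
TAG_RE = re.compile(r"<([^<>]*)>")

def code_switch_stats(transcript: str) -> dict:
    """Count tagged (French/English) and untagged (Tunisian) words in one transcript."""
    tagged_words = []
    for span in TAG_RE.findall(transcript):
        # A tag may wrap several words, so split each span on whitespace.
        tagged_words.extend(span.split())
    # Remove the tagged spans, then count what is left as Tunisian words.
    untagged_words = TAG_RE.sub(" ", transcript).split()
    total = len(tagged_words) + len(untagged_words)
    ratio = len(tagged_words) / total if total else 0.0
    return {
        "tagged": len(tagged_words),
        "untagged": len(untagged_words),
        "tagged_ratio": ratio,
    }

# Hypothetical transcript: t1..t3 stand in for Tunisian words.
print(code_switch_stats("t1 <bonjour> t2 t3 <thank you>"))
```

Summing these counts over all transcripts would give a corpus-level estimate of how much of the speech is code-switched.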

The TunSwitch CS samples come from radio shows and podcasts covering diverse topics and a large number of unique speakers. The audio is first segmented into chunks using the WebRTC-VAD algorithm for silence detection, prioritizing word integrity. A Pyannote overlap detection model is then used to remove overlapping speech sections. Finally, a music detection model eliminates chunks containing music, which could degrade ASR model accuracy.
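The silence-based segmentation step can be sketched as follows. This is not the authors' implementation: it assumes the voice activity detector has already produced one speech/non-speech flag per fixed-length frame (for example from `webrtcvad.Vad.is_speech`), and the function name `split_on_silence`, the 30 ms frame length, and the minimum silence length are illustrative choices. Cutting only inside sufficiently long silence runs is one way to avoid splitting words.

```python
from typing import List, Tuple

def split_on_silence(
    speech_flags: List[bool],
    frame_ms: int = 30,            # assumed frame length
    min_silence_frames: int = 10,  # assumed silence needed before cutting
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) chunks, cutting only inside long silence runs."""
    chunks = []
    start = None        # first speech frame of the current chunk
    last_speech = None  # most recent speech frame seen
    silence_run = 0     # consecutive non-speech frames since last speech
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # Silence is long enough: close the chunk at the last speech frame.
                chunks.append((start * frame_ms, (last_speech + 1) * frame_ms))
                start = None
                silence_run = 0
    if start is not None:
        # Flush a chunk that runs to the end of the audio.
        chunks.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return chunks

flags = [False, True, True, False, True, False, False, False, True]
print(split_on_silence(flags, frame_ms=30, min_silence_frames=2))
```

Short pauses (here, a single non-speech frame) stay inside a chunk, while longer ones become chunk boundaries. The overlap and music filters described above would then be applied to the resulting chunks.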

Files

paper.pdf

Files (2.0 GB)

Checksum                                Size
md5:d02f41b1a1a20e50ab91e9974675e444    180.4 kB
md5:7cdd5b100f2f1f4429b74a920ca338fa    284.1 MB
md5:eb23d02b843b3b607670b5bba1887add    25.0 MB
md5:6bbb3c6014819d28c6cb08692d27a31e    1.6 GB
md5:dce80d1ee335c602ae38637828b87e9e    92.3 MB
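
A downloaded file can be checked against the MD5 values listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listing` and the example path are hypothetical, and the file is read in blocks so that the larger archives do not need to fit in memory.

```python
import hashlib

def md5_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it block by block."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def matches_listing(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:0123...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected

# Example usage (the local path is a placeholder for a downloaded file):
# matches_listing("path/to/downloaded_file", "md5:d02f41b1a1a20e50ab91e9974675e444")
```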