TunSwitch: Code-Switched Tunisian Arabic Speech Dataset
Description
We developed a tool for collecting Tunisian dialect data that prompts users to record themselves reading provided phrases. The sentences were sourced from Tunisiya and were consequently removed from the LM training corpus. In total, 89 people participated, yielding 2,631 distinct phrases. This set is called TunSwitch TO, "TO" standing for Tunisian Only, as these sentences contain no non-Tunisian words.
To address the limited availability of paired text-speech Tunisian datasets with code-switching, we built a corpus through meticulous manual annotation. French and English words are enclosed within "<>" tags wherever they occur, while Tunisian words are left untagged. Although these tags were not used in the proposed models, they make it possible to compute language-usage statistics and may be useful for future approaches to handling code-switching. The resulting set is released as TunSwitch CS, "CS" standing for Code-Switched.
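As an illustration of the language-usage statistics the tags make possible, the following minimal sketch counts tagged and untagged words in one transcript. It assumes that a foreign span is written literally as `<...>` and may contain several words; the exact tag layout in the released files may differ. The function name `code_switch_stats` and the sample sentence are hypothetical and are not taken from the dataset.

```python
import re

# Assumption: French/English spans appear as "<...>" with no nesting.
FOREIGN_SPAN = re.compile(r"<([^<>]*)>")


def code_switch_stats(transcript: str) -> dict:
    """Count tagged (French/English) and untagged (Tunisian) words."""
    foreign_words = []
    for span in FOREIGN_SPAN.findall(transcript):
        # A single tag may enclose more than one word.
        foreign_words.extend(span.split())

    # Whatever remains after removing the tagged spans is untagged text.
    untagged_text = FOREIGN_SPAN.sub(" ", transcript)
    tunisian_words = untagged_text.split()

    total = len(foreign_words) + len(tunisian_words)
    return {
        "foreign": len(foreign_words),
        "tunisian": len(tunisian_words),
        "foreign_ratio": len(foreign_words) / total if total else 0.0,
    }


# Hypothetical transcript, written only to illustrate the assumed tag format.
stats = code_switch_stats("نحب <le weekend> برشا")
print(stats)
```

Summing these counts over all transcripts would give corpus-level proportions of Tunisian versus French and English words.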
The TunSwitch CS samples come from a set of radio shows and podcasts covering diverse topics and a large number of unique speakers. The audio is first segmented into chunks using the WebRTC-VAD algorithm for silence detection, prioritizing word integrity. A Pyannote overlap detection model is then used to remove sections with overlapping speech. Finally, a music detection model eliminates chunks containing music, which could degrade ASR model accuracy.
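The silence-based segmentation step can be sketched as follows. This is not the authors' pipeline, only a minimal illustration of cutting at sufficiently long silences so that chunk boundaries do not fall inside words. It takes one speech/non-speech decision per fixed-length frame, for example as returned by `Vad.is_speech` from the `webrtcvad` Python package, and groups the frames into chunks. The function name `speech_chunks`, the 30 ms frame length, and the silence threshold are assumptions chosen for the example.

```python
from typing import Iterable, List, Tuple


def speech_chunks(
    flags: Iterable[bool],
    frame_ms: int = 30,           # assumed frame length in milliseconds
    min_silence_frames: int = 3,  # assumed silence run that closes a chunk
) -> List[Tuple[int, int]]:
    """Group per-frame voice-activity flags into (start_ms, end_ms) chunks.

    A chunk is closed only after `min_silence_frames` consecutive
    non-speech frames, so short pauses inside a word or phrase do not
    split it.
    """
    chunks = []
    start = None        # index of the first frame of the open chunk
    last_speech = None  # index of the most recent speech frame

    for i, is_speech in enumerate(flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= min_silence_frames:
            # Long enough silence: close the chunk at the last speech frame.
            chunks.append((start * frame_ms, (last_speech + 1) * frame_ms))
            start = None

    if start is not None:
        # The audio ended while a chunk was still open.
        chunks.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return chunks


# Hypothetical flags: 1 = speech, 0 = silence, one value per 30 ms frame.
example_flags = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1]
print(speech_chunks(example_flags))  # [(30, 150), (240, 300)]
```

In the example, the one-frame pause at index 3 is kept inside the first chunk, while the three-frame silence that follows separates it from the second chunk.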
Files
paper.pdf
Total size: 2.0 GB
Checksum | Size |
---|---|
md5:d02f41b1a1a20e50ab91e9974675e444 | 180.4 kB |
md5:7cdd5b100f2f1f4429b74a920ca338fa | 284.1 MB |
md5:eb23d02b843b3b607670b5bba1887add | 25.0 MB |
md5:6bbb3c6014819d28c6cb08692d27a31e | 1.6 GB |
md5:dce80d1ee335c602ae38637828b87e9e | 92.3 MB |
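The MD5 checksums listed above can be used to check that a download completed without corruption. The sketch below uses only the Python standard library; the helper names (`read_chunks`, `md5_hex`, `matches_listing`) and the file path in the usage comment are hypothetical, since the file names are not shown in this listing.

```python
import hashlib
from typing import Iterable, Iterator


def read_chunks(path: str, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield a file's contents in fixed-size blocks to keep memory use low."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def md5_hex(chunks: Iterable[bytes]) -> str:
    """Return the hexadecimal MD5 digest of a sequence of byte blocks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def matches_listing(listed: str, chunks: Iterable[bytes]) -> bool:
    """Compare data against a checksum written as 'md5:<hex digest>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_hex(chunks) == expected


# Usage with a hypothetical local path for one of the downloaded files:
# ok = matches_listing("md5:d02f41b1a1a20e50ab91e9974675e444",
#                      read_chunks("downloads/some_file"))
```

Reading in blocks matters mainly for the larger archives, which would otherwise be loaded into memory in full.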