Khosravani, Abbas
Garner, Philip N.
2021-01-12
<p><strong>German-English Code-Switching speech dataset</strong></p>
<p>We provide means to resegment a subset of the German <strong>Spoken Wikipedia Corpus</strong> (SWC) with a particular focus on code-switching. The result is the German-English code-switching corpus, a 34h transcribed speech corpus of read Wikipedia articles that can be used as a benchmark for research on code-switching. The articles are read by a large and diverse group of people. The SWC is perhaps the largest corpus of freely available aligned speech for German. It contains 1014 spoken articles read by more than 350 identified speakers, comprising 386h of speech. The SWC is available at http://nats.gitlab.io/swc.</p>
<p>Since most SWC articles are long, the recordings submitted by the volunteers are also long (∼54min on average). These audio files are manually annotated at both word level and segment level in XML format. We use a language identification tool to detect code-switching in the transcriptions of the audio files with consecutive indices. To extract intra-sentential code-switching segments, we ensure that each detected code-switch is preceded and followed by German words or sentences. The final set consists of 34h of speech data and 12,437 code-switching segments (in Kaldi ASR toolkit data format).</p>
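<p>The intra-sentential condition described above can be illustrated with a minimal sketch. It assumes that each transcribed word already carries a language label produced by some language identification tool. The function name <code>find_intra_sentential_switches</code>, the labels <code>"de"</code> and <code>"en"</code>, and the example sentence are hypothetical and are not taken from the released resegmentation scripts.</p>

```python
from typing import List, Tuple


def find_intra_sentential_switches(
    labels: List[str], matrix: str = "de", embedded: str = "en"
) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of embedded-language runs
    that have a matrix-language word directly before and directly after."""
    spans = []
    i, n = 0, len(labels)
    while i < n:
        if labels[i] != embedded:
            i += 1
            continue
        # Extend j to the end of the current run of embedded-language words.
        j = i
        while j < n and labels[j] == embedded:
            j += 1
        # Keep the run only if it is enclosed by matrix-language words,
        # i.e. the switch happens inside the sentence, not at its edges.
        if i > 0 and j < n and labels[i - 1] == matrix and labels[j] == matrix:
            spans.append((i, j))
        i = j
    return spans


# Hypothetical example: per-word labels are assumed to come from a
# language identification tool run on the transcription.
words = ["Das", "ist", "ein", "Machine", "Learning", "Verfahren"]
labels = ["de", "de", "de", "en", "en", "de"]
for start, end in find_intra_sentential_switches(labels):
    print(" ".join(words[start:end]))
```

<p>In this sketch, English runs at the very start or end of a word sequence are discarded, since they are not enclosed by German words on both sides.</p>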
<p><strong>Citation</strong></p>
<p>@article{baumann2019spoken,<br>
title={The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening},<br>
author={Baumann, Timo and K{\"o}hn, Arne and Hennig, Felix},<br>
journal={Language Resources and Evaluation},<br>
volume={53},<br>
number={2},<br>
pages={303--329},<br>
year={2019},<br>
publisher={Springer}<br>
}</p>
<p>@article{grave2018learning,<br>
title={Learning word vectors for 157 languages},<br>
author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},<br>
journal={arXiv preprint arXiv:1802.06893},<br>
year={2018}<br>
}</p>
https://doi.org/10.34777/bkr1-ay03
oai:zenodo.org:4434251
deu
Zenodo
https://zenodo.org/communities/idiap
info:eu-repo/semantics/openAccess
Creative Commons Attribution Share Alike 3.0 Unported
https://creativecommons.org/licenses/by-sa/3.0/legalcode
Automatic speech recognition
German
Spoken Wikipedia Corpus
Read speech corpus
Code-Switching Speech Corpus
info:eu-repo/semantics/other