Dataset Open Access

Code-Switching Speech Corpus

Khosravani, Abbas; Garner, Philip N.

Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="" xmlns:oai_dc="" xmlns:xsi="" xsi:schemaLocation="">
  <dc:creator>Khosravani, Abbas</dc:creator>
  <dc:creator>Garner, Philip N.</dc:creator>
  <dc:description>German-English Code-Switching speech dataset

We provide a means to resegment a subset of the German **Spoken Wikipedia Corpus** (SWC) with a particular focus on code-switching. The result is the German-English code-switching corpus, a 34h transcribed speech corpus of read Wikipedia articles that can be used as a benchmark for research on code-switching. The articles are read by a large and diverse group of people. The SWC is perhaps the largest freely available corpus of aligned German speech. It contains 1014 spoken articles read by more than 350 identified speakers, comprising 386h of speech. This corpus is available at

In SWC, since most of the articles are long, the recordings submitted by the volunteers are also long (∼54 min on average). These audio files are manually annotated at both the word level and the segment level in XML format. We use a language identification tool to detect code-switching in the transcriptions of the audio files with consecutive indices. To extract intra-sentential code-switching segments, we ensure that each detected code-switch is preceded and followed by German words or sentences. The final set consists of 34h of speech data and 12,437 code-switching segments (in the Kaldi ASR toolkit data format).
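
The intra-sentential filter described above can be illustrated with a minimal sketch. This is not the tooling used to build the corpus: the word list, `toy_lid` and `find_code_switches` are hypothetical names introduced here, and the word-list lookup merely stands in for a real language identification tool. The sketch keeps a run of English words only when a German word sits directly before and after it.

```python
# Toy stand-in for a real language identification tool (assumption):
# a tiny English word list; every other word is treated as German.
ENGLISH_WORDS = {"the", "world", "wide", "web", "open", "source", "software"}


def toy_lid(word):
    """Return "en" or "de" for a single word (illustrative only)."""
    return "en" if word.lower() in ENGLISH_WORDS else "de"


def find_code_switches(words, lid=toy_lid):
    """Return (start, end) index pairs, end exclusive, for runs of English
    words that have a German word directly before and after them."""
    labels = [lid(w) for w in words]
    n = len(labels)
    spans = []
    i = 0
    while i != n:
        if labels[i] != "en":
            i += 1
            continue
        # Extend the run over all consecutive English words.
        j = i
        while j != n and labels[j] == "en":
            j += 1
        # Intra-sentential only: skip runs touching either sentence edge.
        if i != 0 and j != n:
            spans.append((i, j))
        i = j
    return spans


sentence = "Das World Wide Web wurde am CERN entwickelt".split()
print(find_code_switches(sentence))  # prints [(1, 4)]
```

Runs at the very start or end of a word sequence are discarded in this sketch because they lack German context on one side; a real pipeline would also need to handle proper nouns and loanwords, which a word list cannot separate from genuine switches.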



  title={The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening},
  author={Baumann, Timo and K{\"o}hn, Arne and Hennig, Felix},
  journal={Language Resources and Evaluation},

  title={Learning word vectors for 157 languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1802.06893},
  </dc:description>

  <dc:subject>Automatic speech recognition</dc:subject>
  <dc:subject>Spoken Wikipedia Corpus</dc:subject>
  <dc:subject>Read speech corpus</dc:subject>
  <dc:title>Code-Switching Speech Corpus</dc:title>
</oai_dc:dc>