Dataset Open Access

Code-Switching Speech Corpus

Khosravani, Abbas; Garner, Philip N.


Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Khosravani, Abbas</dc:creator>
  <dc:creator>Garner, Philip N.</dc:creator>
  <dc:date>2021-01-12</dc:date>
  <dc:description>German-English Code-Switching speech dataset

We provide the means to resegment a subset of the German **Spoken Wikipedia Corpus** (SWC) with a particular focus on code-switching.  The result is the German-English code-switching corpus, a 34 h transcribed speech corpus of read Wikipedia articles that can serve as a benchmark for code-switching research.  The articles are read by a large and diverse group of people.  The SWC is perhaps the largest freely available corpus of aligned speech for German: it contains 1014 spoken articles, read by more than 350 identified speakers, comprising 386 h of speech.  The SWC is available at http://nats.gitlab.io/swc.

Because most articles in the SWC are long, the recordings submitted by the volunteers are also long (about 54 min on average).  These audio files are manually annotated at the word level and at the segment level, in XML format.  We use a language identification tool to detect code-switching in the transcriptions of the audio files with consecutive indices.  To extract intra-sentential code-switching segments, we require that each detected code-switch be both preceded and followed by German words or sentences.  The final set consists of 34 h of speech data and 12,437 code-switching segments, provided in the Kaldi ASR toolkit data format.

 

Citation

@article{baumann2019spoken,
  title={The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening},
  author={Baumann, Timo and K{\"o}hn, Arne and Hennig, Felix},
  journal={Language Resources and Evaluation},
  volume={53},
  number={2},
  pages={303--329},
  year={2019},
  publisher={Springer}
}

@article{grave2018learning,
  title={Learning word vectors for 157 languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1802.06893},
  year={2018}
}

 </dc:description>
  <dc:identifier>https://zenodo.org/record/4434251</dc:identifier>
  <dc:identifier>10.34777/bkr1-ay03</dc:identifier>
  <dc:identifier>oai:zenodo.org:4434251</dc:identifier>
  <dc:language>deu</dc:language>
  <dc:relation>url:https://zenodo.org/communities/idiap</dc:relation>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>https://creativecommons.org/licenses/by-sa/3.0/legalcode</dc:rights>
  <dc:subject>Automatic speech recognition</dc:subject>
  <dc:subject>German</dc:subject>
  <dc:subject>Spoken Wikipedia Corpus</dc:subject>
  <dc:subject>Read speech corpus</dc:subject>
  <dc:title>Code-Switching Speech Corpus</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
  <dc:type>dataset</dc:type>
</oai_dc:dc>
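
The intra-sentential filter mentioned in the description (a detected code-switch must be preceded and followed by German) can be illustrated with a minimal sketch. This is not the authors' script: the function name `intra_sentential_spans` and the label strings `"de"` and `"en"` are hypothetical, and the per-unit language labels are assumed to come from some language identification step that is not shown here.

```python
from typing import List, Tuple


def intra_sentential_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Find English runs that are enclosed by German on both sides.

    `labels` holds one language label per consecutive unit (word or
    sentence). Returns (start, end) index pairs with `end` exclusive.
    The labels "de" and "en" are assumptions made for this sketch.
    """
    spans = []
    i = 0
    n = len(labels)
    while i < n:
        if labels[i] != "en":
            i += 1
            continue
        # Start of an English run: advance to one past its last unit.
        start = i
        while i < n and labels[i] == "en":
            i += 1
        # Keep the run only if German immediately precedes and follows it.
        preceded = start > 0 and labels[start - 1] == "de"
        followed = i < n and labels[i] == "de"
        if preceded and followed:
            spans.append((start, i))
    return spans


# The trailing English unit is dropped because nothing German follows it.
print(intra_sentential_spans(["de", "de", "en", "en", "de", "en"]))  # [(2, 4)]
```

A run at the very start or end of a recording is discarded under this rule, which matches the stated requirement that German appear on both sides.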
Views 262
Downloads 108
Data volume 112.7 MB
Unique views 239
Unique downloads 85
