Dataset Open Access

Code-Switching Speech Corpus

Khosravani, Abbas; Garner, Philip N.


JSON-LD (schema.org) Export

{
  "inLanguage": {
    "alternateName": "deu", 
    "@type": "Language", 
    "name": "German"
  }, 
  "description": "<p><strong>German-English Code-Switching speech dataset</strong></p>\n\n<p>We provide a means to resegment a subset of the German <strong>Spoken Wikipedia Corpus</strong> (SWC), enabling a particular focus on code-switching.&nbsp; This results in the German-English code-switching corpus, a 34h transcribed speech corpus of read Wikipedia articles, which can be used as a benchmark for research on code-switching.&nbsp; The articles are read by a large and diverse group of people. The SWC is perhaps the largest corpus of freely available aligned speech for German.&nbsp; It contains 1014 spoken articles read by more than 350 identified speakers comprising 386h of speech. This corpus is available at http://nats.gitlab.io/swc.</p>\n\n<p>In SWC, since most of the articles are long, the recordings submitted by the volunteers are also long (&sim;54min on average).&nbsp; These audio files are manually annotated at both the word level and the segment level in XML format.&nbsp; We use a language identification tool to detect code-switching in the transcription of the audio files with consecutive indices. To extract intra-sentential code-switching segments, we ensure that the detected code-switching is preceded and followed by German words or sentences. The final set consists of 34h of speech data and 12,437 code-switching segments (in Kaldi ASR toolkit data format).</p>\n\n<p>&nbsp;</p>\n\n<p><strong>Citation</strong></p>\n\n<p>@article{baumann2019spoken,<br>\n&nbsp; title={The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening},<br>\n&nbsp; author={Baumann, Timo and K{\\&quot;o}hn, Arne and Hennig, Felix},<br>\n&nbsp; journal={Language Resources and Evaluation},<br>\n&nbsp; volume={53},<br>\n&nbsp; number={2},<br>\n&nbsp; pages={303--329},<br>\n&nbsp; year={2019},<br>\n&nbsp; publisher={Springer}<br>\n}</p>\n\n<p>@article{grave2018learning,<br>\n&nbsp; title={Learning word vectors for 157 languages},<br>\n&nbsp; author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},<br>\n&nbsp; journal={arXiv preprint arXiv:1802.06893},<br>\n&nbsp; year={2018}<br>\n}</p>\n\n<p>&nbsp;</p>", 
  "license": "https://creativecommons.org/licenses/by-sa/3.0/legalcode", 
  "creator": [
    {
      "affiliation": "Idiap Research Institute", 
      "@id": "https://orcid.org/0000-0002-7108-2475", 
      "@type": "Person", 
      "name": "Khosravani, Abbas"
    }, 
    {
      "affiliation": "Idiap Research Institute", 
      "@id": "https://orcid.org/0000-0002-0814-1348", 
      "@type": "Person", 
      "name": "Garner, Philip N."
    }
  ], 
  "url": "https://zenodo.org/record/4434251", 
  "datePublished": "2021-01-12", 
  "keywords": [
    "Automatic Speech recognition", 
    "German", 
    "Spoken Wikipedia Corpus", 
    "Read speech corpus"
  ], 
  "@context": "https://schema.org/", 
  "distribution": [
    {
      "contentUrl": "https://zenodo.org/api/files/15728244-b85b-43c1-8120-3d604543f623/code-switching.tar.gz", 
      "encodingFormat": "gz", 
      "@type": "DataDownload"
    }, 
    {
      "contentUrl": "https://zenodo.org/api/files/15728244-b85b-43c1-8120-3d604543f623/MD5SUM.txt", 
      "encodingFormat": "txt", 
      "@type": "DataDownload"
    }
  ], 
  "identifier": "https://doi.org/10.34777/bkr1-ay03", 
  "@id": "https://doi.org/10.34777/bkr1-ay03", 
  "@type": "Dataset", 
  "name": "Code-Switching Speech Corpus"
}
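
The description above states that intra-sentential code-switching segments are kept only when the detected switch is both preceded and followed by German. The following is a minimal sketch of that filtering rule, not the authors' actual pipeline. It assumes each token already carries a language label from some language identification tool; the function name `intra_sentential_spans` and the label values `"de"` and `"en"` are hypothetical choices for illustration.

```python
from typing import List, Tuple


def intra_sentential_spans(
    labels: List[str], matrix: str = "de", embedded: str = "en"
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every run of
    `embedded` tokens that has a `matrix` token directly before and after it.

    `labels` is an assumed per-token language label sequence; how the labels
    are produced is outside the scope of this sketch.
    """
    spans: List[Tuple[int, int]] = []
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] != embedded:
            i += 1
            continue
        # Extend j to the first token after the current embedded-language run.
        j = i
        while j < n and labels[j] == embedded:
            j += 1
        # Keep the run only if it is surrounded by matrix-language tokens.
        if i > 0 and j < n and labels[i - 1] == matrix and labels[j] == matrix:
            spans.append((i, j))
        i = j
    return spans


if __name__ == "__main__":
    # Hypothetical labels for a five-token German sentence with two English tokens.
    print(intra_sentential_spans(["de", "de", "en", "en", "de"]))
```

Runs at the very start or end of the sequence are discarded, because they lack German context on one side and so are not intra-sentential under this rule.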
Views 267
Downloads 112
Data volume 117.0 MB
Unique views 244
Unique downloads 88
