{
  "DOI": "10.34777/bkr1-ay03",
  "abstract": "German-English Code-Switching speech dataset\n\n\nWe provide means to resegment a subset of the German **Spoken Wikipedia Corpus** (SWC), enabling a particular focus on code-switching. This results in the German-English code-switching corpus, a 34h transcribed speech corpus of read Wikipedia articles that can be used as a benchmark for research on code-switching. The articles are read by a large and diverse group of people. The SWC is perhaps the largest corpus of freely available aligned speech for German. It contains 1014 spoken articles read by more than 350 identified speakers, comprising 386h of speech. The SWC is available at http://nats.gitlab.io/swc.\n\n\nBecause most SWC articles are long, the recordings submitted by the volunteers are also long (\u223c54min on average). These audio files are manually annotated at the word level and at the segment level in XML format. We use a language identification tool to detect code-switching in the transcription of the audio files with consecutive indices. To extract intra-sentential code-switching segments, we ensure that the detected code-switching is preceded and followed by German words or sentences. 
The final set consists of 34h of speech data and 12,437 code-switching segments (in Kaldi ASR toolkit data format).\n\n\nCitation\n\n\n@article{baumann2019spoken,\n  title={The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening},\n  author={Baumann, Timo and K{\\\"o}hn, Arne and Hennig, Felix},\n  journal={Language Resources and Evaluation},\n  volume={53},\n  number={2},\n  pages={303--329},\n  year={2019},\n  publisher={Springer}\n}\n\n\n@article{grave2018learning,\n  title={Learning word vectors for 157 languages},\n  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},\n  journal={arXiv preprint arXiv:1802.06893},\n  year={2018}\n}",
  "author": [
    {
      "family": "Khosravani",
      "given": "Abbas"
    },
    {
      "family": "Garner",
      "given": "Philip N."
    }
  ],
  "id": "4434251",
  "issued": {
    "date-parts": [
      [
        "2021",
        "01",
        "12"
      ]
    ]
  },
  "language": "deu",
  "publisher": "Zenodo",
  "title": "Code-Switching Speech Corpus",
  "type": "dataset"
}