Dataset Open Access

Code-Switching Speech Corpus

Khosravani, Abbas; Garner, Philip N.


JSON Export

{
  "files": [
    {
      "links": {
        "self": "https://zenodo.org/api/files/15728244-b85b-43c1-8120-3d604543f623/code-switching.tar.gz"
      }, 
      "checksum": "md5:ee84d7577cb7bcb080fae2e0a903b700", 
      "bucket": "15728244-b85b-43c1-8120-3d604543f623", 
      "key": "code-switching.tar.gz", 
      "type": "gz", 
      "size": 1426652
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/15728244-b85b-43c1-8120-3d604543f623/MD5SUM.txt"
      }, 
      "checksum": "md5:7007facb8694b47b27097fd0b73ce12f", 
      "bucket": "15728244-b85b-43c1-8120-3d604543f623", 
      "key": "MD5SUM.txt", 
      "type": "txt", 
      "size": 56
    }
  ], 
  "owners": [
    52256
  ], 
  "doi": "10.34777/bkr1-ay03", 
  "stats": {
    "version_unique_downloads": 86.0, 
    "unique_views": 242.0, 
    "views": 265.0, 
    "version_views": 265.0, 
    "unique_downloads": 86.0, 
    "version_unique_views": 242.0, 
    "volume": 114133840.0, 
    "version_downloads": 110.0, 
    "downloads": 110.0, 
    "version_volume": 114133840.0
  }, 
  "links": {
    "doi": "https://doi.org/10.34777/bkr1-ay03", 
    "latest_html": "https://zenodo.org/record/4434251", 
    "bucket": "https://zenodo.org/api/files/15728244-b85b-43c1-8120-3d604543f623", 
    "badge": "https://zenodo.org/badge/doi/10.34777/bkr1-ay03.svg", 
    "html": "https://zenodo.org/record/4434251", 
    "latest": "https://zenodo.org/api/records/4434251"
  }, 
  "created": "2021-01-12T11:58:32.323626+00:00", 
  "updated": "2021-01-13T03:37:12.393521+00:00", 
  "conceptrecid": "4434250", 
  "revision": 2, 
  "id": 4434251, 
  "metadata": {
    "access_right_category": "success", 
    "doi": "10.34777/bkr1-ay03", 
    "description": "<p><strong>German-English Code-Switching speech dataset</strong></p>\n\n<p>We provide the means to resegment a subset of the German <strong>Spoken Wikipedia Corpus</strong> (SWC) with a particular focus on code-switching. The result is the German-English code-switching corpus, a 34 h transcribed speech corpus of read Wikipedia articles that can be used as a benchmark for research on code-switching. The articles are read by a large and diverse group of people. The SWC is perhaps the largest corpus of freely available aligned speech for German. It contains 1014 spoken articles read by more than 350 identified speakers, comprising 386 h of speech. This corpus is available at http://nats.gitlab.io/swc.</p>\n\n<p>Because most SWC articles are long, the recordings submitted by the volunteers are also long (about 54 min on average). These audio files are manually annotated at both the word level and the segment level in XML format. We use a language identification tool to detect code-switching in the transcription of the audio files with consecutive indices. To extract intra-sentential code-switching segments, we ensure that each detected code-switch is preceded and followed by German words or sentences. The final set consists of 34 h of speech data and 12,437 code-switching segments (in Kaldi ASR toolkit data format).</p>\n\n<p><strong>Citation</strong></p>\n\n<p>@article{baumann2019spoken,<br>\n&nbsp; title={The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening},<br>\n&nbsp; author={Baumann, Timo and K{\\&quot;o}hn, Arne and Hennig, Felix},<br>\n&nbsp; journal={Language Resources and Evaluation},<br>\n&nbsp; volume={53},<br>\n&nbsp; number={2},<br>\n&nbsp; pages={303--329},<br>\n&nbsp; year={2019},<br>\n&nbsp; publisher={Springer}<br>\n}</p>\n\n<p>@article{grave2018learning,<br>\n&nbsp; title={Learning word vectors for 157 languages},<br>\n&nbsp; author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},<br>\n&nbsp; journal={arXiv preprint arXiv:1802.06893},<br>\n&nbsp; year={2018}<br>\n}</p>", 
    "language": "deu", 
    "title": "Code-Switching Speech Corpus", 
    "license": {
      "id": "CC-BY-SA-3.0"
    }, 
    "relations": {
      "version": [
        {
          "count": 1, 
          "index": 0, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "4434250"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "4434251"
          }
        }
      ]
    }, 
    "communities": [
      {
        "id": "idiap"
      }
    ], 
    "keywords": [
      "Automatic Speech recognition", 
      "German", 
      "Spoken Wikipedia Corpus", 
      "Read speech corpus"
    ], 
    "publication_date": "2021-01-12", 
    "creators": [
      {
        "orcid": "0000-0002-7108-2475", 
        "affiliation": "Idiap Research Institute", 
        "name": "Khosravani, Abbas"
      }, 
      {
        "orcid": "0000-0002-0814-1348", 
        "affiliation": "Idiap Research Institute", 
        "name": "Garner, Philip N."
      }
    ], 
    "access_right": "open", 
    "resource_type": {
      "type": "dataset", 
      "title": "Dataset"
    }
  }
}
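
The record above lists a size and an `md5:` checksum for each file, so a downloaded copy can be checked against the metadata. The following is a minimal sketch of such a check, not part of the dataset itself. It assumes the JSON export has been loaded into a Python dictionary and that the files were downloaded into a local directory under the names given in `key`. The function names `md5_of_file` and `verify_record_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_record_files(record, directory="."):
    """Check local files against the "files" entries of a record dictionary.

    Returns a dict mapping each file key to True (size and MD5 match),
    False (mismatch), or None (file not found locally, or the checksum
    is not an MD5 checksum).
    """
    results = {}
    for entry in record.get("files", []):
        # Checksums in the record look like "md5:<hex digest>".
        algorithm, _, expected = entry["checksum"].partition(":")
        local = Path(directory) / entry["key"]
        if algorithm != "md5" or not local.is_file():
            results[entry["key"]] = None
            continue
        size_ok = local.stat().st_size == entry["size"]
        results[entry["key"]] = size_ok and md5_of_file(local) == expected.lower()
    return results
```

Assuming the export was saved as `record.json` (a hypothetical filename), `verify_record_files(json.load(open("record.json")), "downloads")` would report, for `code-switching.tar.gz` and `MD5SUM.txt`, whether the local copies match the recorded size and checksum.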
Views 265
Downloads 110
Data volume 114.1 MB
Unique views 242
Unique downloads 86
