Dataset Open Access

Code-Switching Speech Corpus

Khosravani, Abbas; Garner, Philip N.


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">deu</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Automatic speech recognition</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">German</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Spoken Wikipedia Corpus</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Read speech corpus</subfield>
  </datafield>
  <controlfield tag="005">20210113033712.0</controlfield>
  <controlfield tag="001">4434251</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Idiap Research Institute</subfield>
    <subfield code="0">(orcid)0000-0002-0814-1348</subfield>
    <subfield code="a">Garner, Philip N.</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">1426652</subfield>
    <subfield code="z">md5:ee84d7577cb7bcb080fae2e0a903b700</subfield>
    <subfield code="u">https://zenodo.org/record/4434251/files/code-switching.tar.gz</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">56</subfield>
    <subfield code="z">md5:7007facb8694b47b27097fd0b73ce12f</subfield>
    <subfield code="u">https://zenodo.org/record/4434251/files/MD5SUM.txt</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2021-01-12</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="p">user-idiap</subfield>
    <subfield code="o">oai:zenodo.org:4434251</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Idiap Research Institute</subfield>
    <subfield code="0">(orcid)0000-0002-7108-2475</subfield>
    <subfield code="a">Khosravani, Abbas</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Code-Switching Speech Corpus</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-idiap</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by-sa/3.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution Share Alike 3.0 Unported</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;&lt;strong&gt;German-English Code-Switching speech dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We provide the means to resegment a subset of the German &lt;strong&gt;Spoken Wikipedia Corpus&lt;/strong&gt; (SWC) with a particular focus on code-switching. The result is the German-English code-switching corpus, a 34-hour transcribed speech corpus of read Wikipedia articles that can be used as a benchmark for research on code-switching. The articles are read by a large and diverse group of people. The SWC is perhaps the largest corpus of freely available aligned speech for German. It contains 1014 spoken articles read by more than 350 identified speakers, comprising 386 hours of speech. The SWC is available at http://nats.gitlab.io/swc.&lt;/p&gt;

&lt;p&gt;Because most SWC articles are long, the recordings submitted by the volunteers are also long (about 54 minutes on average). These audio files are manually annotated at word level and at segment level in XML format. We use a language identification tool to detect code-switching in the transcriptions of the audio files with consecutive indices. To extract intra-sentential code-switching segments, we ensure that each detected code-switched span is preceded and followed by German words or sentences. The final set consists of 34 hours of speech and 12,437 code-switching segments (in Kaldi ASR toolkit data format).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Citation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;@article{baumann2019spoken,&lt;br&gt;
&amp;nbsp; title={The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening},&lt;br&gt;
&amp;nbsp; author={Baumann, Timo and K{\&amp;quot;o}hn, Arne and Hennig, Felix},&lt;br&gt;
&amp;nbsp; journal={Language Resources and Evaluation},&lt;br&gt;
&amp;nbsp; volume={53},&lt;br&gt;
&amp;nbsp; number={2},&lt;br&gt;
&amp;nbsp; pages={303--329},&lt;br&gt;
&amp;nbsp; year={2019},&lt;br&gt;
&amp;nbsp; publisher={Springer}&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;@article{grave2018learning,&lt;br&gt;
&amp;nbsp; title={Learning word vectors for 157 languages},&lt;br&gt;
&amp;nbsp; author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},&lt;br&gt;
&amp;nbsp; journal={arXiv preprint arXiv:1802.06893},&lt;br&gt;
&amp;nbsp; year={2018}&lt;br&gt;
}&lt;/p&gt;

</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.34777/bkr1-ay03</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
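
The 856 fields in the record above carry a download URL (subfield u), a size in bytes (subfield s), and an MD5 checksum (subfield z) for each file. As a minimal sketch, assuming Python 3 with only the standard library, the snippet below shows one way to read those fields from a MARC21 slim record and check a downloaded file against its checksum. The function names `list_files` and `md5_matches` and the embedded `SAMPLE_RECORD` excerpt are illustrative and are not part of the dataset or of Zenodo's tooling.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace used by MARC21 slim records such as the export above.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Excerpt of the record above, reduced to its two 856 (file location) fields.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">1426652</subfield>
    <subfield code="z">md5:ee84d7577cb7bcb080fae2e0a903b700</subfield>
    <subfield code="u">https://zenodo.org/record/4434251/files/code-switching.tar.gz</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">56</subfield>
    <subfield code="z">md5:7007facb8694b47b27097fd0b73ce12f</subfield>
    <subfield code="u">https://zenodo.org/record/4434251/files/MD5SUM.txt</subfield>
  </datafield>
</record>"""


def list_files(marc_xml):
    """Return one dict (url, size, md5) per 856 datafield in the record."""
    root = ET.fromstring(marc_xml)
    files = []
    for field in root.findall(".//marc:datafield[@tag='856']", MARC_NS):
        # Map each subfield code to its text, e.g. {"s": "56", "z": "md5:...", "u": "https://..."}.
        sub = {s.get("code"): (s.text or "").strip()
               for s in field.findall("marc:subfield", MARC_NS)}
        checksum = sub.get("z", "")
        if checksum.startswith("md5:"):
            checksum = checksum[len("md5:"):]
        files.append({
            "url": sub.get("u", ""),
            "size": int(sub.get("s", "0")),
            "md5": checksum,
        })
    return files


def md5_matches(path, expected_hex, chunk_size=1 << 20):
    """Hash the file at `path` in chunks and compare with the expected hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


for entry in list_files(SAMPLE_RECORD):
    print(entry["url"], entry["size"], entry["md5"])
```

After downloading a file, a call such as `md5_matches("code-switching.tar.gz", entry["md5"])` would report whether the local copy matches the checksum listed in the record. MD5 is used here only because it is the checksum the record provides.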
Views 262
Downloads 108
Data volume 112.7 MB
Unique views 239
Unique downloads 85
