Leader: 00000nmm##2200000uu#4500
Record ID: 4434251
DOI: 10.34777/bkr1-ay03
OAI identifier: oai:zenodo.org:4434251
Community: user-idiap
Title: Code-Switching Speech Corpus
Creators:
  Garner, Philip N. (ORCID 0000-0002-0814-1348), Idiap Research Institute
  Khosravani, Abbas (ORCID 0000-0002-7108-2475), Idiap Research Institute
Access: info:eu-repo/semantics/openAccess
License: Creative Commons Attribution Share Alike 3.0 Unported (SPDX: cc-by-sa-3.0), https://creativecommons.org/licenses/by-sa/3.0/legalcode
Keywords: Automatic speech recognition; German Spoken Wikipedia Corpus; Read speech corpus; German-English code-switching speech dataset

Description

We provide means to resegment a subset of the German **Spoken Wikipedia Corpus** (SWC), enabling a particular focus on code-switching. The result is the German-English code-switching corpus, a 34 h transcribed speech corpus of read Wikipedia articles that can be used as a benchmark for research on code-switching. The articles are read by a large and diverse group of people.

The SWC is perhaps the largest corpus of freely available aligned speech for German. It contains 1014 spoken articles read by more than 350 identified speakers, comprising 386 h of speech. The corpus is available at http://nats.gitlab.io/swc.

Because most SWC articles are long, the recordings submitted by the volunteers are also long (about 54 min on average). These audio files are manually annotated at word level and at segment level in XML format. We use a language identification tool to detect code-switching in the transcription of the audio files with consecutive indices. To extract intra-sentential code-switching segments, we ensure that the detected code-switching is preceded and followed by German words or sentences. The final set consists of 34 h of speech data and 12,437 code-switching segments (in Kaldi ASR toolkit data format).
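The intra-sentential filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that a language identification step has already produced one language label per word, and the function name `intra_sentential_spans` and the labels `"de"` and `"en"` are hypothetical choices made for this example. The record does not say which tool is used or at what granularity it is applied.

```python
from typing import List, Tuple


def intra_sentential_spans(
    labels: List[str], matrix: str = "de", embedded: str = "en"
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of embedded-language runs
    that are both preceded and followed by a matrix-language word.

    `labels` holds one language code per word (assumed input format).
    """
    spans = []
    i, n = 0, len(labels)
    while i < n:
        if labels[i] != embedded:
            i += 1
            continue
        start = i
        # Extend over the whole run of embedded-language words.
        while i < n and labels[i] == embedded:
            i += 1
        end = i
        preceded = start > 0 and labels[start - 1] == matrix
        followed = end < n and labels[end] == matrix
        if preceded and followed:
            spans.append((start, end))
    return spans


# Hypothetical example sentence with hand-assigned labels.
words = "Das ist ein Machine Learning Modell".split()
labels = ["de", "de", "de", "en", "en", "de"]
switched = [words[s:e] for s, e in intra_sentential_spans(labels)]
```

Runs at the very start or end of the label sequence are rejected, which mirrors the requirement that a switch be surrounded by German on both sides.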
Citation

@article{baumann2019spoken,
  title={The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening},
  author={Baumann, Timo and K{\"o}hn, Arne and Hennig, Felix},
  journal={Language Resources and Evaluation},
  volume={53},
  number={2},
  pages={303--329},
  year={2019},
  publisher={Springer}
}

@article{grave2018learning,
  title={Learning word vectors for 157 languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1802.06893},
  year={2018}
}

Language: deu
Publisher: Zenodo
Publication date: 2021-01-12
Community: user-idiap
Resource type: info:eu-repo/semantics/other
Record timestamp: 20210113033712.0
Files:
  https://zenodo.org/records/4434251/files/code-switching.tar.gz (1426652 bytes, md5:ee84d7577cb7bcb080fae2e0a903b700)
  https://zenodo.org/records/4434251/files/MD5SUM.txt (56 bytes, md5:7007facb8694b47b27097fd0b73ce12f)
Access status: open
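The MD5 checksums listed for the two files can be checked after download with a short script. The sketch below uses only Python's standard `hashlib`. The helper names `md5_hex` and `verify` are hypothetical, and the script assumes the files have already been saved under their original names in a local directory.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file entries of this record.
EXPECTED = {
    "code-switching.tar.gz": "ee84d7577cb7bcb080fae2e0a903b700",
    "MD5SUM.txt": "7007facb8694b47b27097fd0b73ce12f",
}


def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_hex(path) == expected
    return results
```

Calling `verify("downloads")` returns a dictionary with one boolean per file. A missing file is reported as `False` rather than raising an error.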