Voice of America: Ukrainian ASR Dataset of Broadcast Speech

Published December 6, 2022 | Version 1.0.0

Dataset | Open access

The dataset is built from public Voice of America recordings (https://ukrainian.voanews.com), with the audio extracted from their videos.

The dataset contains 398 hours of speech.

The dataset was created with the ASR Corpus Creator (https://zenodo.org/record/7396705).

File format: WAV audio sampled at 16 kHz.
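
A quick way to confirm that a downloaded audio file matches the stated 16 kHz rate is to read its header with Python's standard wave module. This is a minimal sketch, not part of the dataset tooling: the function names are hypothetical, and it assumes the files are uncompressed PCM WAV, which the description does not state explicitly.

```python
import wave

# Sampling rate stated in the dataset description.
EXPECTED_RATE_HZ = 16_000


def wav_sample_rate(path):
    """Return the sampling rate in Hz recorded in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_expected_rate(path, expected=EXPECTED_RATE_HZ):
    """Return True if the file's sampling rate equals the expected rate."""
    return wav_sample_rate(path) == expected
```

Files that fail this check would need resampling before being mixed with the rest of the corpus.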

Files

Name	MD5	Size
voa_clean.jsonl	13e486a0fe76ef0b519485bf6d6f4ef1	127.2 MB
voa_dataset.jsonl	920447b4bd9056612dfe6b877083ba89	147.6 MB
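
The MD5 checksums listed above can be used to verify a download before reading it. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, the files are assumed to be in the working directory, and the .jsonl files are assumed to follow the usual JSON Lines layout (one JSON object per line). The record fields are not documented here, so the reader makes no assumption about them.

```python
import hashlib
import json

# Checksums as listed in the Files table.
EXPECTED_MD5 = {
    "voa_clean.jsonl": "13e486a0fe76ef0b519485bf6d6f4ef1",
    "voa_dataset.jsonl": "920447b4bd9056612dfe6b877083ba89",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line (assumed JSON Lines)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For an intact download, md5_of_file("voa_clean.jsonl") should equal EXPECTED_MD5["voa_clean.jsonl"]. Inspecting the keys of the first record from iter_jsonl is a reasonable way to discover the schema.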

Speech Recognition for Ukrainian, https://github.com/egorsmkv/speech-recognition-uk
Smoliakov, Yehor. (2022). ASR Corpus Creator (1.5.1). Zenodo. https://doi.org/10.5281/zenodo.7396705