Voice of America: Ukrainian ASR Dataset of Broadcast Speech

Published December 6, 2022 | Version 1.0.2

Dataset | Open access

The dataset is built from public Voice of America Ukrainian broadcasts (https://ukrainian.voanews.com), with the audio extracted from their videos.

The dataset contains 398 hours of speech.

The dataset was created with the ASR Corpus Creator (https://zenodo.org/record/7396705).

File format: WAV audio sampled at 16 kHz.
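The stated sampling rate can be checked on any extracted file. The following is a minimal sketch using only Python's standard `wave` module; it assumes the files are uncompressed PCM WAV (which `wave` can read), and the helper names `wav_info` and `is_expected_rate` are illustrative, not part of the dataset or its tooling.

```python
import wave

# Sampling rate stated in the dataset description (16 kHz).
EXPECTED_RATE_HZ = 16_000


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return rate, duration


def is_expected_rate(path, expected=EXPECTED_RATE_HZ):
    """Check whether a WAV file uses the expected sampling rate."""
    rate, _ = wav_info(path)
    return rate == expected
```

Summing the durations returned by `wav_info` over all extracted files is one way to compare the total against the stated 398 hours.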

Files

Name	MD5	Size
1_1.zip	8fc5518a31e4686b16b8f9447de4b3de	9.5 GB
1_2.zip	cdb0a8072f11babc9aaf3463563c2cb2	5.3 GB
2.zip	d47ba6b2bafcc7886899abf7e6ec911f	6.9 GB
3.zip	462045dedd8f084d9e34de2d65aeb160	6.3 GB
4.zip	c8e0562572a17113de27e6c4cb779409	6.8 GB
5.zip	d692cd1629322ec67b233c1802ccc509	6.8 GB
voa_clean.jsonl	13e486a0fe76ef0b519485bf6d6f4ef1	127.2 MB
voa_dataset.jsonl	920447b4bd9056612dfe6b877083ba89	147.6 MB
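Because the archives are several gigabytes each, it may be worth verifying a download against the MD5 checksums listed above before extracting it. The sketch below uses Python's standard `hashlib` and reads each file in chunks so it never has to fit in memory. The checksums are copied from the file listing; the function names and the assumption that files keep their listed names locally are illustrative.

```python
import hashlib
import os

# MD5 checksums copied from the file listing above.
EXPECTED_MD5 = {
    "1_1.zip": "8fc5518a31e4686b16b8f9447de4b3de",
    "1_2.zip": "cdb0a8072f11babc9aaf3463563c2cb2",
    "2.zip": "d47ba6b2bafcc7886899abf7e6ec911f",
    "3.zip": "462045dedd8f084d9e34de2d65aeb160",
    "4.zip": "c8e0562572a17113de27e6c4cb779409",
    "5.zip": "d692cd1629322ec67b233c1802ccc509",
    "voa_clean.jsonl": "13e486a0fe76ef0b519485bf6d6f4ef1",
    "voa_dataset.jsonl": "920447b4bd9056612dfe6b877083ba89",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the listed checksum.

    Assumes the local file keeps its listed name (e.g. "2.zip").
    Raises KeyError for a name that is not in the listing.
    """
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected
```

For example, `verify_download("downloads/2.zip")` returns `False` for a truncated or corrupted download, in which case the file should be fetched again.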

Speech Recognition for Ukrainian, https://github.com/egorsmkv/speech-recognition-uk
Smoliakov, Yehor. (2022). ASR Corpus Creator (1.5.1). Zenodo. https://doi.org/10.5281/zenodo.7396705