Published September 1, 2022 | Version 1.0.0
Dataset Open

Introducing the COVID-19 YouTube (COVYT) speech dataset featuring the same speakers with and without infection

Description

The COVYT dataset contains speech samples from individuals who self-reported a COVID-19 infection on public social media platforms (YouTube, Xiaohongshu). These videos, together with videos of the same people recorded prior to infection, were collected to provide publicly available data for COVID-19 research. This release includes links to the original videos along with the manual segmentation and diarisation that identify the utterances of the target individuals. It also includes features derived from the segmented utterances and partitioning information for four different cross-validation schemes. See the arXiv preprint for more details: https://arxiv.org/abs/2206.11045

Files

Name: COVYT.zip
Size: 330.4 MB
md5:d4008a76ae7f1af5967114f259439fe7
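
The MD5 checksum listed above can be used to confirm that the archive downloaded intact. The following is a minimal sketch, not part of the dataset release: the function names `md5_of_file` and `verify_download` are hypothetical, and the local path `COVYT.zip` assumes the archive was saved under its original name in the working directory.

```python
import hashlib
from pathlib import Path

# Checksum as published on this record for COVYT.zip.
EXPECTED_MD5 = "d4008a76ae7f1af5967114f259439fe7"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("COVYT.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

Reading in chunks keeps memory use flat for a file of this size. A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.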

Additional details

Related works

Is described by
Preprint: https://arxiv.org/abs/2206.11045 (URL)

Funding

European Commission: sustAGE – Smart environments for person-centered sustainable work and well-being (grant 826506)