There is a newer version of this record available.

Dataset Open Access

The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms

Lara Orlandic; Tomas Teijeiro; David Atienza

Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. Furthermore, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world’s most urgent health crises.

For more information about the data collection, pre-processing, validation, and data structure, please refer to the accompanying publication.

The cough pre-processing and feature extraction code is available from a c4science repository.

Files (951.4 MB)
                   All versions   This version
Views              8,606          6,025
Downloads          4,177          3,247
Data volume        4.3 TB         3.1 TB
Unique views       7,209          5,280
Unique downloads   3,160          2,439

