DataSED - Dataset for Sound Event Detection of environmental noise

Fredianelli, Luca; Artuso, Francesco; Pompei, Geremia; Licitra, Gaetano; Iannace, Gino; Akbaba, Andaç

doi:10.5281/zenodo.15346092

Published May 5, 2025 | Version v1

Video/Audio Open

1. Institute for Chemical-Physical Processes of the Italian Research Council
2. University of Pisa - Physics Department
3. University of Campania "Luigi Vanvitelli"

Contributors

Data curator:

Akbaba, Andaç¹

Data manager:

Fredianelli, Luca²

1. University of Campania "Luigi Vanvitelli"
2. Institute for Chemical-Physical Processes of the Italian Research Council

Sound Event Detection (SED) has recently attracted significant attention from the academic community, with applications extending to a variety of disciplines, including environmental acoustics. Interest in outdoor measurements has grown notably, with a view to distinguishing the contribution of a particular source from the background noise. Machine learning tools depend on the availability of substantial datasets for training or validation. DataSED is an open-access dataset specifically designed for Sound Event Detection of environmental noise that can be heard in outdoor environments.

The collection consists of 717 .wav audio tracks sampled at 44.1 kHz, with ground-truth labels in .csv format. All files are non-synthesized audio recordings gathered from two sources: sound level measurements and online repositories. The authors analysed the sound samples, which cover a diverse range of environments, from urban to rural settings. Two versions of the labels are provided to support the development and evaluation of two types of application: monophonic and polyphonic sound detection. The first version does not contain overlapping events of different classes, meaning that each moment in time is assigned to at most one class. The second version incorporates overlapping events from multiple classes, providing a more realistic representation of real-world conditions. The labels were manually annotated by experts through a rigorous process to ensure high quality and usability.

The monophonic version covers the complete set of 22 defined sound classes: Bells; Birds; Cat fights and moans; Chicken coop; Cicadas and crickets; Crows, seagulls and magpies; Dog barkings and howlings; Glass breaking; Horn; Jet aircrafts; Lawn mower, brush cutter and olive shaker; Music; Propeller aircrafts; Sirens and alarms; Thunder, fireworks and gunshot; Train; Vacuum cleaner, fan and hairdryer; Vehicle idling; Vehicle pass-by; Voices; Wind turbine; Workshop. The polyphonic version comprises 21 sound classes, excluding the Wind turbine category. This exclusion arises from the nature of wind turbine recordings: the associated audio files have been designated for monophonic processing only.
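The distinction between the two label versions can be illustrated with a minimal Python sketch. It assumes each event has already been read from a label file into an (onset, offset, class) tuple with times in seconds; that tuple layout, the function names and the sample events are illustrative assumptions, not the documented structure of the DataSED .csv files.

```python
from typing import List, Tuple

# Hypothetical in-memory layout: (onset_s, offset_s, class_label).
# The real column names and order of the DataSED .csv files may differ.
Event = Tuple[float, float, str]


def overlapping_pairs(events: List[Event]) -> List[Tuple[Event, Event]]:
    """Return pairs of events from different classes that overlap in time."""
    ordered = sorted(events, key=lambda e: e[0])
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                # Sorted by onset: no later event can overlap `first`.
                break
            if second[2] != first[2]:
                pairs.append((first, second))
    return pairs


def is_monophonic(events: List[Event]) -> bool:
    """True when no two events of different classes overlap."""
    return not overlapping_pairs(events)


# Made-up sample events, for illustration only.
mono_example = [(0.0, 2.0, "Birds"), (2.0, 5.0, "Train")]
poly_example = [(0.0, 3.0, "Birds"), (1.5, 4.0, "Voices")]

print(is_monophonic(mono_example))
print(is_monophonic(poly_example))
```

Events that merely touch (one ends exactly when the next begins) are treated as non-overlapping here, which is a design choice of this sketch rather than a rule stated by the dataset.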

DataSED was developed to facilitate temporal analysis of multiple sound events within real-world acoustic scenarios. The collection comprises continuous, unsegmented sound recordings that include simultaneous sound events. In some tracks, extended periods of minimal or no change were manually removed, and sounds were manually added to achieve a more dynamic effect. A track need not be labelled in its entirety: there may be periods where no source is identified, such as silences or unrecognisable sounds.
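Since tracks may contain periods where no source is identified, it can be useful to locate those periods before training or evaluation. The following Python sketch computes the unlabelled intervals of a single track. It assumes events are available as (onset, offset, class) tuples in seconds and that the track duration is known; both the layout and the function name are hypothetical, and the sample values are made up.

```python
from typing import List, Tuple

# Hypothetical layout: (onset_s, offset_s, class_label).
Event = Tuple[float, float, str]


def unlabelled_gaps(events: List[Event], track_duration: float) -> List[Tuple[float, float]]:
    """Return (start, end) intervals of the track not covered by any event."""
    # Clip every event to the track and sort the intervals by onset.
    intervals = sorted(
        (min(max(onset, 0.0), track_duration), min(max(offset, 0.0), track_duration))
        for onset, offset, _label in events
    )
    gaps = []
    cursor = 0.0  # end of the region covered so far
    for onset, offset in intervals:
        if onset > cursor:
            gaps.append((cursor, onset))
        cursor = max(cursor, offset)
    if cursor < track_duration:
        gaps.append((cursor, track_duration))
    return gaps


# Made-up sample: a 10-second track with three events, two of them overlapping.
sample_events = [(1.0, 3.0, "Birds"), (2.5, 4.0, "Voices"), (6.0, 8.0, "Train")]
print(unlabelled_gaps(sample_events, 10.0))
```

The same function works for both label versions, because overlapping events are merged implicitly by tracking the end of the covered region.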

Each class was required to have a minimum of 100 entries.
The files are stored in a single folder and named in a consistent pattern, from S-0001 to S-0717, where S indicates "Sample". The shortest audio clip is 2.29 seconds, while the longest is 285.0 seconds, with an average audio duration of 87.18 seconds across all clips. The total duration of the audio content is approximately 17.02 hours.
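A short Python sketch of how the naming pattern and the duration statistics might be checked on a local copy of the dataset is given below. It assumes that the .wav extension is appended directly to the stated pattern (for example S-0001.wav) and that the files are PCM-encoded, which is what the standard-library wave module reads; the function names and the folder argument are hypothetical.

```python
import wave
from pathlib import Path
from typing import Dict, List


def expected_names(count: int = 717) -> List[str]:
    """File names following the S-0001 ... S-0717 pattern (extension assumed)."""
    return [f"S-{index:04d}.wav" for index in range(1, count + 1)]


def wav_duration_seconds(path: Path) -> float:
    """Duration of one PCM .wav file, computed from frame count and sample rate."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def duration_summary(folder: Path) -> Dict[str, float]:
    """Minimum, maximum, mean and total duration of all S-*.wav files in a folder."""
    durations = [wav_duration_seconds(p) for p in sorted(Path(folder).glob("S-*.wav"))]
    if not durations:
        raise ValueError(f"no S-*.wav files found in {folder}")
    total = sum(durations)
    return {
        "count": len(durations),
        "min_s": min(durations),
        "max_s": max(durations),
        "mean_s": total / len(durations),
        "total_h": total / 3600.0,
    }


def missing_files(folder: Path, count: int = 717) -> List[str]:
    """Expected file names that are not present in the folder."""
    present = {p.name for p in Path(folder).glob("S-*.wav")}
    return [name for name in expected_names(count) if name not in present]
```

Calling duration_summary on the folder that holds the audio files returns the clip count together with the shortest, longest, mean and total durations, which can then be compared with the figures stated above.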

The authors expect the dataset to contribute to future research in real-world sound event analysis and automated acoustic evaluation by means of machine learning.
