Published September 12, 2022 | Version 0.1
Dataset Open

InsectSet32: Dataset for automatic acoustic identification of insects (Orthoptera and Cicadidae)

  • 1. Leiden University

Contributors

Supervisor:

  • 1. FLORON, Nijmegen
  • 2. Natural History Museum, London
  • 3. Naturalis Biodiversity Centre, Leiden

Description

This dataset contains recordings of 32 sound-producing insect species, with a total of 335 files and a combined length of 57 minutes. It was compiled for training neural networks to identify insect species automatically, in a study comparing adaptive, waveform-based frontends with conventional mel-spectrogram frontends for audio feature extraction. That work was published in PLOS Computational Biology, and this dataset can be used to replicate its results or for other purposes. The scripts for audio processing and the machine learning implementations are published on GitHub.

The recordings are split into two datasets. Just under half of the recordings (147) are of nine species belonging to the order Orthoptera. These recordings stem from a dataset that was originally compiled by Baudewijn Odé (unpublished).

The remaining recordings (188) are of 23 species in the family Cicadidae. These recordings were selected from the Global Cicada Sound Collection hosted on Bioacoustica (doi.org/10.1093/database/bav054), including recordings published in doi.org/10.3897/BDJ.3.e5792 and doi.org/10.11646/zootaxa.4340.1. Many recordings from this collection included speech annotations at the beginning; therefore, only the last ten seconds of audio were extracted from them and used in this dataset.
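The tail extraction described above can be illustrated with a minimal sketch. It is not the authors' script (their processing code is on GitHub); it assumes uncompressed PCM WAV input and uses only the Python standard library. The function name `extract_tail` and its parameters are hypothetical.

```python
import wave


def extract_tail(src_path, dst_path, seconds=10.0):
    """Copy the last `seconds` of a PCM WAV file to `dst_path`.

    Returns the number of frames written. If the source is shorter
    than `seconds`, the whole file is copied.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        total = src.getnframes()
        # Number of frames that make up the requested tail.
        keep = min(total, int(round(seconds * src.getframerate())))
        src.setpos(total - keep)  # skip the leading (spoken) part
        frames = src.readframes(keep)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # frame count in the header is corrected on close
        dst.writeframes(frames)
    return keep
```

Cutting from the end rather than detecting speech keeps the step simple, at the cost of discarding usable audio from longer recordings.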

All files were manually inspected, and files with strong noise interference or with sounds of multiple species were removed. Per species, the number of files ranges from four to 22 and the total audio length from 40 seconds to almost nine minutes. Individual files range in length from less than one second to several minutes. All original files had sample rates of 44.1 kHz or higher and were resampled to 44.1 kHz mono WAV files for consistency. The annotation files contain information for each recording, including the file name, species name and identifier, as well as the data subset they were included in for training the neural network (training, test, validation).
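As a rough illustration of how the annotation files could be used to recover the training, test and validation subsets, the sketch below groups file names by subset label. The column names `file_name` and `subset` are assumptions made for this example, not the documented headers; check the CSV files and pass the actual names.

```python
import csv
from collections import defaultdict


def group_by_subset(csv_path, file_col="file_name", subset_col="subset"):
    """Map each subset label to the list of file names assigned to it.

    `file_col` and `subset_col` are hypothetical column names; adjust
    them to match the headers of the annotation file being read.
    """
    groups = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            label = row[subset_col].strip().lower()
            groups[label].append(row[file_col])
    return dict(groups)
```

Calling `group_by_subset("Cicadidae.csv")` would then return a dictionary keyed by the subset labels found in that file, provided the column names match.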

Files

Cicadidae.csv

Files (268.4 MB)

Checksum                              Size
md5:e56daabc4399a869d7feaae0be88eb70  27.2 kB
md5:d88b84913c9bcb7f312d3c07ad80ff05  138.3 MB
md5:a09291e3aae6bff4dfcdf880b9f26485  12.6 kB
md5:67344bac0b799e6b51dd901b322ad823  130.1 MB
md5:90928aba29b4c91a06d0e197d23a1ecb  5.0 kB
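
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of` and `matches_checksum` are hypothetical, and the local file path is whatever name the file was saved under.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of(path) == expected
```

For example, `matches_checksum(local_path, "md5:e56daabc4399a869d7feaae0be88eb70")` should return `True` for an uncorrupted copy of the 27.2 kB file.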