Published August 16, 2023 | Version 1.0
Dataset Open

InsectSet47 & InsectSet66: Expanded datasets for automatic acoustic identification of insects (Orthoptera and Cicadidae)

  • 1. Naturalis Biodiversity Center, Leiden University

Contributors

Supervisor:

  • 1. FLORON, Nijmegen
  • 2. Natural History Museum, London
  • 3. Naturalis Biodiversity Center, Leiden

Description

Updated full version with training, validation and test sets.

Two newly compiled datasets for training neural networks to identify insect species automatically. They were used to compare adaptive, waveform-based frontends with conventional mel-spectrogram frontends for audio feature extraction. This work was published in PLOS Computational Biology, and the machine learning implementations were published on GitHub.

These datasets expand on the previously published InsectSet32 by including recently published collections of insect recordings made by citizen scientists from around the world. Recordings from BioAcoustica, xeno-canto and iNaturalist, as well as private collections by Baudewijn Odé, were downloaded and manually inspected. Files with strong noise interference or intense filtering, as well as files containing sounds of multiple species, were removed to compile these datasets. The files were standardised to 44.1 kHz mono WAV files ranging in length from less than one second to several minutes. Files containing long periods without insect sounds were edited into multiple smaller files with silent periods no longer than 5 seconds. These files are marked as edits in the annotation file, and all edits of the same recording should be assigned to the same train/validation/test subset to prevent data leakage. The annotation files contain information for each recording, including the file name, species name and identifier, as well as the data subset it was included in for training the neural network (training, validation, test).
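The rule of keeping edited files together can be sketched as below. This is a minimal illustration, not the procedure used to build the published splits. The field names `file_name` and `source_id` are hypothetical (the actual column names in the annotation files may differ), and the sketch does not balance species across subsets.

```python
import random
from collections import defaultdict


def grouped_split(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign files to train/validation/test by group.

    Every record sharing a (hypothetical) `source_id` ends up in the
    same subset, so edits cut from one recording never straddle subsets.
    Returns a dict mapping file name to subset name.
    """
    names = ("train", "validation", "test")

    # Collect the files that belong to each original recording.
    groups = defaultdict(list)
    for rec in records:
        groups[rec["source_id"]].append(rec["file_name"])

    # Shuffle the groups (not the files) reproducibly.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    # Cut the shuffled group list according to the requested fractions.
    n = len(group_ids)
    cut1 = round(fractions[0] * n)
    cut2 = cut1 + round(fractions[1] * n)

    assignment = {}
    for i, gid in enumerate(group_ids):
        subset = names[0] if i < cut1 else names[1] if i < cut2 else names[2]
        for file_name in groups[gid]:
            assignment[file_name] = subset
    return assignment
```

Because the fractions apply to groups rather than files, the resulting file counts only approximate the requested proportions when groups differ in size.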

InsectSet47 expands on InsectSet32 with recordings from xeno-canto and contains 1006 original recordings from 47 species, with at least ten files per species. The total length of InsectSet47 is 22 hours. InsectSet66 further expands on InsectSet47 by adding research-grade audio observations from iNaturalist, with a total of 1554 recordings from 66 species, a total length of over 24 hours and a minimum of ten files per species.

The datasets were split into the training, validation and test sets while ensuring a roughly equal distribution of audio files and audio material for every species in all three subsets. This resulted in a 60/20/20 split (train/validation/test) by file number and a 64/19.5/16.5 split by file length.
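The two ways of measuring a split (by file number and by file length) can differ, as the figures above show. A small sketch of how both shares might be computed follows. The input format, a list of `(subset, duration_seconds)` pairs, is an assumption made for illustration and is not the layout of the annotation files.

```python
from collections import defaultdict


def split_shares(rows):
    """Return {subset: (percent_of_files, percent_of_duration)}.

    `rows` is an iterable of (subset_name, duration_in_seconds) pairs;
    this input shape is hypothetical.
    """
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for subset, duration in rows:
        counts[subset] += 1
        seconds[subset] += duration

    total_files = sum(counts.values())
    total_seconds = sum(seconds.values())
    return {
        subset: (
            100.0 * counts[subset] / total_files,
            100.0 * seconds[subset] / total_seconds,
        )
        for subset in counts
    }
```

A subset whose files are longer than average takes a larger share by length than by file number, which is how a 60/20/20 split by count can coexist with a different split by duration.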

Files

InsectSet47_Train_Val_Test.zip

Files (9.9 GB)

md5:f2b5d8fcf237666e682403d18a394542  4.7 GB
md5:58af210ef60753e5c738aa857016c3b5  376.4 kB
md5:83d23ff76e5ce856fe727f1c982e43a4  5.2 GB
md5:a0240c6b625196f98cddb989a28f9e93  492.4 kB
md5:7414d6e819279b663956bdf51fbdc190  7.7 kB
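
The md5 checksums listed above can be used to confirm that a download completed intact. A minimal sketch using only the Python standard library follows. The helper names are hypothetical, and the path passed in is whatever local file name the download was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A mismatch usually indicates an interrupted or corrupted download, in which case the file should be fetched again before unpacking.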