InsectSet47 & InsectSet66: Expanded datasets for automatic acoustic identification of insects (Orthoptera and Cicadidae)
- 1. FLORON, Nijmegen
- 2. Natural History Museum, London
- 3. Naturalis Biodiversity Centre, Leiden
Description
Updated full version with training, validation and test sets.
Two newly compiled datasets for training neural networks to automatically identify insect species, used to compare adaptive, waveform-based frontends with conventional mel-spectrogram frontends for audio feature extraction. This work was published in PLOS Computational Biology, and the machine learning implementations were published on GitHub.
These datasets expand on the previously published InsectSet32 by including recently published collections of insect recordings made by citizen scientists from around the world. Recordings from BioAcoustica, xeno-canto and iNaturalist, as well as private collections by Baudewijn Odé, were downloaded and manually inspected. Files with strong noise interference or intense filtering, as well as files containing sounds of multiple species, were removed when compiling these datasets. The files were standardised to 44.1 kHz mono WAV files ranging in length from less than one second to several minutes. Files containing long periods without insect sounds were edited into multiple smaller files with silent periods no longer than 5 seconds. These files are marked as edits in the annotation file, and all edits derived from the same recording should be assigned to the same subset (training, validation or test) to prevent data leakage. The annotation files contain information for each recording, including the file name, species name and identifier, as well as the data subset (training, validation or test) it was assigned to for training the neural network.
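The requirement that edited files stay together can be illustrated with a minimal Python sketch. It is not the procedure used to build the published subsets. The field names `file_name`, `species` and `source_id` are hypothetical, so check them against the actual annotation file, and the 60/20/20 ratios are taken from the split described on this page.

```python
import random
from collections import defaultdict


def grouped_split(rows, ratios=(0.6, 0.2, 0.2), seed=0):
    """Assign files to subsets so that all edits of one recording stay together.

    `rows` is a list of dicts with the hypothetical keys
    'file_name', 'species' and 'source_id' (the original recording
    an edited file was cut from).
    """
    # species -> source recording -> list of files derived from it
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        groups[row["species"]][row["source_id"]].append(row["file_name"])

    rng = random.Random(seed)
    assignment = {}
    for species in sorted(groups):
        source_ids = sorted(groups[species])
        rng.shuffle(source_ids)
        n_train = round(ratios[0] * len(source_ids))
        n_val = round(ratios[1] * len(source_ids))
        for index, source_id in enumerate(source_ids):
            if index < n_train:
                subset = "train"
            elif index < n_train + n_val:
                subset = "validation"
            else:
                subset = "test"
            # Every file cut from this recording gets the same subset.
            for file_name in groups[species][source_id]:
                assignment[file_name] = subset
    return assignment
```

Splitting at the level of source recordings rather than individual files is what prevents leakage: a model never sees one fragment of a recording in training and another fragment in testing.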
InsectSet47 expands on InsectSet32 with recordings from xeno-canto and contains 1006 original recordings from 47 species, with at least ten files per species. The total length of InsectSet47 is 22 hours. InsectSet66 further expands on InsectSet47 by adding research-grade audio observations from iNaturalist, with a total of 1554 recordings from 66 species, a total length of over 24 hours and a minimum of ten files per species.
The datasets were split into training, validation and test sets while ensuring a roughly equal distribution of audio files and audio duration for every species across all three subsets. This resulted in a 60/20/20 split (train/validation/test) by file number and a 64/19.5/16.5 split by file length.
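The two sets of percentages differ because recordings vary widely in length. A short Python sketch shows how both figures can be computed from an annotation table. The keys `subset` and `duration_s` are hypothetical names, and the durations would have to be read from the audio files if the annotation file does not list them.

```python
from collections import defaultdict


def split_shares(rows):
    """Return {subset: (percent of files, percent of total duration)}.

    `rows` is a list of dicts with the hypothetical keys
    'subset' and 'duration_s' (length of the file in seconds).
    """
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for row in rows:
        counts[row["subset"]] += 1
        seconds[row["subset"]] += row["duration_s"]

    total_files = sum(counts.values())
    total_seconds = sum(seconds.values())
    return {
        subset: (
            100.0 * counts[subset] / total_files,
            100.0 * seconds[subset] / total_seconds,
        )
        for subset in counts
    }
```

For example, three 10-second training files plus one 5-second file each for validation and test give a 60/20/20 split by file number but a 75/12.5/12.5 split by duration.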
Files
InsectSet47_Train_Val_Test.zip
Total size: 9.9 GB
MD5 checksum | Size |
---|---|
f2b5d8fcf237666e682403d18a394542 | 4.7 GB |
58af210ef60753e5c738aa857016c3b5 | 376.4 kB |
83d23ff76e5ce856fe727f1c982e43a4 | 5.2 GB |
a0240c6b625196f98cddb989a28f9e93 | 492.4 kB |
7414d6e819279b663956bdf51fbdc190 | 7.7 kB |
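The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The function names are illustrative, and each checksum must be paired with its corresponding file as shown on the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected):
    """Return True if the file's MD5 matches `expected`.

    `expected` may be given with or without the 'md5:' prefix.
    """
    expected_hex = expected.strip().lower().split(":")[-1]
    return md5_of_file(path) == expected_hex
```

A mismatch usually indicates an interrupted or corrupted download, in which case the file should be fetched again before unpacking.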