InsectSet459: A large dataset for automatic acoustic identification of insects (Orthoptera and Cicadidae)
Authors/Creators
Description
*Version 1.1: The full version of the dataset, including the test data, is here! Now including temperature data where available.
Background
In 2024, the public animal sound database xeno-canto saw a dramatic increase in insect sound recordings. This is due to the publication of several large collections of field and laboratory recordings from insect sound experts, as well as a growing number of citizen scientists uploading their insect sound observations to the website. We used this opportunity to expand our previously published datasets (InsectSet32, InsectSet47 and InsectSet66) and compile the first large-scale dataset of insect sounds that is easy to use for training deep learning methods to detect and classify insect sounds in the wild. A short pre-print describing the dataset curation and characteristics in more detail, as well as results from two baseline classifiers trained on the dataset, is available (DOI: 10.48550/arXiv.2503.15074) and is being submitted for publication in a journal.
Data curation
Recordings from xeno-canto (Orthoptera), iNaturalist (Orthoptera & Cicadidae) and BioAcoustica (Cicadidae) were downloaded and pooled together. Several filtering steps were then applied to arrive at the final selection of recordings. From iNaturalist, only research-grade observations were downloaded. For observations with multiple audio files attached, only one file was downloaded. If users uploaded to both iNaturalist and xeno-canto, only the files from one of the platforms were used. To further avoid duplicate uploads, a checksum test was applied to the entire source dataset. Another common occurrence is serial uploads from one location and time period split into separate observations (especially common on xeno-canto), which could include the same individual animals vocalizing. This problem was addressed by pooling all recordings by username, species, geographic location, date and time, and selecting only one recording from each one-hour period.
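The checksum test and the one-recording-per-hour selection can be sketched roughly as follows. This is a minimal illustration, not the code used to compile the dataset: the record fields (`user`, `species`, `location`, `recorded_at`, `checksum`), the choice of MD5, the use of calendar-hour buckets, and the rule that the first recording in a group is kept are all assumptions made for the example.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of a file's raw bytes (MD5 is an assumed choice)."""
    return hashlib.md5(data).hexdigest()


def drop_exact_duplicates(records):
    """Keep only the first record seen for each distinct checksum."""
    seen = set()
    unique = []
    for rec in records:
        if rec["checksum"] not in seen:
            seen.add(rec["checksum"])
            unique.append(rec)
    return unique


def one_per_hour(records):
    """Keep one recording per (user, species, location, calendar hour) group."""
    groups = {}
    for rec in records:
        # Assumption: a "one-hour period" is approximated by the calendar hour.
        hour = rec["recorded_at"].replace(minute=0, second=0, microsecond=0)
        key = (rec["user"], rec["species"], rec["location"], hour)
        groups.setdefault(key, rec)  # the first recording in each group wins
    return list(groups.values())
```

Applying `drop_exact_duplicates` before `one_per_hour` mirrors the order in which the two steps are described above.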
After these filtering steps, all files from species with at least 10 sound examples were selected for the final dataset. All stereo files were converted to mono, and file formats were standardized to wav and mp3. Recordings longer than two minutes were automatically trimmed. Species nomenclature was unified to COL24.4 2024-04-26 [294826] using checklistbank.
This new dataset greatly increases the number of species included, from 66 in InsectSet66 to 459 unique species from the groups Orthoptera and Cicadidae, while also strongly increasing the geographic coverage of recording locations. The total duration and the number of sound examples are also heavily expanded, to a total of 26298 files containing 9.5 days of audio material with sample rates ranging from 8 to 500 kHz.
The code used to compile this dataset is available on GitHub.
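A minimal sketch of the species threshold, mono conversion and trimming steps is shown below. The text does not say how channels were combined or which part of a long recording was kept, so averaging the channels and keeping the first two minutes are assumptions, as are the function names and the `species` field.

```python
from collections import Counter

MIN_EXAMPLES = 10   # species threshold stated in the text
MAX_SECONDS = 120   # two-minute cap stated in the text


def keep_frequent_species(records, min_examples=MIN_EXAMPLES):
    """Keep records whose species has at least min_examples recordings."""
    counts = Counter(rec["species"] for rec in records)
    return [rec for rec in records if counts[rec["species"]] >= min_examples]


def to_mono(frames):
    """Average the channels of each frame (assumed downmix method).

    frames is a list of tuples, one value per channel.
    """
    return [sum(frame) / len(frame) for frame in frames]


def trim(samples, sample_rate, max_seconds=MAX_SECONDS):
    """Keep at most max_seconds of audio from the start (assumed trim position)."""
    return samples[: int(sample_rate * max_seconds)]
```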
Dataset Usage
All recordings are licensed under Creative Commons 4.0 or CC0 licenses. We excluded no-derivatives licenses to simplify further usage of this dataset. For machine-learning purposes, the dataset was split into training, validation and test sets while ensuring a roughly equal distribution of audio files and audio material for every species across all three subsets. This resulted in a 60/20/20 split (train/validation/test) by file number and file length.
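A per-species split of this kind could be sketched as below. This simplified version balances only the number of files per species, not the total audio length, and the function name, the `species` field, the rounding rule and the fixed random seed are assumptions for illustration rather than the procedure actually used.

```python
import random
from collections import defaultdict


def split_by_species(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split records into train/validation/test separately for each species."""
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for rec in records:
        by_species[rec["species"]].append(rec)

    splits = {"train": [], "validation": [], "test": []}
    for species in sorted(by_species):
        files = by_species[species]
        rng.shuffle(files)
        # Validation and test sizes are rounded; train takes the remainder.
        n_val = round(len(files) * fractions[1])
        n_test = round(len(files) * fractions[2])
        n_train = len(files) - n_val - n_test
        splits["train"] += files[:n_train]
        splits["validation"] += files[n_train:n_train + n_val]
        splits["test"] += files[n_train + n_val:]
    return splits
```

With a minimum of 10 recordings per species, this rounding leaves every species with at least two files in both the validation and the test subset.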
Files
Annotation file: InsectSet459_Train_Val_Test_Annotation.csv
Total size: 83.7 GB
| MD5 checksum | Size |
|---|---|
| f3cb378e2c3b8ba5704f37ee0bb6d8a6 | 6.9 MB |
| e2037e90f51b35eb89bb9cea516c0a6d | 16.0 GB |
| 46f535ca43559bcd87b45d6fab189c5f | 51.2 GB |
| 3b7fbbb4b2e764a5020f76b7ce1ba9ae | 20.5 kB |
| 0027a6ae81bd5b06526339efcfa285d5 | 16.4 GB |
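The MD5 values listed for each file can be used to check a download for corruption. The sketch below reads a file in chunks so that multi-gigabyte archives do not have to fit in memory; the function names and the chunk size are illustrative choices, and the path you pass in is whatever local file name you saved the download under.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value, with or without an 'md5:' prefix."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return file_md5(path) == expected
```

For example, `matches_listing(local_path, "md5:0027a6ae81bd5b06526339efcfa285d5")` should return `True` for an intact copy of the corresponding 16.4 GB file.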
Additional details
Related works
- Continues
  - Dataset: 10.5281/zenodo.8252141 (DOI)
  - Dataset: 10.5281/zenodo.7072196 (DOI)
  - Journal article: 10.1371/journal.pcbi.1011541 (DOI)
- Is described by
  - Preprint: 10.48550/arXiv.2503.15074 (DOI)
Dates
- Updated: 2026-02-09
Software
- Repository URL: https://github.com/mariusfaiss/InsectSet459
- Programming language: Python