Published February 9, 2026 | Version 1.1
Dataset Open

InsectSet459: A large dataset for automatic acoustic identification of insects (Orthoptera and Cicadidae)

  • 1. Max Planck Institute of Animal Behavior
  • 2. Naturalis Biodiversity Center

Description

*Version 1.1: The full version of the dataset, including the test data, is here! Now including temperature data where available.

Background

In 2024, the public animal sound database xeno-canto saw a dramatic increase in insect sound recordings. This was due to the publication of several large collections of field and laboratory recordings by insect sound experts, as well as a growing number of citizen scientists uploading their insect sound observations to the website. We used this opportunity to expand our previously published datasets (InsectSet32, InsectSet47 and InsectSet66) and compile the first large-scale dataset of insect sounds that is easy to use for training deep learning methods to detect and classify insect sounds in the wild. A short preprint describing the dataset curation and characteristics in more detail, as well as results from two baseline classifiers trained on the dataset, is available (arXiv:2503.15074) and is being submitted for publication in a journal.

Data curation

Recordings from xeno-canto (Orthoptera), iNaturalist (Orthoptera & Cicadidae) and BioAcoustica (Cicadidae) were downloaded and pooled. Several filtering steps were then applied to arrive at the final selection of recordings. From iNaturalist, only research-grade observations were downloaded. For observations with multiple audio files attached, only one file was downloaded. If users uploaded to both iNaturalist and xeno-canto, only the files from one of the platforms were used. To further avoid duplicate uploads, a checksum test was applied to the entire source dataset. Another common occurrence, especially on xeno-canto, is a series of uploads from one location and time period split into separate observations, which could include the same individual animals vocalizing. This problem was addressed by pooling all recordings by username, species, geographic location, date and time, and selecting only one recording from each one-hour period.
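The checksum test and the one-hour pooling described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the code used to build the dataset: the record fields (`id`, `user`, `species`, `location`, `time`, `checksum`), the function names, and the greedy rule of keeping the earliest recording in each one-hour window are all hypothetical choices made for the example.

```python
import hashlib
from datetime import datetime, timedelta


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 hex digest of raw file contents."""
    return hashlib.md5(data).hexdigest()


def drop_exact_duplicates(records):
    """Keep the first record seen for each distinct audio checksum."""
    seen = set()
    unique = []
    for rec in records:
        if rec["checksum"] not in seen:
            seen.add(rec["checksum"])
            unique.append(rec)
    return unique


def one_per_hour(records, window=timedelta(hours=1)):
    """Within each (user, species, location) group, keep a recording only if
    it starts at least `window` after the previously kept one (assumed rule)."""
    groups = {}
    for rec in records:
        key = (rec["user"], rec["species"], rec["location"])
        groups.setdefault(key, []).append(rec)
    kept = []
    for recs in groups.values():
        recs.sort(key=lambda r: r["time"])
        last_kept = None
        for rec in recs:
            if last_kept is None or rec["time"] - last_kept >= window:
                kept.append(rec)
                last_kept = rec["time"]
    return kept


# Hypothetical records: "b" follows "a" within the hour, "d" is a byte-identical copy of "a".
site = (52.1, 4.3)
records = [
    {"id": "a", "user": "u1", "species": "Gryllus campestris", "location": site,
     "time": datetime(2024, 6, 1, 21, 0), "checksum": md5_of_bytes(b"audio-a")},
    {"id": "b", "user": "u1", "species": "Gryllus campestris", "location": site,
     "time": datetime(2024, 6, 1, 21, 20), "checksum": md5_of_bytes(b"audio-b")},
    {"id": "c", "user": "u1", "species": "Gryllus campestris", "location": site,
     "time": datetime(2024, 6, 1, 22, 30), "checksum": md5_of_bytes(b"audio-c")},
    {"id": "d", "user": "u2", "species": "Gryllus campestris", "location": site,
     "time": datetime(2024, 6, 1, 21, 5), "checksum": md5_of_bytes(b"audio-a")},
]
selected = one_per_hour(drop_exact_duplicates(records))
```

In this toy input, the checksum step drops "d" and the pooling step drops "b", leaving recordings "a" and "c".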

After these filtering steps, all files from species with at least 10 sound examples were selected for the final dataset. All stereo files were converted to mono, and file formats were standardized to WAV and MP3. Recordings longer than two minutes were automatically trimmed. Species nomenclature was unified to COL24.4 2024-04-26 [294826] using ChecklistBank.
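A rough sketch of these selection and standardization steps is given below, working on plain Python lists so that it runs without audio libraries. The helper names are hypothetical, and two details are assumptions that the description does not specify: stereo is mixed down by averaging the channels, and long recordings are trimmed by keeping the first two minutes.

```python
from collections import Counter

MIN_EXAMPLES = 10   # minimum number of recordings per species
MAX_SECONDS = 120   # maximum recording length in seconds


def species_to_keep(labels, min_examples=MIN_EXAMPLES):
    """Return the set of species with at least `min_examples` recordings."""
    counts = Counter(labels)
    return {species for species, n in counts.items() if n >= min_examples}


def to_mono(frames):
    """Average the channels of each frame (assumed mixdown rule)."""
    return [sum(frame) / len(frame) for frame in frames]


def trim(samples, sample_rate, max_seconds=MAX_SECONDS):
    """Keep only the first `max_seconds` of audio (assumed trimming rule)."""
    return samples[: sample_rate * max_seconds]


# Hypothetical example: species "A" has 12 recordings, species "B" only 3.
labels = ["A"] * 12 + ["B"] * 3
kept_species = species_to_keep(labels)

mono = to_mono([(1.0, 3.0), (0.0, 2.0)])            # two stereo frames
short = trim(list(range(10)), sample_rate=2, max_seconds=3)
```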

This new dataset greatly increases the number of species included, from 66 in InsectSet66 to 459 unique species from the groups Orthoptera and Cicadidae, while also strongly increasing the geographic coverage of recording locations. The number of sound examples and the total duration are also greatly expanded, to 26,298 files containing 9.5 days of audio material, with sample rates ranging from 8 to 500 kHz.

The code used to compile this dataset is available on GitHub.

Dataset Usage

All recordings are licensed under Creative Commons licenses (version 4.0 or CC0). We excluded no-derivatives licenses to simplify further usage of this dataset. For machine-learning purposes, the dataset was split into training, validation and test sets while ensuring a roughly equal distribution of audio files and audio material for every species across all three subsets. This resulted in a 60/20/20 split (train/validation/test) by file number and file length.
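For readers who want to build a comparable split on their own data, a per-species 60/20/20 assignment could look roughly like the sketch below. It is not the procedure used for the published split: it balances file counts only and ignores file length, and the function name, the seed and the input format of (file name, species) pairs are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def split_per_species(files, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign each file to train/validation/test, species by species.

    `files` is an iterable of (file_name, species) pairs. Only file counts
    are balanced here; audio duration is not taken into account.
    """
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for name, species in files:
        by_species[species].append(name)

    assignment = {}
    for species, names in sorted(by_species.items()):
        rng.shuffle(names)
        n_train = round(len(names) * fractions[0])
        n_val = round(len(names) * fractions[1])
        for i, name in enumerate(names):
            if i < n_train:
                assignment[name] = "train"
            elif i < n_train + n_val:
                assignment[name] = "validation"
            else:
                assignment[name] = "test"
    return assignment


# Hypothetical input: two species with ten files each.
files = [(f"sp1_{i}.wav", "sp1") for i in range(10)]
files += [(f"sp2_{i}.wav", "sp2") for i in range(10)]
assignment = split_per_species(files)
subset_sizes = Counter(assignment.values())
```

Splitting within each species, instead of over the pooled file list, keeps rare species represented in all three subsets.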

Files

InsectSet459_Train_Val_Test_Annotation.csv

Files (83.7 GB)

Checksum Size
md5:f3cb378e2c3b8ba5704f37ee0bb6d8a6 6.9 MB
md5:e2037e90f51b35eb89bb9cea516c0a6d 16.0 GB
md5:46f535ca43559bcd87b45d6fab189c5f 51.2 GB
md5:3b7fbbb4b2e764a5020f76b7ce1ba9ae 20.5 kB
md5:0027a6ae81bd5b06526339efcfa285d5 16.4 GB
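Given the size of the archives, it may be worth checking each download against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the local file path passed to `verify` depends on where and under which name the file was saved.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return file_md5(path) == expected_hex
```

For example, `verify(local_path, "md5:f3cb378e2c3b8ba5704f37ee0bb6d8a6")` should return True for an intact copy of the 6.9 MB file, where `local_path` is the location of the download. Reading in chunks keeps memory use constant even for the 51.2 GB archive.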

Additional details

Related works

Continues
Dataset: 10.5281/zenodo.8252141 (DOI)
Dataset: 10.5281/zenodo.7072196 (DOI)
Journal article: 10.1371/journal.pcbi.1011541 (DOI)
Is described by
Preprint: 10.48550/arXiv.2503.15074 (DOI)

Dates

Updated
2026-02-09

Software

Repository URL
https://github.com/mariusfaiss/InsectSet459
Programming language
Python