Published January 11, 2023 | Version 1
Dataset Open

A collection of fully-annotated soundscape recordings from neotropical coffee farms in Colombia and Costa Rica

  • 1. Center for Avian Population Studies, Cornell Lab of Ornithology, Cornell University
  • 2. K. Lisa Yang Center for Conservation Bioacoustics, Cornell Lab of Ornithology, Cornell University
  • 3. Laboratorio de Ecología Urbana y Comunicación Animal, Escuela de Biología, Universidad de Costa Rica
  • 4. Laboratorio de Ecología y Evolución de Vertebrados, Instituto de Biología, Universidad de Antioquia

Description

This collection contains 34 hour-long soundscape recordings annotated by expert ornithologists, who provided 6,952 bounding box labels for 89 bird species from Colombia and Costa Rica. The data were recorded in 2019 in two highly diverse neotropical coffee farm landscapes in the towns of Jardín, Colombia and San Ramon, Costa Rica. Part of this collection was featured as test data in the 2021 BirdCLEF competition, and the collection can be used primarily for training and evaluating machine learning algorithms.

Data collection

Monitoring the avifauna of coffee farms is a useful way to measure the impact of sustainability efforts in productive landscapes. Diverse bird communities provide services to coffee farms, and these services are enhanced by increased tree cover (e.g. shade trees, wind breaks) and by the protection of forest remnants within or close to the farms. Our limited knowledge of the relative contributions of different types of tree cover has motivated the scientific community to collect bird data in coffee landscapes and to use bioacoustics to correlate management strategies with the presence or absence of target species. To this end, we collected a set of bird recordings to track the presence of target species on coffee farms. The annotated data set is currently being used to measure the impact of pesticide applications and other management practices (e.g. pruning) by testing for differences in bird activity before and after an intervention. In addition, the bird call annotations help us train machine learning models for monitoring these farms automatically.

Soundscapes for this collection were recorded using SWIFT recorders positioned 3 m above the ground. We recorded one-hour sound files at a 48 kHz sampling rate from 4:30 to 7:30 to capture the most active period of the avian dawn chorus. In the same way, we recorded from 16:00 to 19:00 to capture the second peak in avian acoustic activity, which occurs before sunset, and to include nocturnal species.

All audio was unified, converted to FLAC, and resampled to 32 kHz for this collection. Parts of this dataset have previously been used in the 2021 BirdCLEF competition.
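As a rough illustration of what the resampling step implies for users of the audio, the sketch below derives the rational resampling ratio and the resulting Nyquist limit from the two sample rates stated above. The helper name `resampling_summary` is hypothetical and is not part of any dataset tooling, and the one-hour duration is treated as nominal.

```python
from fractions import Fraction


def resampling_summary(source_hz: int, target_hz: int, duration_s: int = 3600) -> dict:
    """Summarise a sample-rate conversion (hypothetical helper, illustration only)."""
    # Fraction reduces target/source to lowest terms, giving the up/down factors.
    ratio = Fraction(target_hz, source_hz)
    return {
        "up": ratio.numerator,
        "down": ratio.denominator,
        # Highest frequency representable at the target sample rate.
        "nyquist_hz": target_hz // 2,
        # Number of samples in a recording of the given (nominal) duration.
        "samples": target_hz * duration_s,
    }


summary = resampling_summary(48_000, 32_000)
print(summary)
```

Under these assumptions the conversion is a 2/3 rate change, and content above 16 kHz cannot be represented in the 32 kHz files.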

Sampling and annotation protocol

We subsampled data for this collection by randomly selecting recordings from different farm locations and dates.

Using Raven Pro, annotators were asked to create a selection box around every bird call they could recognize, ignoring those that were too faint or unidentifiable. We allowed overlapping selections. Provided labels contain full bird calls that are boxed in time and frequency. Annotators were allowed to combine multiple consecutive calls of the same species into one bounding box label if pauses between calls were shorter than 5 seconds. We converted labels to eBird species codes, following the 2021 eBird taxonomy (Clements list). Unidentifiable calls have been marked with “????” and were added as bounding box labels to the ground truth annotations.
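The merging rule described above can be sketched in code. This is only an illustration of the rule, not the procedure the annotators followed (they drew boxes by hand in Raven Pro, and merging was optional). The `Box` class, the function name, and the placeholder species codes `sp_a` and `sp_b` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    species: str
    start_s: float
    end_s: float
    low_hz: float
    high_hz: float


def merge_consecutive_calls(boxes, max_pause_s=5.0):
    """Merge same-species boxes separated by a pause shorter than max_pause_s."""
    merged = []
    last_by_species = {}  # species -> index in `merged` of its most recent box
    for box in sorted(boxes, key=lambda b: b.start_s):
        idx = last_by_species.get(box.species)
        if idx is not None and box.start_s - merged[idx].end_s < max_pause_s:
            prev = merged[idx]
            # Grow the previous box so it covers both calls in time and frequency.
            merged[idx] = Box(
                box.species,
                prev.start_s,
                max(prev.end_s, box.end_s),
                min(prev.low_hz, box.low_hz),
                max(prev.high_hz, box.high_hz),
            )
        else:
            last_by_species[box.species] = len(merged)
            merged.append(box)
    return merged


calls = [
    Box("sp_a", 0.0, 1.0, 2000, 4000),
    Box("sp_a", 3.0, 4.0, 2500, 5000),    # 2 s after the previous sp_a call
    Box("sp_b", 3.5, 4.5, 1000, 1500),    # other species, may overlap in time
    Box("sp_a", 12.0, 13.0, 2000, 4000),  # 8 s pause, stays separate
]
merged = merge_consecutive_calls(calls)
```

Because boxes of different species are tracked separately, overlapping selections from other species do not interrupt a run of calls.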

Files in this collection

Audio recordings can be accessed by downloading and extracting the “soundscape_data.zip” file. Soundscape recording filenames contain a sequential file ID, site ID, recording date, and timestamp in local time (Costa Rica: GMT-6; Colombia: GMT-5). As an example, the file “NES_001_S01_20190914_043000.flac” has sequential ID 001 and was recorded at site S01 on September 14, 2019 at 04:30:00 local time. Ground truth annotations are listed in “annotations.csv”, where each line specifies the corresponding filename, start and end time in seconds, low and high frequency in Hertz, and an eBird species code. These species codes can be mapped to the scientific and common name of a species with the “species.csv” file. Geographical coordinates of the San Ramon and Jardín regions can be found in the “recording_location.txt” file. Coordinates of the individual recording sites are not included due to data privacy.
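A minimal sketch of parsing the filename convention follows. The regular expression is generalised from the single example filename and may not match every file; the function name is hypothetical. Because the description does not state which site IDs belong to which country, the UTC offset (-6 for Costa Rica, -5 for Colombia) is left for the caller to supply.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern inferred from one example filename (an assumption, not a published spec).
FILENAME_RE = re.compile(
    r"^(?P<prefix>[A-Za-z]+)_(?P<file_id>\d+)_(?P<site>S\d+)"
    r"_(?P<date>\d{8})_(?P<time>\d{6})\.flac$"
)


def parse_soundscape_filename(name, utc_offset_hours=None):
    """Split a soundscape filename into file ID, site ID, and local start time."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"Unexpected filename format: {name!r}")
    start = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    if utc_offset_hours is not None:
        # Attach the caller-supplied local offset, e.g. -5 for Colombia.
        start = start.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return {"file_id": match["file_id"], "site": match["site"], "start": start}


info = parse_soundscape_filename("NES_001_S01_20190914_043000.flac")
print(info["file_id"], info["site"], info["start"].isoformat())
```

Without an offset the returned start time is a naive local timestamp; passing `utc_offset_hours` makes it timezone-aware so that recordings from both countries can be compared on a common clock.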

Acknowledgements 

Compiling this extensive dataset was a major undertaking, and we are very thankful to the domain experts who helped to collect and manually annotate the data for this collection. Specifically, we want to thank (in alphabetical order): Alejandro Quesada, José Castaño, Luis Parra. We also thank Carlos Gamboa-Venegas (RedCONARE, CeNAT), for providing technical assistance for information transfer.

We would also like to acknowledge our funding source: Nespresso AAA Sustainable Quality™ Program.

Files

Files (3.8 GB)

  • md5:f72be6c60b530fb622ea954627edea47 (465.6 kB)
  • md5:ff753e924a649639194b16cb7bbeffa8 (130.4 kB)
  • md5:3b62987e3195f717a381971c9f306f05 (177 Bytes)
  • md5:15a00bb3221e56ec83d38ad04c26ce68 (3.8 GB)
  • md5:4e5d1bbf1ae8124a28fbbef69cdd1441 (4.3 kB)