Published May 6, 2024 | Version v1
Software Open

Vocalization data and scripts to model reindeer rut activity using on-animal acoustic recorders and machine learning

  • 1. Concordia University
  • 2. Norwegian University of Life Sciences
  • 3. Natural Resources Institute Finland

Description

For decades, researchers have employed sound to study the biology of wildlife, with the aim of better understanding their ecology and behaviour. By using on-animal recorders to capture audio from freely moving animals, scientists can decipher vocalizations and, through advanced signal processing, glean insights into behaviour and ecosystem dynamics. However, the laborious task of sorting through extensive audio recordings has been a major bottleneck. To expedite this process, researchers have turned to machine learning techniques, specifically neural networks, to streamline the analysis of data. Nevertheless, much of the existing research has focused predominantly on stationary recording devices, overlooking the potential benefits of employing on-animal recorders in conjunction with machine learning. To showcase the synergy of on-animal recorders and machine learning, we conducted a study at the Kutuharju research station in Kaamanen, Finland, where the vocalizations of rutting reindeer were recorded during their mating season. By attaching recorders to seven male reindeer during the rutting periods of 2019 and 2020, we trained convolutional neural networks to distinguish reindeer grunts with 95% accuracy. This high level of accuracy allowed us to examine the reindeer's grunting behaviour, revealing patterns indicating that older, heavier males vocalized more than their younger, lighter counterparts. The success of this study underscores the potential of on-animal acoustic recorders coupled with machine learning techniques as powerful tools for wildlife research, hinting at their broader applications with further advancement and optimization.

Notes

Funding provided by: Natural Sciences and Engineering Research Council
Crossref Funder Registry ID: https://ror.org/01h531d29
Award Number: 327505

Funding provided by: NordForsk
Crossref Funder Registry ID: https://ror.org/05bqzfg94
Award Number: 76915

Methods

Bioacoustic data were collected during the rutting seasons of 2019 and 2020: vocalizations were captured from two males in 2019 and from six males in 2020. Owing to equipment issues, sampling time varied from three days to two months per animal, and recorder failures limited usable data to seven male reindeer (two from 2019 and five from 2020).

During their translocation, the males were outfitted with on-animal acoustic recorders. These devices housed SOROKA-15E recording units (TS-Market Ltd., Zelenograd, Russia), capable of capturing the animals' vocalizations with an amplitude resolution of 16 bits and a sampling rate of 16 kHz (Figure 2). The recorders enabled us to obtain continuous audio data throughout the breeding period. For storage, each recorder was equipped with a 256-gigabyte microSD card, providing the capacity to record over 92 days of audio. Each recorder was powered by a 9,000-milliamp-hour, 3.6-volt lithium-ion battery, ensuring functionality for over two months.
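
As a sanity check on the reported storage capacity, the back-of-the-envelope calculation below (a minimal sketch, assuming uncompressed 16-bit mono audio; the on-card format is not specified here) reproduces the figure of over 92 days of recording on a 256-gigabyte card.

```python
# Storage-capacity check for the recorders, assuming uncompressed 16-bit mono
# audio at 16 kHz (an assumption about the on-card format).
SAMPLE_RATE_HZ = 16_000       # sampling rate of the SOROKA-15E units
BYTES_PER_SAMPLE = 2          # 16-bit amplitude resolution
CARD_CAPACITY_BYTES = 256e9   # 256-gigabyte microSD card

bytes_per_day = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * 60 * 60 * 24
print(f"{CARD_CAPACITY_BYTES / bytes_per_day:.1f} days")  # -> ~92.6 days
```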

Observers documented each male's status and harem during the ruts. A clear dominance hierarchy, established primarily from win-loss frequencies, was evident in the agonistic interactions recorded among males during the rut. The size of each male's harem was noted throughout these observations. Dominant males commanding a harem were labelled accordingly, whereas those without were categorized as subdominant. Dominance fluctuated for many males depending on the presence or absence of more dominant males, and dominance during the rut was determined mainly through observations. Observers monitored the males' status for about 20% of the rutting period, providing the basis for the audio playbacks: the sounds and behaviours documented during these observed hours served as a model for categorizing the remaining recordings. Each hour of each male's recordings was analyzed through audio playback, with status potentially changing based on the displayed behaviours and grunts. Typically, a male's status remained constant for extended periods, although observed shifts from dominant to subdominant and back to dominant occurred in as little as three days. Notably, males did not continuously alternate statuses during these periods, and when a male lost a harem to another male, the displaced male often remained subdominant for an extended period.

In cases where a male was not observed for periods (as the animals were occasionally unlocatable), assessments of his status were made during playback of his recordings. For instance, if a playback revealed a male grunting with audible background sounds of females and calves, it was inferred that he held a dominant status. Conversely, the presence of rival male grunts directed at the subject male, if met with silence on his part, suggested sub-dominance. If a recording featured no other sounds and the male was not grunting, it was presumed he was in search of a herd and thus subdominant. In instances where a male's status was ambiguous during playback and no observational data were available, his status was marked as 'unknown,' and confirmation was deferred until more definitive evidence could be obtained.

We employed Sonic Visualiser (Cannam et al., 2010) for the analysis and annotation of the recordings. Each recording was annotated using the "boxes layer" feature, which delineates intentional and unintentional noises via bounding boxes. This allowed the start and end times of each sound to be documented, along with the minimum and maximum frequency of each vocalization. A binary label (presence or absence) was assigned to each bounding box. For the presence class, bounding boxes were positioned around individual and repeating vocalizations to capture the grunting behaviours of the reindeer. Incidental calls from other reindeer were intentionally left unannotated to prevent an excessive impact on CNN training from over-sampling the activities of the focal individuals. Bounding boxes of varying lengths were used to annotate the absence class throughout each recording, ensuring comprehensive CNN training by encompassing a wide range of unintentional sounds (biophony, geophony, and anthropophony). These boxes were positioned between presence segments to capture a wide range of sounds, and we made a conscious effort to exclude segments without any noise to enhance the precision of the final classifier. Furthermore, within the absence class, vocalizations of any species overlapping with the males' frequency range were annotated. Subsequently, Sonic Visualiser was used to validate the vocalization predictions made by the CNN and to eliminate any false positive or false negative predictions.

Machine Learning methodology and process:

We utilized a supervised learning approach to train our CNNs, employing code and methodology adapted from Dufourq et al. (2022). Initially, we annotated 135 audio segments from our 2020 recordings: 25 recordings from each of five individuals and 10 from a sixth. Unfortunately, due to technical issues with the sixth individual's recorder, we had to exclude its annotations, as their inclusion decreased the network's performance. Owing to limitations in computer hardware, we worked with a subset of the remaining 125 annotated audio segments, specifically chosen to encompass a wide range of recordings across individuals, environments, and weather conditions. Consequently, our 2020 training set comprised 8,605 presence segments (augmented to 14,000 by time-shifting existing presence annotations) and 18,000 absence segments, of which 14,000 were randomly sampled to ensure a balanced dataset. This preliminary network was then used to extract vocalizations from our 2019 recordings.
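
The time-shifting augmentation could be realized along the lines of the sketch below (an illustration only, not the authors' code; the annotation format, function name, and maximum shift are assumptions): an existing presence bounding box is copied and offset by a random amount so the grunt falls at a different position within the training window.

```python
import numpy as np

def time_shift_annotation(start_s, end_s, file_duration_s, max_shift_s=1.0, rng=None):
    """Return a new (start, end) pair obtained by randomly shifting an existing
    presence annotation in time while keeping it inside the recording."""
    rng = rng or np.random.default_rng()
    duration = end_s - start_s
    shift = rng.uniform(-max_shift_s, max_shift_s)
    new_start = min(max(start_s + shift, 0.0), file_duration_s - duration)
    return new_start, new_start + duration

# Example: augment one annotation from an 8-hour (28,800 s) file.
print(time_shift_annotation(1250.0, 1252.3, 28_800.0))
```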

After annotating the recordings, we searched for appropriate hyper-parameters with which to train our CNN. After experimenting with over 40 different hyper-parameter combinations, the values outlined in Table 2 yielded the best-performing network. CNNs require fixed-size inputs, and although the duration of grunt series varied considerably (from a fraction of a second to over 30 seconds), window lengths exceeding four seconds did not improve performance. Since grunt variation within a series of grunts is minimal, segments longer than four seconds only increased computation time without any performance gain. During our hyper-parameter search, we also determined the minimum and maximum call frequency values. Given that the fundamental frequency of reindeer grunts falls below 100 Hz, we set the minimum frequency to 0 Hz, while the maximum frequency was set to 4000 Hz. Although the formant frequencies of reindeer grunts tend to be indiscernible above 2500 Hz, setting the maximum frequency to 4000 Hz enhanced network performance, likely because the supplementary information aided in distinguishing grunts from unintentional noises. Setting the maximum frequency above 4000 Hz, however, did not improve network performance and only increased computation time.

Four pre-processing steps were then conducted to prepare the inputs for the CNNs, mirroring those detailed in Dufourq et al. (2022). First, a low-pass filter was applied to each audio file to retain signals below its cut-off frequency, minimizing aliasing artifacts that can arise from downsampling. The filter's cut-off frequency was determined from the maximum frequency of the males' vocalizations in our presence annotations. The grunts of the males became indiscernible beyond 1000 Hz in Sonic Visualiser, with only occasional louder calls detectable beyond this point, prompting the low-pass filter's cut-off to be set at 1000 Hz.
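
A minimal sketch of this filtering step is given below, using Librosa to read the audio and a SciPy Butterworth filter with a 1000 Hz cut-off (the filter type and order are assumptions; the file name is hypothetical).

```python
import librosa
from scipy.signal import butter, sosfiltfilt

# Load the recording at its native sampling rate (16 kHz for these recorders).
audio, sr = librosa.load("recording.wav", sr=None)   # hypothetical file name

# Zero-phase low-pass filter with a 1000 Hz cut-off, applied before downsampling.
sos = butter(N=10, Wn=1000, btype="lowpass", fs=sr, output="sos")
audio_lp = sosfiltfilt(sos, audio)
```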

Subsequently, downsampling was performed on each audio file to enhance computational efficiency, given that higher frequencies were deemed unnecessary for the analysis. Retaining frequencies beyond 4000 Hz offered no advantage in network performance and only increased computational demands. Therefore, the Nyquist frequency was set at 4000 Hz and the downsampled sampling rate at twice that value, 8000 Hz (Dufourq et al., 2022).
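
In Librosa this downsampling amounts to a single resampling call, as sketched below (continuing from the filtering sketch above; an assumed workflow rather than the authors' script).

```python
import librosa

# Resample the low-pass-filtered signal from 16 kHz to 8 kHz, i.e. twice the
# 4000 Hz Nyquist frequency used for the spectrograms.
audio_8k = librosa.resample(audio_lp, orig_sr=sr, target_sr=8000)
```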

Next, annotations for both classes were extracted using a sliding-window approach. Each audio file was segmented into four-second windows, beginning at a bounding box's start time, and spectrograms were generated and labelled accordingly (presence or absence). The sliding window advanced until it overlapped with the bounding box's end time, and this process was applied to all bounding boxes across the entire dataset.
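
A sketch of this extraction step follows (the one-second hop and the exact stopping rule are assumptions; names are illustrative).

```python
def extract_windows(audio, sr, box_start_s, box_end_s, window_s=4.0, hop_s=1.0):
    """Cut fixed-length, four-second windows from an annotated bounding box,
    starting at the box's start time and advancing until the window reaches
    the box's end time."""
    n = int(window_s * sr)
    windows = []
    t = box_start_s
    while t < box_end_s:
        segment = audio[int(t * sr): int(t * sr) + n]
        if len(segment) == n:              # discard windows running past the file end
            windows.append(segment)
        t += hop_s
    return windows
```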

Lastly, the audio segments were transformed into two-dimensional mel-frequency spectrograms. The specific values associated with this transformation are outlined in Table 2.
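
This transformation can be sketched with Librosa as follows (the FFT, hop, and mel-band values below are placeholders for those in Table 2; the file name is hypothetical).

```python
import librosa
import numpy as np

audio, sr = librosa.load("segment.wav", sr=8000)         # hypothetical 4-s segment
mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                     n_fft=1024, hop_length=256, n_mels=128,
                                     fmin=0, fmax=4000)   # placeholder parameters
mel_db = librosa.power_to_db(mel, ref=np.max)             # log-scaled CNN input
```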

During our training process, we evaluated the various pre-trained models detailed in Dufourq et al. (2022), and among them the ResNet152V2 model (He et al., 2016) exhibited the best performance. We fine-tuned both the feature extractor and the output layer to achieve the best possible model; although this increased computation time, it enhanced the network's performance. Following Dufourq et al. (2022), and because pre-trained models require a three-channel input (typically corresponding to the three channels of a colour image), we implemented the exponent method described in that study, using the S1, S3, and S5 channels to generate our three spectrogram channels in line with the methodology outlined above.
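
The sketch below illustrates this setup, assuming that S1, S3, and S5 denote the spectrogram raised to the powers 1, 3, and 5 (our reading of the exponent method in Dufourq et al. (2022)); the input shape is a placeholder, not the value used in the study.

```python
import numpy as np
import tensorflow as tf

def to_three_channels(spec):
    """Stack power-scaled copies of a min-max-normalized spectrogram as three
    channels (S1, S3, S5), giving the three-channel input that pre-trained
    image networks expect."""
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)
    return np.stack([spec ** 1, spec ** 3, spec ** 5], axis=-1)

# ResNet152V2 feature extractor with a two-class softmax head; both the
# extractor and the head are left trainable for fine-tuning.
base = tf.keras.applications.ResNet152V2(include_top=False, weights="imagenet",
                                         input_shape=(128, 126, 3), pooling="avg")
model = tf.keras.Sequential([base, tf.keras.layers.Dense(2, activation="softmax")])
```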

To determine a spectrogram's classification, the CNN produced two softmax outputs for each spectrogram within a testing file, and the final classification (presence or absence) was assigned to the class whose softmax output exceeded 0.5. Each file was processed using a four-second sliding window that shifted one second at a time until the entire recording had been classified.
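
A sketch of this prediction loop is shown below (the `spectrogram()` helper is hypothetical and stands for the same pre-processing used for training, i.e. mel-spectrogram conversion and channel stacking).

```python
import numpy as np

def predict_file(audio, sr, model, window_s=4.0, hop_s=1.0, threshold=0.5):
    """Slide a four-second window over a recording in one-second steps and
    record the start times of windows classified as presence."""
    n = int(window_s * sr)
    detections = []
    for start in range(0, len(audio) - n + 1, int(hop_s * sr)):
        spec = spectrogram(audio[start:start + n])        # hypothetical helper
        probs = model.predict(spec[np.newaxis, ...], verbose=0)[0]
        if probs[1] > threshold:                          # presence softmax output
            detections.append(start / sr)
    return detections
```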

After training the preliminary CNN, we used it to collect vocalizations from the 2019 recordings and verified the presence annotations manually to ensure accuracy and eliminate false positives. We also included absence annotations to provide the network with additional training data. We then used a subset of our 2020 annotations and all annotations from our 2019 recordings to train a final CNN. This final model was trained on a dataset of 10,778 presence and 11,546 absence annotations over 25 epochs and was then used to detect reindeer vocalizations across the entire dataset. The number of files per individual (by age) was: 1.5-year-old, 11 files; 2.5-year-old, 58 files; 3.5-year-old (1), 22 files; 3.5-year-old (2), 53 files; 3.5-year-old (3), 21 files; 4.5-year-old, 28 files; and 5.5-year-old, 93 files (each file was eight hours long). Some files were shorter than eight hours because a recorder failed during recording.

To evaluate the network's performance, we conducted tests using 12 audio files, none of which were used during the network's training phase. These files represented individuals from across the rut and various rutting behaviours; for instance, some files contained minimal vocalizations, while others featured over 250 instances. Eight of these recordings were from the 2020 data and the remaining four from the 2019 data. To assess performance, we used recall, precision, accuracy, and F1 score (Mesaros et al., 2016; Navarro et al., 2017; Equations A1-4).
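
The four metrics follow their standard definitions from the confusion-matrix counts, as in the sketch below (Equations A1-4 themselves are given in the paper's appendix).

```python
def evaluate(tp, fp, tn, fn):
    """Recall, precision, accuracy, and F1 score from true/false positive and
    negative counts (standard definitions)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, accuracy, f1
```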

To train the networks, we utilized the packages listed in Table A1. The script was executed in Python 3, and the CNNs were implemented in TensorFlow 2 (Abadi et al., 2016). Each CNN was trained using the Adam optimizer with a batch size of 32 (Kingma & Ba, 2014). Spectrograms were generated using the Librosa library (McFee et al., 2020). Model training and testing took place on a 2021 Apple MacBook Pro equipped with an Apple M1 Pro processor and 16 GB of LPDDR5 RAM, running macOS Ventura 13.1.
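
A minimal sketch of this training configuration in TensorFlow 2 (Adam optimizer, batch size 32) is given below; `model` is the classifier from the earlier sketch, and the random arrays are placeholders for the real spectrogram dataset.

```python
import numpy as np
import tensorflow as tf

x_train = np.random.rand(64, 128, 126, 3).astype("float32")           # placeholder data
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 2, 64))  # placeholder labels

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=25)
```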

Files (2.0 GB)

0.5-Sample-Training-and-predicting_example.zip

md5:14e2ee30068c7edc2bbc1d24e8f6ffa8 (644.9 MB)
md5:f8bbeacf1a2cee66bdf56d2c6e002b8d (651.4 MB)
md5:1bb558da67d30dcfcb5c4afb7a383e16 (651.4 MB)
md5:4e1fd5f0188894aee7093ca68a53de0f (12.1 kB)
md5:0e7f3de56e99d82dc329c3b27fc9ce8b (21.6 MB)

Additional details

Related works

Is source of
10.5061/dryad.w6m905qx8 (DOI)