Sample Data for Clustering Protocol
Title: A Novel Protocol for Exploratory Analysis of Unknown Sound-Types in Large Acoustic Datasets
Updated: April 2025

------------------------------------------------------------------
Overview

This folder provides the R scripts, a sample dataset, and supporting files for running the ecoacoustic clustering protocol described in the manuscript. It includes 30 real WAV files stored in a pre-created "data" folder.

The full protocol is split into two parts:

Iteration 1 performs an initial round of clustering across all input files.
Iteration 2 refines a single cluster, which may be useful if background noise or overlapping sounds are still grouped together.

Both iterations end with an optional sampling step that selects a subset (X%) of files from each cluster for manual verification, where X is set by the user.

------------------------------------------------------------------
Structure of the Sample Dataset

The dataset contains four distinct sound-types. One cluster was intentionally designed to contain overlapping sound-types, to demonstrate how the second iteration of clustering can be used to separate files. Each WAV file in the sample is named after its corresponding sound-type.

Because clustering is an unsupervised method, the number assigned to the cluster that contains this overlap may differ depending on platform or parameter changes. To identify which cluster requires refinement:

1. Open the cluster-results-.xlsx file generated by Iteration 1.
2. Use the filenames (which include the sound-type in their name) to determine which cluster contains a mix of different sound events.
3. Update the following line in clustering-protocol-iteration-2.R to reflect that cluster number:

   clustering-protocol-iteration-2.R, L28:
   cluster_id <- [insert correct cluster number]

------------------------------------------------------------------
File - description

data                               - Folder containing 30 real WAV files for clustering
sample_file_names.csv              - List of WAV files used in the matrix
clustering-protocol-iteration-1.R  - Script for initial clustering
clustering-protocol-iteration-2.R  - Script for optional cluster refinement

------------------------------------------------------------------
Notes on Parameters

The protocol uses the Kolmogorov–Smirnov (KS) distance as the beta acoustic index. However, any beta index tailored to ecoacoustic data can be substituted. See Sueur (2018) for further examples of beta acoustic indices and how to use them in R.

Sueur, J. (2018) Sound analysis and synthesis with R.

The window length used is 2048 samples, which prioritises frequency resolution. You can change this value in the script based on your needs (e.g., 512 for better time resolution).

------------------------------------------------------------------
Manual Review (Sampling)

Each iteration of the protocol includes an optional sampling module at the end of the script. It selects a defined percentage of WAV files per cluster (default 10%, or a minimum number set by the user) to support manual review, validation, or downstream annotation. You may skip this step or adjust the sampling percentage as needed.
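
The sampling step described under Manual Review can be sketched in base R as follows. This is a minimal illustration, not the code from the protocol scripts: the `results` table, the function name `sample_per_cluster`, and its arguments `pct`, `min_n`, and `seed` are all hypothetical.

```r
# Hypothetical cluster assignments; in practice these would come from the
# results produced by Iteration 1 or Iteration 2.
results <- data.frame(
  file    = sprintf("file_%02d.wav", 1:30),
  cluster = rep(1:3, times = c(15, 10, 5)),
  stringsAsFactors = FALSE
)

# Draw pct% of the files in each cluster (rounded up), but never fewer than
# min_n and never more than the cluster actually contains.
sample_per_cluster <- function(results, pct = 10, min_n = 1, seed = 42) {
  set.seed(seed)  # fixes the random draw so the review set is reproducible
  picks <- lapply(split(results, results$cluster), function(cl) {
    n <- max(min_n, ceiling(nrow(cl) * pct / 100))
    n <- min(n, nrow(cl))
    cl[sample.int(nrow(cl), n), , drop = FALSE]
  })
  out <- do.call(rbind, picks)
  rownames(out) <- NULL
  out
}

review_set <- sample_per_cluster(results, pct = 10, min_n = 2)
print(table(review_set$cluster))
```

Rounding up and enforcing a minimum keeps small clusters from being skipped entirely, since 10% of a five-file cluster would otherwise round to zero files.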
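
Steps 1 and 2 under Structure of the Sample Dataset (finding the cluster that mixes sound-types) can also be done programmatically instead of by eye. The sketch below assumes filenames of the form "<sound-type>_<number>.wav" and a table with `file` and `cluster` columns; both are assumptions for illustration, so adjust the pattern and column names to match the actual Iteration 1 output.

```r
# Hypothetical Iteration 1 output; in practice this table would be loaded
# from the results spreadsheet written by the Iteration 1 script.
results <- data.frame(
  file    = c("typeA_01.wav", "typeA_02.wav", "typeB_01.wav",
              "typeB_02.wav", "typeC_01.wav", "typeD_01.wav"),
  cluster = c(1, 1, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Strip the assumed "_<number>.wav" suffix to recover the sound-type label.
results$sound_type <- sub("_[0-9]+\\.wav$", "", results$file)

# Cross-tabulate clusters (rows) against sound-types (columns).
counts <- table(results$cluster, results$sound_type)
print(counts)

# A cluster is treated as mixed if it contains more than one sound-type.
types_per_cluster <- rowSums(counts > 0)
mixed_clusters <- as.integer(names(types_per_cluster)[types_per_cluster > 1])
print(mixed_clusters)
```

The value in `mixed_clusters` is the number to assign to `cluster_id` in clustering-protocol-iteration-2.R. In this toy table it is cluster 3, which holds both typeC and typeD files.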
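
For readers unfamiliar with the quantities mentioned under Notes on Parameters, the sketch below shows one common definition of the KS distance between two spectra (the largest gap between their cumulative, sum-normalised forms) and the resolution trade-off behind the window length. The function `ks_distance`, the toy spectra, and the 44100 Hz sampling rate are assumptions for illustration; the protocol scripts may compute the index differently.

```r
# KS distance between two amplitude spectra of equal length: normalise each
# to sum to 1, accumulate, and take the largest absolute difference.
ks_distance <- function(spec1, spec2) {
  stopifnot(length(spec1) == length(spec2))
  c1 <- cumsum(spec1 / sum(spec1))
  c2 <- cumsum(spec2 / sum(spec2))
  max(abs(c1 - c2))
}

# Toy spectra: energy concentrated at low vs. high frequency bins.
spec_low  <- c(4, 3, 2, 1, 0)
spec_high <- c(0, 1, 2, 3, 4)
d <- ks_distance(spec_low, spec_high)
print(d)

# Window-length trade-off, assuming a 44100 Hz sampling rate.
fs <- 44100
wl <- c(512, 2048)
resolution <- data.frame(
  wl          = wl,
  freq_res_hz = fs / wl,         # width of one frequency bin
  time_res_ms = 1000 * wl / fs   # duration of one analysis window
)
print(resolution)
```

Under these assumptions a window of 2048 samples gives frequency bins about four times narrower than a window of 512, at the cost of analysis windows about four times longer.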