# MAESTRO Synthetic - Multiple Annotator Estimated STROng labels

Machine Listening Group, Tampere University

Authors:

- Irene Martin Morato
- Manu Harju
- Annamaria Mesaros

## 1. Dataset

MAESTRO Synthetic contains 20 synthetic audio files created using Scaper, each of them 3 minutes long. The dataset was created for studying annotation procedures for strong labels using crowdsourcing.

The audio files contain sounds from the following classes:

- car_horn
- children_voices
- dog_bark
- engine_idling
- siren
- street_music

The audio files contain excerpts of recordings uploaded to freesound.org. Please see FREESOUNDCREDITS.txt for an attribution list.

The audio files were generated using Scaper, with small changes to the synthesis procedure: sounds were placed at random intervals, controlling for a maximum polyphony of 2. The interval between two consecutive events was selected at random, but limited to 2-10 seconds. Event classes and event instances were chosen uniformly, and mixed over a Brownian noise background at a signal-to-noise ratio (SNR) selected at random between 0 and 20 dB. Overlap between two events from the same class was avoided.

### Annotation procedure

For annotation, each 3-minute file was split into 10-second segments with a hop of one second. Each segment was annotated using crowdsourcing, in a tagging scenario: for each segment, the annotators were required to select, from the given list of classes, the sounds that are active (audible). Each 10-s segment was annotated by five persons.

Full details on the annotation procedure and the processing of the tags can be found in:

Irene Martin Morato, Manu Harju, and Annamaria Mesaros. Crowdsourcing strong labels for sound event detection. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2021), New Paltz, NY, Oct 2021.

### Dataset content

The dataset contains:

- audio: the 20 synthetic soundscapes, each 3 min long
- ground truth: the "true" reference annotation created using Scaper, in jams (complete) and txt (simplified) format
- raw annotations: the complete data as annotated by multiple MTurk workers
- estimated audio tags: tags per 10-s segment, aggregated from the multiple opinions (MACE in the paper)
- estimated strong labels: the outcome of the estimation method (MACE method in the paper)

### Files correspondence

Each 3-minute file was split into 10-s segments with a 1-s hop, so each 180-s file yields 171 segments (180 - 10 + 1). For example, scape_00.wav contains the segments 000000.wav - 000170.wav. The correspondence between them is as follows: 000000.wav starts at offset 0, 000001.wav starts at offset 1 s, 000002.wav starts at offset 2 s, etc.

Full list:

- scape_00: 000000.wav - 000170.wav
- scape_01: 000171.wav - 000341.wav
- scape_02: 000342.wav - 000512.wav
- scape_03: 000513.wav - 000683.wav
- scape_04: 000684.wav - 000854.wav
- scape_05: 000855.wav - 001025.wav
- scape_06: 001026.wav - 001196.wav
- scape_07: 001197.wav - 001367.wav
- scape_08: 001368.wav - 001538.wav
- scape_09: 001539.wav - 001709.wav
- scape_10: 001710.wav - 001880.wav
- scape_11: 001881.wav - 002051.wav
- scape_12: 002052.wav - 002222.wav
- scape_13: 002223.wav - 002393.wav
- scape_14: 002394.wav - 002564.wav
- scape_15: 002565.wav - 002735.wav
- scape_16: 002736.wav - 002906.wav
- scape_17: 002907.wav - 003077.wav
- scape_18: 003078.wav - 003248.wav
- scape_19: 003249.wav - 003419.wav
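As an illustration of this indexing, here is a minimal sketch (not part of the dataset tools) that maps a global segment index to its parent soundscape and start offset. It assumes 171 segments per soundscape (180 s file, 10 s window, 1 s hop) and the file naming used in the list above.

```python
# Illustrative sketch only; the constant and the naming pattern are
# taken from the "Files correspondence" list above.
SEGMENTS_PER_SCAPE = 171  # 180 s - 10 s window + 1 s hop = 171 segments


def segment_to_scape(segment_index: int):
    """Return (soundscape filename, start offset in seconds) for a segment id."""
    scape_idx, offset = divmod(segment_index, SEGMENTS_PER_SCAPE)
    return f"scape_{scape_idx:02d}.wav", offset


# Example: segment 000172 is the second segment of scape_01,
# starting at offset 1 s.
print(segment_to_scape(172))  # ('scape_01.wav', 1)
```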
### File structure

```
dataset root
│   README.md                     this file
│   FREESOUNDCREDITS.txt          information on the individual sound examples used in the data
│   files_mapping.csv             mapping between freesound id and the sound instances extracted from it;
│                                 format: file.wav [tab] label [tab] saliency [tab] freesound_id [tab] start_time [tab] end_time
│
└───audio
│   │   scape00.wav
│   │   scape01.wav
│   │   ...
│
└───estimated_strong_labels       outcome of the method (using MACE)
│   │   mturk_scape00.csv         format: start_time [tab] end_time [tab] label
│   │   mturk_scape01.csv
│   │   ...
│
└───scaper_reference              ground truth created with Scaper (annotations, output from Scaper)
│   │   scape00.jams
│   │   scape00.txt
│   │   scape01.jams
│   │   scape01.txt
│   │   ...
│
└───tags
    │   MAESTRO_full_annotations.yaml   complete multi-annotator tags for all 10-s segments
    │   MAESTRO_labels_mace100.csv      aggregated tags per segment, based on the multiple annotations (using MACE);
                                        format: filename [tab] tag1,tag2,...
```

## 2. License

The license permits free academic usage. Any commercial use is strictly prohibited. For commercial use, contact the dataset authors.

Copyright (c) 2020 Tampere University and its licensors

All rights reserved.

Permission is hereby granted, without written agreement and without license or royalty fees, to use and copy the MAESTRO Synthetic - Multi Annotator Estimated Strong Labels (“Work”) described in this document and composed of audio and metadata. This grant is only for experimental and non-commercial purposes, provided that the copyright notice in its entirety appears in all copies of this Work, and the original source of this Work (MAchine Listening Group at Tampere University) is acknowledged in any publication that reports research using this Work. Any commercial use of the Work or any part thereof is strictly prohibited. Commercial use includes, but is not limited to:

- selling or reproducing the Work
- selling or distributing the results or content achieved by use of the Work
- providing services by using the Work

IN NO EVENT SHALL TAMPERE UNIVERSITY OR ITS LICENSORS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OF THIS WORK AND ITS DOCUMENTATION, EVEN IF TAMPERE UNIVERSITY OR ITS LICENSORS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. TAMPERE UNIVERSITY AND ALL ITS LICENSORS SPECIFICALLY DISCLAIM ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE WORK PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND TAMPERE UNIVERSITY HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
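To illustrate the annotation formats, here is a minimal, illustrative loader (not part of the dataset) for the event lists. It assumes the tab-separated start_time/end_time/label layout stated for the estimated strong labels under "File structure"; using the same column order for the simplified Scaper txt files is an assumption based on Scaper's default simplified output.

```python
import csv


def load_events(path):
    """Read tab-separated (start_time, end_time, label) rows into a list."""
    events = []
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 3:
                continue  # skip empty or malformed lines
            start, end, label = row[0], row[1], row[2]
            events.append((float(start), float(end), label))
    return events


# Hypothetical paths, following the directory layout shown above.
reference = load_events("scaper_reference/scape00.txt")
estimated = load_events("estimated_strong_labels/mturk_scape00.csv")
print(f"{len(reference)} reference events, {len(estimated)} estimated events")
```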