Yadav, Sarthak
Foster, Mary Ellen
2021-03-23
<p>GISE-51 is an open dataset of 51 isolated sound events based on the FSD50K dataset. The release also includes the GISE-51-Mixtures subset, a dataset of 5-second soundscapes with up to three sound events synthesized from GISE-51. The GISE-51 release attempts to address some of the shortcomings of recent sound event datasets, providing an open, reproducible benchmark for future research as well as the freedom to adapt the included isolated sound events for domain-specific applications, which was not possible with existing large-scale weakly labelled datasets. The GISE-51 release also includes accompanying code for baseline experiments, which can be found at <a href="https://github.com/SarthakYadav/GISE-51-pytorch">https://github.com/SarthakYadav/GISE-51-pytorch</a>.</p>
<p><strong>Citation</strong></p>
<p>If you use the GISE-51 dataset and/or the released code, please cite our <a href="https://arxiv.org/abs/2103.12306">paper</a>:</p>
<blockquote>
<p>Sarthak Yadav and Mary Ellen Foster, "GISE-51: A scalable isolated sound events dataset", arXiv:2103.12306, 2021</p>
</blockquote>
<p>Since GISE-51 is based on FSD50K, if you use GISE-51 please also cite the <a href="https://arxiv.org/abs/2010.00475">FSD50K paper</a>:</p>
<blockquote>
<p>Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.</p>
</blockquote>
<p><strong>About GISE-51 and GISE-51-Mixtures</strong></p>
<p>The following sections summarize key characteristics of the GISE-51 and GISE-51-Mixtures datasets, including details omitted from the paper.</p>
<p><strong>GISE-51</strong></p>
<ul>
<li>Three subsets: train, val and eval, with 12465, 1716 and 2176 utterances, respectively. The subsets are consistent with the FSD50K release.</li>
<li>Encompasses 51 sound classes from the FSD50K release</li>
<li>View <code>meta/lbl_map.csv</code> for the complete vocabulary.</li>
<li>The dataset was obtained from FSD50K using the following steps:
<ul>
<li>Unsmearing annotations to obtain isolated instances with a single label, using the metadata and ground truth provided in FSD50K.</li>
<li>Manual inspection to qualitatively evaluate shortlisted utterances. </li>
<li>Volume-threshold based automated silence filtering using <em>sox</em>. Different volume thresholds were selected for the various sound event class bins by trial and error; <code>silence_thresholds.txt</code> lists the class bins and their corresponding volume thresholds. Files that <em>sox</em> determined to contain no audio at all were clipped manually. Code for performing silence filtering can be found in <code>scripts/strip_silence_sox.py</code> in the code repository.</li>
<li>Re-evaluating sound event classes, removing those with too few samples and merging those with high inter-class ambiguity.</li>
</ul>
</li>
</ul>
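<p>As an illustration of the silence-filtering step, the sketch below builds a <em>sox</em> command that trims leading and trailing audio below a given volume threshold. This is a minimal sketch only: the per-bin thresholds come from <code>silence_thresholds.txt</code>, the authoritative implementation is <code>scripts/strip_silence_sox.py</code> in the code repository, and the function names here are hypothetical.</p>

```python
import subprocess

def sox_strip_silence_cmd(in_path, out_path, vol_threshold="1%"):
    """Build a sox command that trims leading and trailing silence.

    The reverse/silence/reverse idiom trims trailing silence by
    reversing the audio, trimming leading silence, then reversing back.
    """
    return [
        "sox", in_path, out_path,
        "silence", "1", "0.1", vol_threshold,  # trim leading silence
        "reverse",
        "silence", "1", "0.1", vol_threshold,  # trim trailing silence (now leading)
        "reverse",
    ]

def strip_silence(in_path, out_path, vol_threshold="1%"):
    """Run the sox command; requires sox to be installed and on PATH."""
    subprocess.run(sox_strip_silence_cmd(in_path, out_path, vol_threshold),
                   check=True)
```

<p>The threshold (e.g. <code>1%</code>) is the relative volume below which samples count as silence; picking it per class bin is what the trial-and-error step above refers to.</p>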
<p><strong>GISE-51-Mixtures</strong></p>
<ul>
<li>Synthetic 5-second soundscapes with up to 3 events created using Scaper.</li>
<li>Weighted sampling with replacement for sound event selection, effectively oversampling events with very few samples. The synthetic soundscapes thus generated have a near-equal number of annotations per sound event.</li>
<li>The <em>val</em> and <em>eval</em> sets contain 10,000 soundscapes each.</li>
<li>The final <em>train</em> set contains 60,000 soundscapes. We also provide training sets ranging from 5k to 100k soundscapes.</li>
<li>GISE-51-Mixtures is our proposed subset that can be used to benchmark the performance of future works.</li>
</ul>
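<p>The weighted sampling scheme above can be sketched as follows: each isolated event is drawn with replacement, weighted inversely by its class frequency, so classes with few clips are oversampled and every class receives a near-equal number of selections in expectation. This is an illustrative sketch with hypothetical names, not the exact sampler used to build GISE-51-Mixtures.</p>

```python
import random
from collections import Counter

def balanced_sample(event_pool, num_draws, seed=0):
    """Sample (label, clip) pairs with replacement, weighting each clip
    inversely to its class frequency so rare classes are oversampled.

    event_pool: list of (label, clip_path) tuples.
    Returns a list of num_draws (label, clip_path) tuples.
    """
    rng = random.Random(seed)
    class_counts = Counter(label for label, _ in event_pool)
    weights = [1.0 / class_counts[label] for label, _ in event_pool]
    return rng.choices(event_pool, weights=weights, k=num_draws)
```

<p>For example, with a pool of 90 clips of one class and 10 of another, roughly half the draws come from each class, despite the 9:1 imbalance in the pool.</p>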
<p><strong>LICENSE</strong></p>
<p>All audio clips (i.e., those found in <code>isolated_events.tar.gz</code>) used in the preparation of the Glasgow Isolated Events Dataset (GISE-51) are designated Creative Commons and were obtained from FSD50K. The source data in <code>isolated_events.tar.gz</code> is based on the FSD50K dataset, which is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.</p>
<p>The GISE-51 dataset (including GISE-51-Mixtures) is a curated, processed and generated preparation, and is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) License. The license is specified in the <code>LICENSE-DATASET</code> file in <code>license.tar.gz</code>.</p>
<p><strong>Baselines</strong></p>
<p>Several sound event recognition experiments were conducted, establishing baseline performance on several prominent convolutional neural network architectures. The experiments are described in Section 4 of our paper, and the implementation for reproducing these experiments is available at <a href="https://github.com/SarthakYadav/GISE-51-pytorch">https://github.com/SarthakYadav/GISE-51-pytorch</a>.</p>
<p><strong>Files</strong></p>
<p>GISE-51 is available as a collection of several tar archives. All audio files are PCM 16-bit, 22050 Hz. The following lists the contents of these archives in detail:</p>
<ul>
<li><code>isolated_events.tar.gz</code>: The core GISE-51 isolated events dataset, containing <em>train</em>, <em>val</em> and <em>eval</em> subfolders.</li>
<li><code>meta.tar.gz</code>: contains <code>lbl_map.json</code></li>
<li><code>noises.tar.gz</code>: contains background noises used for GISE-51-Mixtures soundscape generation</li>
<li><code>mixtures_jams.tar.gz</code>: This file contains annotation files in <code>.jams</code> format that, alongside <code>isolated_events.tar.gz</code> and <code>noises.tar.gz</code>, can be used to regenerate the exact GISE-51-Mixtures soundscapes. (Optional; we provide the complete set of GISE-51-Mixtures soundscapes as independent tar archives.)</li>
<li><code>train.tar.gz</code>: GISE-51-Mixtures train set, containing 60k synthetic soundscapes.</li>
<li><code>val.tar.gz</code>: GISE-51-Mixtures val set, containing 10k synthetic soundscapes.</li>
<li><code>eval.tar.gz</code>: GISE-51-Mixtures eval set, containing 10k synthetic soundscapes.</li>
<li><code>train_*.tar.gz</code>: Tar archives containing training sets with varying numbers of soundscapes, used primarily in Section 4.1 of the paper, which compares <em>val</em> mAP performance versus the number of training soundscapes. A helper script, <code>prepare_mixtures_lmdb.sh</code>, is provided in the code release to prepare data for the experiments in Section 4.1.</li>
<li><code>pretrained-models.tar.gz</code>: Contains model checkpoints for all experiments conducted in the paper. More information on these checkpoints can be found in the code release README.
<ul>
<li><em>experiments_60k_mixtures</em>: model checkpoints from section 4.2 of the paper.</li>
<li><em>exported_weights_60k</em>: ResNet-18 and EfficientNet-B1 exported as plain <code>state_dicts</code> for use with transfer learning experiments.</li>
<li><em>experiments_audioset</em>: checkpoints from the AudioSet Balanced experiments (Section 4.3.1).</li>
<li><em>experiments_vggsound</em>: checkpoints from Section 4.3.2 of the paper.</li>
<li><em>experiments_esc50</em>: ESC-50 dataset checkpoints, from Section 4.3.3.</li>
</ul>
</li>
<li><code>license.tar.gz</code>: contains dataset license info.</li>
<li><code>silence_thresholds.txt</code>: contains volume thresholds for various sound event bins used for silence filtering.</li>
</ul>
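<p>Since all audio is stated to be PCM 16-bit at 22050 Hz, a quick sanity check on extracted WAV files can be done with Python's standard-library <code>wave</code> module. This is an illustrative sketch (the function name is hypothetical, and the check covers only sample rate and bit depth):</p>

```python
import wave

EXPECTED_RATE = 22050  # Hz, per the dataset description
EXPECTED_WIDTH = 2     # bytes per sample -> 16-bit PCM

def matches_dataset_format(path):
    """Return True if the WAV file at `path` is 16-bit PCM at 22050 Hz."""
    with wave.open(path, "rb") as wf:
        return (wf.getframerate() == EXPECTED_RATE
                and wf.getsampwidth() == EXPECTED_WIDTH)
```

<p>Running this over the extracted archives is a cheap way to confirm nothing was resampled or transcoded during download and extraction.</p>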
<p><strong>Contact</strong></p>
<p>In case of queries and clarifications, feel free to contact Sarthak at <a href="mailto:s.yadav.2@research.gla.ac.uk">s.yadav.2@research.gla.ac.uk</a>. (Adding [GISE-51] to the subject of the email would be appreciated!)</p>
https://doi.org/10.5281/zenodo.4593514