Published October 16, 2021 | Version 0.1.0
Dataset | Open Access

FSD-MIX-CLIPS

  • 1. New York University
  • 2. Adobe Research
  • 3. New Jersey Institute of Technology


Created by

Yu Wang, Nicholas J. Bryan, Justin Salamon, Mark Cartwright, and Juan Pablo Bello

 

Publication

If using this data in academic work, please cite the following paper, which presented this dataset:

Y. Wang, N. J. Bryan, J. Salamon, M. Cartwright, and J. P. Bello. "Who calls the shots? Rethinking Few-shot Learning for Audio", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021

 

Description

FSD-MIX-CLIPS is an open dataset of programmatically mixed audio clips with a controlled level of polyphony and signal-to-noise ratio. We use single-labeled clips from FSD50K [1] as the source material for the foreground sound events and Brownian noise as the background to generate 281,039 10-second strongly labeled soundscapes with Scaper. We refer to this intermediate dataset of 10-second soundscapes as FSD-MIX-SED. Each soundscape contains n events from n different sound classes, where n ranges from 1 to 5. We then extract 614,533 1-second clips centered on each sound event in the soundscapes of FSD-MIX-SED to produce FSD-MIX-CLIPS.

 

Source material and annotations

Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and the soundscape annotations in JAMS format, which can be used to reproduce FSD-MIX-SED with Scaper using the script in the project repository.
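As a rough sketch of this regeneration step (the actual script lives in the project repository; all paths below are placeholders), Scaper's `generate_from_jams` can render one soundscape from a released JAMS annotation:

```python
def regenerate_soundscape(jams_path, out_wav, fg_path, bg_path):
    """Render one FSD-MIX-SED soundscape from its JAMS annotation.

    fg_path and bg_path point at the foreground and background folders
    unpacked from FSD_MIX_SED.source.tar.gz (placeholder layout).
    """
    import scaper  # deferred so the sketch parses without Scaper installed
    scaper.generate_from_jams(
        jams_infile=str(jams_path),
        audio_outfile=str(out_wav),
        fg_path=str(fg_path),
        bg_path=str(bg_path),
    )
```

Looping this function over all 281,039 JAMS files would rebuild the full FSD-MIX-SED audio locally.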

All clips in FSD-MIX-CLIPS are extracted from FSD-MIX-SED. Therefore, for FSD-MIX-CLIPS, instead of releasing duplicated audio content, we provide annotations that specify the filename in FSD-MIX-SED and the starting time (in seconds) of each 1-second clip.
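A minimal sketch of how such an annotation entry could be turned into a clip, assuming the 10-second soundscape has already been loaded as a sample array (the sample rate and helper name are illustrative, not part of the release):

```python
import numpy as np

SR = 44100  # assumed sample rate; verify against the regenerated audio

def extract_clip(soundscape: np.ndarray, start_sec: float, dur_sec: float = 1.0) -> np.ndarray:
    """Slice a fixed-duration clip out of a soundscape waveform."""
    start = int(round(start_sec * SR))
    stop = start + int(round(dur_sec * SR))
    if stop > len(soundscape):
        raise ValueError("clip extends past the end of the soundscape")
    return soundscape[start:stop]

# Toy usage: a 10 s "soundscape" of zeros, clip starting at 2.5 s.
clip = extract_clip(np.zeros(10 * SR), start_sec=2.5)
print(clip.shape)  # (44100,)
```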

 

Foreground material from FSD50K

We choose clips shorter than 4 seconds that have a single validated label with the Present and Predominant annotation type, and further trim the silence at the edges of each clip. The resulting subset contains clips that each carry a single, strong label. The 200 sound classes in FSD50K are hierarchically organized; we focus on the leaf nodes and rule out classes with fewer than 20 single-labeled clips. This gives us 89 sound classes. vocab.json contains the list of these 89 classes; each class is labeled by its index in the list.
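For illustration, assuming vocab.json holds a flat JSON list of the 89 class names (so the integer label of a class is simply its position in the list; the helper name and toy classes below are made up), the index-to-name mapping could be built like this:

```python
import json

def load_vocab(path: str):
    """Return (index -> name, name -> index) maps from vocab.json."""
    with open(path) as f:
        classes = json.load(f)  # assumed: a flat JSON list of class names
    idx_to_name = dict(enumerate(classes))
    name_to_idx = {name: i for i, name in enumerate(classes)}
    return idx_to_name, name_to_idx

# Toy check with a stand-in list (the real file holds 89 names):
idx_to_name = dict(enumerate(["Bark", "Meow"]))
print(idx_to_name[1])  # Meow
```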

 

Data splits

FSD-MIX-CLIPS was originally generated for the task of multi-label audio classification under a few-shot continual learning setup. The classes are therefore split into disjoint sets of base and novel classes, where novel-class data are used only at inference time. We partition the 89 classes into three splits: base, novel-val, and novel-test, with 59, 15, and 15 classes, respectively. Base-class data are used for both training and evaluation, while novel-val and novel-test class data are used for validation and testing only.
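As an illustrative helper (the label ranges come from the folder layout: labels 0-58 are base, 59-73 novel-val, and 74-88 novel-test), a class label can be mapped to its split:

```python
def label_to_split(label: int) -> str:
    """Map a class index (0-88) to its data split."""
    if 0 <= label <= 58:
        return "base"
    if 59 <= label <= 73:
        return "novel-val"
    if 74 <= label <= 88:
        return "novel-test"
    raise ValueError(f"label out of range: {label}")

print(label_to_split(60))  # novel-val
```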

 

Files

  • FSD_MIX_SED.source.tar.gz contains the background Brownian noise and 10,296 single-labeled sound event clips from FSD50K in `.wav` format. The uncompressed size is 1.9 GB.
  • FSD_MIX_SED.annotations.tar.gz contains 281,039 JAMS files. The uncompressed size is 35 GB.
  • FSD_MIX_CLIPS.annotations.tar.gz contains the ground-truth labels for the 1-second clips in each data split, each specified by a filename in FSD_MIX_SED and a starting time (in seconds).
  • vocab.json contains the list of 89 sound classes.

Foreground sound materials and soundscape annotations in FSD_MIX_SED are organized in a similar folder structure, following the data splits:

root folder
│   
└───base/                                Base classes (label 0-58)
│   │   
│   └─── train/                             
│   │    │        
│   │    └─── audio or annotation files  
│   │     
│   └─── val/                            
│   │    │        
│   │    └─── audio or annotation files                                
│   │   
│   └─── test/                           
│        │        
│        └─── audio or annotation files 
│
│
└───val/                                 Novel-val classes (label 59-73)
│   │            
│   └─── audio or annotation files        
│  
│   
└───test/                                Novel-test classes (label 74-88)
    │            
    └─── audio or annotation files       
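Following the tree above, a small helper (the function name and root path are hypothetical) could resolve the folder that holds a given class label's files:

```python
from pathlib import Path

def split_dir(root: str, label: int, subset: str = "train") -> Path:
    """Resolve the folder for a class label under the layout above.

    Base classes (labels 0-58) live in base/{train,val,test}/;
    novel-val (59-73) and novel-test (74-88) classes sit directly
    in val/ and test/, with no train subset.
    """
    base = Path(root)
    if 0 <= label <= 58:
        if subset not in ("train", "val", "test"):
            raise ValueError("subset must be train, val, or test")
        return base / "base" / subset
    if 59 <= label <= 73:
        return base / "val"
    if 74 <= label <= 88:
        return base / "test"
    raise ValueError(f"label out of range: {label}")

print(split_dir("/data/FSD_MIX_SED", 60).as_posix())  # /data/FSD_MIX_SED/val
```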

 

References

[1] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.

Files (1.0 GB)

vocab.json (previewable)

MD5 checksums and compressed sizes:

  • md5:f713b372a3666c34467e651bd56b6db0 (4.3 MB)
  • md5:b3813d63a5b2c851dc32fc24e3ad6386 (91.7 MB)
  • md5:0d771b20ec867800386f4d5d2cbd77dd (948.4 MB)
  • md5:3c708340c3d0c48034b37164210ae1af (1.4 kB)