DCASE 2024 Task 9: Language-Queried Audio Source Separation | Validation Set
== Description ==
This is the validation set for Task 9, Language-Queried Audio Source Separation (LASS), in the DCASE 2024 Challenge.
This split is not meant for training LASS methods; it is intended for evaluating them during the model development stage.
This validation set consists of 1000 audio files sourced from Freesound [1], uploaded between April and October 2023. Each audio file has been manually annotated with three captions. In the annotation guidelines, we instructed annotators to describe the content of each audio clip using 5-20 words (similar to the caption style of the Clotho [3] and AudioCaps [4] datasets). The tags of each audio file were verified and revised according to the FSD50K [2] sound event categories. Each audio file has been chunked into a 10-second clip and downsampled to 16 kHz.
== Details ==
The audio files in the archives:
- lass_validation.zip
and the associated metadata (including tags and captions) in the JSON file, loaded in the sketch after this list:
- lass_validation.json
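Below is a minimal Python sketch for loading the metadata and one clip. The JSON layout assumed here (a list of per-clip entries with a file name key and caption fields) is an assumption; check lass_validation.json for the actual schema.

```python
# Minimal inspection sketch; requires the `soundfile` package (pip install soundfile).
# The entry keys used below ("wav" in particular) are hypothetical placeholders.
import json
import soundfile as sf

with open("lass_validation.json") as f:
    metadata = json.load(f)

entry = metadata[0]  # assumed: one entry per 10-second clip
print(entry)         # expected to show the file name, revised tags, and three captions

# Clips are 10 s at 16 kHz, so each file should hold about 160,000 samples.
audio, sr = sf.read("lass_validation/" + entry["wav"])
print(sr, len(audio))  # 16000, ~160000
```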
Participants will evaluate their LASS models on synthetic mixture data during the development stage. Specifically, given an audio clip A1 and its corresponding caption C, we select an additional audio clip, A2, to serve as background noise, thereby creating a mixed audio clip, A3. Given A3 and C as inputs, the LASS system is expected to separate the source A1. We use the revised tag information to ensure that the two audio clips in each mixture share no overlapping sound source classes. Three thousand synthetic mixtures with signal-to-noise ratios (SNRs) ranging from -15 dB to 15 dB will be generated for validating LASS models during development. These mixtures can be generated from the provided CSV file (a mixing sketch follows the list):
- lass_synthetic_validation.csv
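As an illustration of the mixing recipe described above, the following Python sketch scales a noise clip so that the mixture reaches a target SNR. This is a generic formulation under stated assumptions (mono clips of equal length), not the challenge's exact generation script; the CSV file and the baseline repository define the official procedure.

```python
# Illustrative SNR-controlled mixing of a source clip A1 and a noise clip A2.
import numpy as np

def mix_at_snr(source: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the source-to-noise power ratio equals `snr_db`, then add."""
    source_power = np.mean(source ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(source_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return source + gain * noise

rng = np.random.default_rng(0)
a1 = rng.standard_normal(160_000)      # stand-in for the 10 s source clip A1
a2 = rng.standard_normal(160_000)      # stand-in for the background clip A2
a3 = mix_at_snr(a1, a2, snr_db=-15.0)  # SNRs in the set span -15 dB to 15 dB
```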
The evaluation tool can be found at: https://github.com/Audio-AGI/dcase2024_task9_baseline/blob/main/dcase_evaluator.py
== References ==
[1] Fonseca E, Pons J, Favory X, et al. Freesound Datasets: a platform for the creation of open audio datasets. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2017.
[2] Fonseca E, Favory X, Pons J, et al. FSD50K: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 829-852.
[3] Drossos K, Lipping S, Virtanen T. Clotho: An audio captioning dataset. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020: 736-740.
[4] Kim C D, Kim B, Lee H, et al. AudioCaps: Generating captions for audios in the wild. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2019: 119-132.