DCASE 2024 Task 9: Language-Queried Audio Source Separation | Validation Set
== Description ==
This is the validation set for Task 9, Language-Queried Audio Source Separation (LASS), in the DCASE 2024 Challenge.
This split is not meant for training LASS methods; it is intended for evaluating them during the model development stage.
This validation set consists of 1000 audio files sourced from Freesound [1], uploaded between April and October 2023. Each audio file has been manually annotated with three captions. In the annotation guidelines, we instructed annotators to describe the content of each audio clip using 5-20 words (similar to the caption style of the Clotho [3] and AudioCaps [4] datasets). The tags of each audio file were verified and revised according to the FSD50K [2] sound event categories. Each audio file has been chunked into a 10-second clip and downsampled to 16 kHz.
== Details ==
The audio files in the archives:
- lass_validation.zip
and the associated metadata (including tags and captions) in the JSON file, loaded in the sketch after this list:
- lass_validation.json
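Below is a minimal Python sketch for loading the metadata and one clip. The JSON layout assumed here (a list of per-clip entries with a file name key and caption fields) is an assumption; check lass_validation.json for the actual schema.

```python
# Minimal inspection sketch; requires the `soundfile` package (pip install soundfile).
# The entry keys used below ("wav" in particular) are hypothetical placeholders.
import json
import soundfile as sf

with open("lass_validation.json") as f:
    metadata = json.load(f)

entry = metadata[0]  # assumed: one entry per 10-second clip
print(entry)         # expected to show the file name, revised tags, and three captions

# Clips are 10 s at 16 kHz, so each file should hold about 160,000 samples.
audio, sr = sf.read("lass_validation/" + entry["wav"])
print(sr, len(audio))  # 16000, ~160000
```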
Participants will evaluate their LASS models on synthetic mixture data during the development stage. Specifically, given an audio clip A1 and its corresponding caption C, we select an additional audio clip, A2, to serve as background noise, thereby creating a mixed audio clip, A3. Given A3 and C as inputs, the LASS system is expected to separate the source A1. We use the revised tag information to ensure that the two audio clips in each mixture share no overlapping sound source classes. Three thousand synthetic mixtures with signal-to-noise ratios (SNRs) ranging from -15 dB to 15 dB will be generated for validating LASS models during development. These mixtures can be generated from the provided CSV file (a mixing sketch follows the list):
- lass_synthetic_validation.csv
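As an illustration of the mixing recipe described above, the following Python sketch scales a noise clip so that the mixture reaches a target SNR. This is a generic formulation under stated assumptions (mono clips of equal length), not the challenge's exact generation script; the CSV file and the baseline repository define the official procedure.

```python
# Illustrative SNR-controlled mixing of a source clip A1 and a noise clip A2.
import numpy as np

def mix_at_snr(source: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the source-to-noise power ratio equals `snr_db`, then add."""
    source_power = np.mean(source ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(source_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return source + gain * noise

rng = np.random.default_rng(0)
a1 = rng.standard_normal(160_000)      # stand-in for the 10 s source clip A1
a2 = rng.standard_normal(160_000)      # stand-in for the background clip A2
a3 = mix_at_snr(a1, a2, snr_db=-15.0)  # SNRs in the set span -15 dB to 15 dB
```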
The evaluation tool can be found at: https://github.com/Audio-AGI/dcase2024_task9_baseline/blob/main/dcase_evaluator.py
== References ==
[1] Fonseca E, Pons J, Favory X, et al. Freesound Datasets: a platform for the creation of open audio datasets. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2017.
[2] Fonseca E, Favory X, Pons J, et al. FSD50K: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 829-852.
[3] Drossos K, Lipping S, Virtanen T. Clotho: An audio captioning dataset. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020: 736-740.
[4] Kim C D, Kim B, Lee H, et al. AudioCaps: Generating captions for audios in the wild. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2019: 119-132.