Published June 10, 2025 | Version v1
Dataset Open

DCASE 2024 Task 7 Dataset - Open Source

  • 1. Laboratoire des Sciences du Numérique de Nantes
  • 2. Gaudio Lab Inc.
  • 3. Carnegie Mellon University
  • 4. Gaudio Lab Inc.
  • 5. New York University
  • 6. Doshisha University
  • 7. Ritsumeikan University

Description

This dataset supports the development and evaluation of prompt-based generative algorithms for environmental sound synthesis. It is designed for the Sound Scene Synthesis task, which consists of generating realistic environmental sound scenes from textual descriptions.

The dataset is a free and open version of the one used in the DCASE 2024 Task 7 challenge on sound scene synthesis. For a full description of the task and access to challenge results, please consult the official challenge page. An in-depth description of the challenge evaluation protocol and a detailed analysis of the results are available in [1].

Unlike the official challenge dataset, this version includes only audio sourced from Freesound and excludes any proprietary or private sound libraries.

📊 Dataset Overview

The dataset includes 310 audio clips, each 4 seconds long, along with their corresponding text prompts. Unlike typical audio captioning datasets, both the prompts and audio scenes were manually crafted and edited. This enables a more controlled and quantifiable evaluation of generative models.

Prompts follow a fixed structure:

> (foreground sound source) with (background sound source) in the background
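
Because the template is fixed, a prompt can be split back into its two parts with a regular expression. The following is a minimal sketch, not part of the dataset tooling: the function name `split_prompt` and the example prompt are hypothetical, and it assumes that a prompt without a background clause describes the foreground only. It splits on the last " with ", so a background description that itself contains "with" would be split incorrectly.

```python
import re

# Template from the description:
#   (foreground sound source) with (background sound source) in the background
# The greedy foreground group makes the split happen at the LAST " with ".
PROMPT_PATTERN = re.compile(
    r"^\s*(?P<foreground>.+)\s+with\s+(?P<background>.+?)\s+in the background\s*\.?\s*$",
    re.IGNORECASE,
)


def split_prompt(prompt):
    """Return (foreground, background) for a prompt following the template.

    Assumption: when the template does not match, the whole prompt is
    treated as a foreground-only description and background is None.
    """
    match = PROMPT_PATTERN.match(prompt)
    if match is None:
        return prompt.strip(), None
    return match.group("foreground"), match.group("background")


# Hypothetical prompt written for illustration, not taken from caption.csv.
example = "a dog barking with birds chirping in the background"
foreground, background = split_prompt(example)
```

For the hypothetical prompt above, `foreground` is `"a dog barking"` and `background` is `"birds chirping"`.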

Foreground sounds are action-based (e.g., a dog barking). They fall into six categories:

- animal
- vehicle
- human
- alarm
- tool
- entrance

These are paired with five possible background categories:

- crowd
- traffic
- water
- birds
- no background

> Note: Foreground vehicle sounds are not paired with the traffic background to avoid redundancy. The no background category enables the evaluation of monophonic scenes with isolated foreground sources.

The dataset is split into a development set and an evaluation set:

- Development Set: 60 audio–caption pairs (backgrounds: crowd, traffic, water)
- Evaluation Set: 250 audio–caption pairs (backgrounds: crowd, traffic, water, birds, no background)
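
The category lists and the vehicle/traffic exclusion described above determine which foreground/background category pairs can occur in each split. The sketch below enumerates them; the variable and function names are illustrative. It only counts category pairs: the description does not state how the 60 and 250 clips are distributed over these pairs.

```python
from itertools import product

FOREGROUNDS = ["animal", "vehicle", "human", "alarm", "tool", "entrance"]
DEV_BACKGROUNDS = ["crowd", "traffic", "water"]
EVAL_BACKGROUNDS = ["crowd", "traffic", "water", "birds", "no background"]

# Foreground vehicle sounds are never paired with the traffic background.
EXCLUDED_PAIR = ("vehicle", "traffic")


def valid_pairs(backgrounds):
    """List every (foreground, background) category pair allowed by the rules."""
    return [
        pair
        for pair in product(FOREGROUNDS, backgrounds)
        if pair != EXCLUDED_PAIR
    ]


dev_pairs = valid_pairs(DEV_BACKGROUNDS)    # 6 * 3 - 1 = 17 category pairs
eval_pairs = valid_pairs(EVAL_BACKGROUNDS)  # 6 * 5 - 1 = 29 category pairs
```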

📁 Folder Structure

Inside the DCASE-TASK7-2024-Open-Source/ folder:

DCASE-TASK7-2024-Open-Source/
├── dev/
│ ├── audio/
│ └── caption.csv
└── eval/
  ├── audio/
  └── caption.csv


- audio/: Contains the audio clips in WAV format.
- caption.csv: Provides corresponding prompts for each audio file.
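
Given this layout, one split can be read with the Python standard library alone. This is a minimal sketch under stated assumptions: `load_split` is a hypothetical helper, `caption.csv` is assumed to be comma-separated UTF-8, and the audio files are assumed to use a lowercase `.wav` extension. The column layout of `caption.csv` is not documented here, so the rows are returned as raw lists without interpreting them; inspect the first row to see whether it is a header.

```python
import csv
from pathlib import Path


def load_split(root, split):
    """Return (caption_rows, wav_paths) for the 'dev' or 'eval' split.

    root is the extracted DCASE-TASK7-2024-Open-Source/ folder.
    Assumptions: comma-separated UTF-8 CSV, lowercase '.wav' extension.
    """
    split_dir = Path(root) / split
    with open(split_dir / "caption.csv", newline="", encoding="utf-8") as handle:
        caption_rows = list(csv.reader(handle))
    wav_paths = sorted((split_dir / "audio").glob("*.wav"))
    return caption_rows, wav_paths
```

A call such as `load_split("DCASE-TASK7-2024-Open-Source", "dev")` should then yield 60 audio files for the development set, and the `eval` split 250, if the assumptions above hold.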

📎 Citation

If you use this dataset in your research, please cite it as:

Tailleur, Modan; Lee, Junwon; Heller, Laurie; Choi, Keunwoo; McFee, Brian; Lagrange, Mathieu; Imoto, Keisuke; Okamoto, Yuki.   
DCASE 2024 Task 7 Dataset - Open Source.  Zenodo, 2024. DOI: 10.5281/zenodo.15630417

@misc{dcase2024task7opensource,
  title = {DCASE 2024 Task 7 Dataset - Open Source},
  author = {Tailleur, Modan and Lee, Junwon and Heller, Laurie and Choi, Keunwoo and McFee, Brian and Lagrange, Mathieu and Imoto, Keisuke and Okamoto, Yuki},
  year = {2024},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.15630417}
}

📚 References

[1] Lee, Junwon; Tailleur, Modan; Heller, Laurie M.; Choi, Keunwoo; Lagrange, Mathieu; McFee, Brian; Imoto, Keisuke; Okamoto, Yuki.  
Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation. In Audio Imagination: NeurIPS 2024 Workshop on AI-Driven Speech, Music, and Sound Generation, 2024.

Files

DCASE-TASK7-2024-Open-Source.zip (146.8 MB)
md5:f7413cf80f644ea0e3e84b76060877ac
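
The MD5 checksum listed above can be used to confirm that the archive downloaded intact. A minimal sketch using only the standard library follows; `md5_of_file` is an illustrative helper, and the local path passed to it is an assumption about where the archive was saved.

```python
import hashlib

# Checksum copied from the file listing for DCASE-TASK7-2024-Open-Source.zip.
EXPECTED_MD5 = "f7413cf80f644ea0e3e84b76060877ac"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path):
    """Return True when the file at path matches the published checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, `archive_is_intact("DCASE-TASK7-2024-Open-Source.zip")` returns `True` only if the local copy matches the published checksum.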