Published May 14, 2025 | Version V1.0
Dataset Open

CitySpeechMix: A Simulated Dataset of Speech and Urban Sound Mixtures from LibriSpeech and SONYC-UST

  • 1. Laboratoire des Sciences du Numérique de Nantes
  • 2. Université Gustave Eiffel - Campus de Nantes
  • 3. École Centrale de Nantes

Description

CitySpeechMix is a simulated audio dataset that mixes speech excerpts from LibriSpeech with environmental recordings from SONYC-UST to create controlled mixtures of voice and background noise. Each audio file is accompanied by the corresponding LibriSpeech transcription and the SONYC-UST sound class labels. A mapping is also provided between the selected SONYC-UST sound classes and their corresponding AudioSet categories.

📊 Dataset Overview

The dataset consists of 742 audio clips, each 10 seconds long:
- 371 mixtures of speech over urban background noise
- 371 voice-free urban environmental recordings

🛠️ Dataset Construction

The dataset, included in the `cityspeechmix.zip` archive, is constructed as follows:

- Environmental sounds are selected from the SONYC-UST v2 evaluation set. Only clips annotated with exactly one of the following seven sound classes are retained: `engine`, `jackhammer`, `chainsaw`, `car horn`, `siren`, `music`, and `dog`.
- The resulting SONYC subset is balanced to 742 clips (106 per class, selected randomly when more clips are available). Of these, 371 clips are retained for mixing (`sonyc_librispeech_mixtures/` folder), and the other 371 are peak-normalized and left unmixed (`sonyc_unmixed_subset/`).
- 371 speech clips (approximately 10 seconds each) are randomly selected from the LibriSpeech evaluation set and matched randomly to the 371 SONYC audio files selected for mixing.
- Each pair of SONYC and LibriSpeech clips is resampled to 44.1 kHz and scaled to the same RMS level. To simulate realistic background-noise conditions, the SONYC signal is then attenuated by 6 dB prior to mixing.
- The resulting mixtures are peak-normalized.
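The mixing steps above (RMS equalization, 6 dB background attenuation, summation, peak normalization) can be sketched as follows. This is a minimal illustration with NumPy, not the authors' actual pipeline; the function name and the epsilon guards are assumptions, and the inputs are taken to be already resampled and trimmed to equal length.

```python
import numpy as np

def mix_speech_and_background(speech, background, attenuation_db=6.0):
    """Mix a speech clip over a background clip: equalize RMS,
    attenuate the background, sum, then peak-normalize.
    Both inputs are 1-D float arrays at 44.1 kHz of equal length."""
    def rms(x):
        return np.sqrt(np.mean(x ** 2))

    # Scale the background so both signals have the same RMS.
    background = background * (rms(speech) / (rms(background) + 1e-12))
    # Attenuate the background by `attenuation_db` decibels.
    background = background * 10 ** (-attenuation_db / 20)
    mixture = speech + background
    # Peak-normalize the result to the [-1, 1] range.
    return mixture / (np.max(np.abs(mixture)) + 1e-12)
```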

📁 Folder Structure

Inside the `cityspeechmix/` folder:

- `sonyc_librispeech_mixtures/` — 371 speech + background noise mixtures  
- `sonyc_unmixed_subset/` — 371 voice-free environmental recordings  

The source stems (individual speech and background files for each mixture) are available separately in `stems.zip`.

📄 Metadata File Description

Each row in `metadata.csv` corresponds to a 10-second audio clip from the CitySpeechMix dataset. The columns are defined as follows:

- `fname` — Filename of the resulting audio file (either a mixture or a reference clip).
- `sonyc_file` — Filename of the SONYC-UST environmental recording used.
- `librispeech_file` — Filename of the LibriSpeech audio sample used in the mixture. This field is `NaN` for voice-free clips.
- `script` — Transcription of the spoken content from the LibriSpeech file. This field is `NaN` for voice-free clips.
- `label1_sonyc` — First SONYC sound class label (e.g., `siren`, `dog`, `engine`) associated with the environmental recording.
- `label1_audioset` — Corresponding AudioSet-compatible label for `label1_sonyc`.
- `label2_sonyc` — Second SONYC label, corresponding to the voice label of SONYC-UST. This field is `NaN` for voice-free clips.
- `label2_audioset` — Corresponding AudioSet-compatible label for `label2_sonyc`. This field is `NaN` for voice-free clips.

🔎 Suggested Applications

- Speech anonymization systems
- Robust automatic speech recognition (ASR)
- Urban sound tagging in the presence of voice

📚 Source Datasets

- LibriSpeech  
  Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015).  
  Librispeech: An ASR corpus based on public domain audio books.  
  [Paper] • [Dataset]

- SONYC-UST V2  
  Cartwright, M., Cramer, J., Bello, J. P., McFee, B., & Salamon, J. (2020).  
  SONYC-UST V2: An Urban Sound Tagging Dataset with Spatiotemporal Context.  
  [Paper] • [Dataset]

📎 Citation

If you use CitySpeechMix in your research, please cite it as:

> CitySpeechMix: A Simulated Dataset of Speech and Urban Sound Mixtures from LibriSpeech and SONYC-UST  
> Modan Tailleur, Mathieu Lagrange, Pierre Aumond, Vincent Tourre. 2025.  
> Zenodo. https://doi.org/10.5281/zenodo.15405950

@misc{tailleur2025cityspeechmix,
  title        = {CitySpeechMix: A Dataset of Speech and Urban Sound Mixtures},
  author       = {Tailleur, Modan and Lagrange, Mathieu and Aumond, Pierre and Tourre, Vincent},
  year         = 2025,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.15405950},
  url          = {https://doi.org/10.5281/zenodo.15405950}
}

Files (924.2 MB)

- `cityspeechmix.zip`: 597.7 MB (md5: 4235162ecf38c1e74adbe7d4b77d24b9)
- `metadata.csv`: 129.5 kB (md5: cf66823f151e7314b914a63935c52d0e)
- `stems.zip`: 326.4 MB (md5: c7729e68de4f49d09989bc07dd2aa655)