CitySpeechMix: A Simulated Dataset of Speech and Urban Sound Mixtures from LibriSpeech and SONYC-UST
Description
CitySpeechMix is a simulated audio dataset that mixes speech excerpts from LibriSpeech with environmental recordings from SONYC-UST to create controlled mixtures of voice and background noise. Each audio file is accompanied by the corresponding LibriSpeech transcription and the SONYC-UST sound class labels. A mapping is also provided between the selected SONYC-UST sound classes and their corresponding AudioSet categories.
📊 Dataset Overview
The dataset consists of 742 audio clips, each 10 seconds long:
- 371 mixtures of speech over urban background noise
- 371 voice-free urban environmental recordings
🛠️ Dataset Construction
The dataset, included in the `cityspeechmix.zip` archive, is constructed as follows:
- Environmental sounds are selected from the SONYC-UST v2 evaluation set. Only clips annotated with exactly one of the following seven sound classes are retained: `engine`, `jackhammer`, `chainsaw`, `car horn`, `siren`, `music`, and `dog`.
- The resulting SONYC subset is balanced to 742 clips (106 per class, selected randomly when more clips are available). Of these, 371 clips are retained for mixing (`sonyc_librispeech_mixtures/` folder), and 371 clips are peak-normalized and otherwise left untouched (`sonyc_unmixed_subset/` folder).
- 371 speech clips (approximately 10 seconds each) are randomly selected from the LibriSpeech evaluation set and randomly paired with the 371 SONYC audio files selected for mixing.
- Each pair of SONYC and LibriSpeech clips is resampled to 44.1 kHz and scaled to the same RMS level. To simulate realistic background noise conditions, the SONYC signal is then attenuated by 6 dB prior to mixing (see the mixing sketch after this list).
- The resulting mixtures are peak-normalized.
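For reference, below is a minimal re-implementation sketch of the mixing step, not the authors' exact script. It assumes `librosa` and `soundfile` are available; the function and variable names (`mix_pair`, `TARGET_SR`) are illustrative only.

```python
import numpy as np
import librosa
import soundfile as sf

TARGET_SR = 44_100  # mixtures are rendered at 44.1 kHz


def rms(x):
    """Root-mean-square level of a mono signal."""
    return np.sqrt(np.mean(x ** 2))


def mix_pair(sonyc_path, librispeech_path, out_path, attenuation_db=6.0):
    """Mix a SONYC background clip with a LibriSpeech speech clip.

    Both signals are resampled to 44.1 kHz and scaled to the same RMS,
    the background is attenuated by 6 dB, and the sum is peak-normalized.
    """
    noise, _ = librosa.load(sonyc_path, sr=TARGET_SR, mono=True)
    speech, _ = librosa.load(librispeech_path, sr=TARGET_SR, mono=True)

    # Truncate to a common length (released clips are 10 s long).
    n = min(len(noise), len(speech))
    noise, speech = noise[:n], speech[:n]

    # Match the background RMS to the speech RMS, then attenuate by 6 dB.
    noise *= rms(speech) / (rms(noise) + 1e-12)
    noise *= 10.0 ** (-attenuation_db / 20.0)

    # Sum and peak-normalize the mixture.
    mixture = speech + noise
    mixture /= np.max(np.abs(mixture)) + 1e-12

    sf.write(out_path, mixture, TARGET_SR)
```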
📁 Folder Structure
Inside the `cityspeechmix/` folder:
- `sonyc_librispeech_mixtures/` — 371 speech + background noise mixtures
- `sonyc_unmixed_subset/` — 371 voice-free environmental recordings
The source stems (individual speech and background files for each mixture) are available separately in `stems.zip`.
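As an illustration, the snippet below lists and loads clips from the two folders. It assumes the archive has been extracted to `cityspeechmix/` and that the clips are stored as WAV files; adjust the path and glob pattern to your local copy.

```python
from pathlib import Path

import soundfile as sf

root = Path("cityspeechmix")  # extracted from cityspeechmix.zip

mixtures = sorted((root / "sonyc_librispeech_mixtures").glob("*.wav"))
unmixed = sorted((root / "sonyc_unmixed_subset").glob("*.wav"))
print(f"{len(mixtures)} mixtures, {len(unmixed)} voice-free clips")

# Load one mixture; each clip is 10 s long, rendered at 44.1 kHz.
audio, sr = sf.read(mixtures[0])
print(audio.shape, sr)
```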
📄 Metadata File Description
Each row in `metadata.csv` corresponds to a 10-second audio clip from the CitySpeechMix dataset. The columns are defined as follows (a loading example follows the list):
- `fname` — Filename of the resulting audio file (either a mixture or a voice-free environmental clip).
- `sonyc_file` — Filename of the SONYC-UST environmental recording used.
- `librispeech_file` — Filename of the LibriSpeech audio sample used in the mixture. This field is `NaN` for voice-free clips.
- `script` — Transcription of the spoken content from the LibriSpeech file. This field is `NaN` for voice-free clips.
- `label1_sonyc` — First SONYC sound class label (e.g., `siren`, `dog`, `engine`) associated with the environmental recording.
- `label1_audioset` — Corresponding AudioSet-compatible label for `label1_sonyc`.
- `label2_sonyc` — Second SONYC label, corresponding to the voice label of SONYC-UST. This field is `NaN` for voice-free clips.
- `label2_audioset` — Corresponding AudioSet-compatible label for `label2_sonyc`. This field is `NaN` for voice-free clips.
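One possible way to work with the metadata, assuming `pandas` is installed and that `metadata.csv` sits next to the audio folders (adjust the path to your local copy):

```python
import pandas as pd

# Load the clip-level metadata described above.
meta = pd.read_csv("cityspeechmix/metadata.csv")

# Mixtures carry a LibriSpeech file and transcription;
# voice-free clips have NaN in those columns.
mixtures = meta[meta["librispeech_file"].notna()]
voice_free = meta[meta["librispeech_file"].isna()]
print(len(mixtures), "mixtures /", len(voice_free), "voice-free clips")

# Per-class counts of the background sound label.
print(meta["label1_sonyc"].value_counts())
```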
🔎 Suggested Applications
- Speech anonymization systems
- Robust automatic speech recognition (ASR)
- Urban sound tagging in the presence of voice
📚 Source Datasets
- LibriSpeech
Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015).
Librispeech: An ASR corpus based on public domain audio books.
[Paper] • [Dataset]
- SONYC-UST V2
Cartwright, M., Cramer, J., Bello, J. P., McFee, B., & Salamon, J. (2020).
SONYC-UST V2: An Urban Sound Tagging Dataset with Spatiotemporal Context.
[Paper] • [Dataset]
📎 Citation
If you use CitySpeechMix in your research, please cite it as:
> CitySpeechMix: A Simulated Dataset of Speech and Urban Sound Mixtures from LibriSpeech and SONYC-UST
> Modan Tailleur, Mathieu Lagrange, Pierre Aumond, Vincent Tourre. 2025.
> Zenodo. https://doi.org/10.5281/zenodo.15405950
```bibtex
@misc{tailleur2025cityspeechmix,
  title     = {CitySpeechMix: A Dataset of Speech and Urban Sound Mixtures},
  author    = {Tailleur, Modan and Lagrange, Mathieu and Aumond, Pierre and Tourre, Vincent},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15405950},
  url       = {https://doi.org/10.5281/zenodo.15405950}
}
```