Published May 14, 2025 | Version V1.0
Dataset Open

CitySpeechMix: A Simulated Dataset of Speech and Urban Sound Mixtures from LibriSpeech and SONYC-UST

  • 1. Laboratoire des Sciences du Numérique de Nantes
  • 2. Université Gustave Eiffel - Campus de Nantes
  • 3. École Centrale de Nantes

Description

CitySpeechMix is a simulated audio dataset that mixes speech excerpts from LibriSpeech with environmental recordings from SONYC-UST to create controlled mixtures of voice and background noise. Each audio file is accompanied by the corresponding LibriSpeech transcription and the SONYC-UST sound class labels. A mapping is also provided between the selected SONYC-UST sound classes and their corresponding AudioSet categories.

📊 Dataset Overview

The dataset consists of 742 audio clips, each 10 seconds long:
- 371 mixtures of speech over urban background noise
- 371 voice-free urban environmental recordings

🛠️ Dataset Construction

The dataset, included in the `cityspeechmix.zip` archive, is constructed as follows:

- Environmental sounds are selected from the SONYC-UST v2 evaluation set. Only clips annotated with exactly one of the following seven sound classes are retained: `engine`, `jackhammer`, `chainsaw`, `car horn`, `siren`, `music`, and `dog`.
- The resulting SONYC subset is balanced to 742 clips (106 per class, selected randomly when more clips are available). Of these, 371 clips are retained for mixing (`sonyc_librispeech_mixtures/` folder), and the other 371 are peak-normalized and left unmixed (`sonyc_unmixed_subset/`).
- 371 speech clips (approximately 10 seconds each) are randomly selected from the LibriSpeech evaluation set and matched randomly to the 371 SONYC audio files selected for mixing.
- Each pair of SONYC and LibriSpeech clips is resampled to 44.1 kHz and scaled to the same RMS level. To simulate realistic background-noise conditions, the SONYC signal is then attenuated by 6 dB prior to mixing.
- The resulting mixtures are peak-normalized.
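The mixing steps above (RMS equalization, 6 dB background attenuation, summation, peak normalization) can be sketched as follows. This is a minimal illustration with NumPy, not the authors' actual pipeline; the function name and the epsilon guards are assumptions, and the inputs are taken to be already resampled and trimmed to equal length.

```python
import numpy as np

def mix_speech_and_background(speech, background, attenuation_db=6.0):
    """Mix a speech clip over a background clip: equalize RMS,
    attenuate the background, sum, then peak-normalize.
    Both inputs are 1-D float arrays at 44.1 kHz of equal length."""
    def rms(x):
        return np.sqrt(np.mean(x ** 2))

    # Scale the background so both signals have the same RMS.
    background = background * (rms(speech) / (rms(background) + 1e-12))
    # Attenuate the background by `attenuation_db` decibels.
    background = background * 10 ** (-attenuation_db / 20)
    mixture = speech + background
    # Peak-normalize the result to the [-1, 1] range.
    return mixture / (np.max(np.abs(mixture)) + 1e-12)
```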

📁 Folder Structure

Inside the `cityspeechmix/` folder:

- `sonyc_librispeech_mixtures/` — 371 speech + background noise mixtures  
- `sonyc_unmixed_subset/` — 371 voice-free environmental recordings  

The source stems (individual speech and background files for each mixture) are available separately in `stems.zip`.

📄 Metadata File Description

Each row in `metadata.csv` corresponds to a 10-second audio clip from the CitySpeechMix dataset. The columns are defined as follows:

- `fname` — Filename of the resulting audio file (either a mixture or a reference clip).
- `sonyc_file` — Filename of the SONYC-UST environmental recording used.
- `librispeech_file` — Filename of the LibriSpeech audio sample used in the mixture. This field is `NaN` for voice-free clips.
- `script` — Transcription of the spoken content from the LibriSpeech file. This field is `NaN` for voice-free clips.
- `label1_sonyc` — First SONYC sound class label (e.g., `siren`, `dog`, `engine`) associated with the environmental recording.
- `label1_audioset` — Corresponding AudioSet-compatible label for `label1_sonyc`.
- `label2_sonyc` — Second SONYC label, corresponding to the voice label of SONYC-UST. This field is `NaN` for voice-free clips.
- `label2_audioset` — Corresponding AudioSet-compatible label for `label2_sonyc`. This field is `NaN` for voice-free clips.

🔎 Suggested Applications

- Speech anonymization systems
- Robust automatic speech recognition (ASR)
- Urban sound tagging in the presence of voice

📚 Source Datasets

- LibriSpeech  
  Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015).  
  Librispeech: An ASR corpus based on public domain audio books.  
  [Paper] • [Dataset]

- SONYC-UST V2  
  Cartwright, M., Cramer, J., Bello, J. P., McFee, B., & Salamon, J. (2020).  
  SONYC-UST V2: An Urban Sound Tagging Dataset with Spatiotemporal Context.  
  [Paper] • [Dataset]

📎 Citation

If you use CitySpeechMix in your research, please cite it as:

> CitySpeechMix: A Simulated Dataset of Speech and Urban Sound Mixtures from LibriSpeech and SONYC-UST  
> Modan Tailleur, Mathieu Lagrange, Pierre Aumond, Vincent Tourre. 2025.  
> Zenodo. https://doi.org/10.5281/zenodo.15405950

@misc{tailleur2025cityspeechmix,
  title        = {CitySpeechMix: A Dataset of Speech and Urban Sound Mixtures},
  author       = {Tailleur, Modan and Lagrange, Mathieu and Aumond, Pierre and Tourre, Vincent},
  year         = 2025,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.15405950},
  url          = {https://doi.org/10.5281/zenodo.15405950}
}

Files (924.2 MB)

- `cityspeechmix.zip`: 597.7 MB (md5: 4235162ecf38c1e74adbe7d4b77d24b9)
- `metadata.csv`: 129.5 kB (md5: cf66823f151e7314b914a63935c52d0e)
- `stems.zip`: 326.4 MB (md5: c7729e68de4f49d09989bc07dd2aa655)