Published October 17, 2021 | Version 1.0.0
Dataset Open

Synthetic noisy urban soundscapes: a dataset of synthetic soundscapes with real urban backgrounds


Description

Publication

 

If you use this data in your work, please cite the following papers, which introduced this dataset:

 

[1] Pishdadian, F., Wichern, G., and Le Roux, J. (2020). Finding strength in weakness: Learning to separate sounds with weak supervision. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP).

[2] Cramer, A., Cartwright, M., Pishdadian, F., and Bello, J. P. (2021). Weakly Supervised Source-Specific Sound Level Estimation in Noisy Soundscapes. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).


Created by

Fatemeh Pishdadian (1), Gordon Wichern (2), Jonathan Le Roux (2), Aurora Cramer (3, 4), Mark Cartwright (5), and Juan Pablo Bello (3,4,6,7)

    1. Interactive Audio Lab, Northwestern University
    2. Mitsubishi Electric Research Laboratory
    3. Music and Audio Research Lab, New York University
    4. Department of Electrical and Computer Engineering, New York University
    5. Department of Informatics, New Jersey Institute of Technology
    6. Center for Urban Science and Progress, New York University
    7. Department of Computer Science and Engineering, New York University


Description

Synthetic noisy urban soundscapes (SNUSS) is a dataset of synthetic soundscapes built over real urban background noise. It contains 30,000 4-second mixtures, their isolated components, and auto-generated annotations. The dataset was developed to synthesize soundscapes with a diverse set of realistic-sounding background activity, for use in developing and evaluating machine listening systems in urban settings.


Mixture generation

We generate synthetic mixtures using a collection of isolated sound events with class annotations, as well as a collection of urban background noise recordings. Audio mixtures are 4 seconds long, sampled at 16 kHz. We generate a foreground sub-mixture using a subset of clips from UrbanSound8K [3] from the car horn, dog bark, gun shot, jackhammer, and siren classes. These clips range from 0.5 s to 4 s in duration. The number of events per mixture is sampled from a zero-truncated Poisson distribution with a rate parameter of 5. The class for each event is chosen uniformly at random from the five target classes. The particular sound event is chosen uniformly at random from the available clips for that class. The start time is chosen uniformly at random such that the entire clip is contained within the 4-second mixture. A brief fade-in and fade-out are applied to each clip to avoid discontinuities. Each clip is set to a sound level sampled uniformly at random in the range -30 to -20 dB LUFS.
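The foreground sampling above can be sketched as follows. The function name, the `clip_durations` argument, and the returned event dictionaries are illustrative placeholders, not part of the released generation code; the real pipeline draws actual UrbanSound8K clips.

```python
import numpy as np

MIX_DUR = 4.0  # mixture length in seconds (as stated above)
CLASSES = ["car_horn", "dog_bark", "gun_shot", "jackhammer", "siren"]

def sample_event_plan(rng, clip_durations):
    """Plan one mixture's foreground events.

    `clip_durations` maps each class name to a list of available clip
    lengths in seconds (a stand-in for the real clip collection).
    """
    # Zero-truncated Poisson(5): resample until at least one event.
    n_events = 0
    while n_events == 0:
        n_events = rng.poisson(5)

    events = []
    for _ in range(n_events):
        cls = CLASSES[rng.integers(len(CLASSES))]   # class: uniform over 5 targets
        durs = clip_durations[cls]
        dur = durs[rng.integers(len(durs))]         # clip: uniform over that class
        start = rng.uniform(0.0, MIX_DUR - dur)     # clip fully inside the mixture
        level = rng.uniform(-30.0, -20.0)           # target level, dB LUFS
        events.append({"class": cls, "start": start, "dur": dur, "level": level})
    return events
```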

For the background audio, we use urban background recordings from the SONYC-Background dataset [2, 4], containing 441 10-second recordings of urban background noise in New York City. For more information, see the SONYC-Backgrounds page. For each mixture, a random background clip is chosen, from which we extract a uniformly chosen 4-second segment.
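Extracting a uniformly positioned 4-second segment from a 10-second recording amounts to sampling a start offset. A minimal sketch, assuming the audio is already loaded as a NumPy array at 16 kHz:

```python
import numpy as np

SR = 16000  # sample rate used for the mixtures

def random_segment(background, seg_dur, rng):
    """Return a uniformly positioned segment of `seg_dur` seconds."""
    seg_len = int(round(seg_dur * SR))
    max_offset = len(background) - seg_len
    offset = rng.integers(max_offset + 1)  # uniform over all valid start samples
    return background[offset:offset + seg_len]
```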

We create datasets using foreground-to-background SNRs of -50, -20, and 0 dB (`n50dB`, `n20dB`, and `0dB`, respectively), in addition to a noiseless dataset (`none`). The datasets are generated such that the only difference between them is the relative loudness between the foreground and background.

For the training set, we generate 20,000 mixtures using folds 1-6 of UrbanSound8K and the training set of SONYC-Background. For the validation set, we generate 5,000 examples using folds 7-8 of UrbanSound8K and the validation set of SONYC-Background. For the test set, we generate 5,000 examples using folds 9-10 of UrbanSound8K and the test set of SONYC-Background.
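The split layout above can be summarized as a small configuration table (fold numbers and mixture counts exactly as stated); the variable name is illustrative.

```python
# Split definitions: UrbanSound8K folds, SONYC-Background split, mixture count.
SPLITS = {
    "train":      {"us8k_folds": [1, 2, 3, 4, 5, 6], "sonyc_split": "train",      "n_mixtures": 20000},
    "validation": {"us8k_folds": [7, 8],             "sonyc_split": "validation", "n_mixtures": 5000},
    "test":       {"us8k_folds": [9, 10],            "sonyc_split": "test",       "n_mixtures": 5000},
}
```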

For additional details on the foreground mixture generation process, please refer to [1]. For additional details on generating the soundscapes with background, please refer to [2].


Files

The dataset files (39.5 GB in total) are split into the following compressed archives:

  • `synthetic-noisy-urban-soundscapes_mixtures-bkgr-none.tar.gz` - Noiseless mixtures
  • `synthetic-noisy-urban-soundscapes_mixtures-bkgr-n50dB.tar.gz` - Mixtures with -50 dB LUFS SNR
  • `synthetic-noisy-urban-soundscapes_mixtures-bkgr-n20dB.tar.gz` - Mixtures with -20 dB LUFS SNR
  • `synthetic-noisy-urban-soundscapes_mixtures-bkgr-0dB.tar.gz` - Mixtures with 0 dB LUFS SNR
  • `synthetic-noisy-urban-soundscapes_isolated_events.tar.gz` - Isolated sound events for each mixture


Preparing the files

  1. Download each of the tar.gz files to a new folder. You need at least the `isolated_events` archive and one of the mixture archives.
  2. Decompress all of the tar.gz files.
  3. Merge the contents of the extracted `isolated_events` folder into the extracted mixture folders. This ensures that the corresponding isolated events for each mixture are placed in its `XXXXX_events` folder.


File structure

The mixture dataset folder for the desired background condition should have the format `synthetic-noisy-urban-soundscapes_mixtures-bkgr-<condition>/<split>`. Within each split folder are the mixture files, which are identified by an integer (padded with leading zeros up to 5 places). For a mixture `00001`, the mixture audio is `00001.wav` and the annotation file (in JAMS [5] format) is `00001.jams`. The isolated events can be found in the `00001_events` folder, where foreground events have the format `foreground<fg-event-num>_<class-name>.wav` and the background recording (if used) is called `background0_2017.wav`.
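A small helper can assemble the expected paths for one mixture from this layout; the function is illustrative, not part of the dataset tooling.

```python
from pathlib import Path

def mixture_paths(root, condition, split, mix_id):
    """Return the audio, annotation, and events paths for one mixture."""
    base = (Path(root)
            / f"synthetic-noisy-urban-soundscapes_mixtures-bkgr-{condition}"
            / split)
    stem = f"{mix_id:05d}"  # integer id, zero-padded to 5 places
    return {
        "audio":  base / f"{stem}.wav",
        "jams":   base / f"{stem}.jams",
        "events": base / f"{stem}_events",
    }
```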


Contact

If you have any questions, comments, or concerns, please direct correspondence to Aurora Cramer (aurora (dot) linh (dot) cramer (at) gmail (dot) com).

 

References

[1] Pishdadian, F., Wichern, G., and Le Roux, J. (2020). Finding strength in weakness: Learning to separate sounds with weak supervision. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP).

[2] Cramer, A., Cartwright, M., Pishdadian, F., and Bello, J. P. (2021). Weakly Supervised Source-Specific Sound Level Estimation in Noisy Soundscapes. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[3] Salamon, J., Jacoby, C., and Bello, J. P. (2014). A dataset and taxonomy for urban sound research. In Proceedings of the ACM International Conference on Multimedia.

[4] Cramer, A., Cartwright, M., Pishdadian, F., and Bello, J. P. (2021). SONYC-Backgrounds: a collection of urban background recordings from an acoustic sensor network (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.5129078

[5] Humphrey, E. J., Salamon, J., Nieto, O., Forsyth, J., Bittner, R. M., and Bello, J. P. (2014). JAMS: A JSON Annotated Music Specification for Reproducible MIR Research. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR).


Acknowledgements

This work is partially supported by National Science Foundation awards 1633259 (https://www.nsf.gov/awardsearch/showAward?AWD_ID=1633259) and 1544753 (https://www.nsf.gov/awardsearch/showAward?AWD_ID=1544753).

 

