Published May 22, 2022 | Version 1.0
Open Access Dataset

Urban Sound & Sight (Urbansas) - Labeled set



Urban Sound & Sight (Urbansas)

Version 1.0, May 2022

Created by
Magdalena Fuentes (1, 2), Bea Steers (1, 2), Pablo Zinemanas (3), Martín Rocamora (4), Luca Bondi (5), Julia Wilkins (1, 2), Qianyi Shi (2), Yao Hou (2), Samarjit Das (5), Xavier Serra (3), Juan Pablo Bello (1, 2)
1. Music and Audio Research Lab, New York University
2. Center for Urban Science and Progress, New York University
3. Universitat Pompeu Fabra, Barcelona, Spain
4. Universidad de la República, Montevideo, Uruguay
5. Bosch Research, Pittsburgh, PA, USA


If using this data in academic work, please cite the following paper, which presented this dataset:
M. Fuentes, B. Steers, P. Zinemanas, M. Rocamora, L. Bondi, J. Wilkins, Q. Shi, Y. Hou, S. Das, X. Serra, J. Bello. “Urban Sound & Sight: Dataset and Benchmark for Audio-Visual Urban Scene Understanding”. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.


Urbansas is a dataset for the development and evaluation of machine listening systems for audiovisual spatial urban understanding. One of the main challenges to this field of study is a lack of realistic, labeled data to train and evaluate models on their ability to localize using a combination of audio and video.
We set four main goals in creating this dataset:
1. To compile a set of real-world audio-visual recordings;
2. To provide stereo audio, enabling exploration of sound localization in the wild;
3. To cover varied scenes and recording conditions, so the collection is meaningful for training and evaluating machine learning models;
4. To accompany the labeled collection with a larger unlabeled collection of similar characteristics, enabling exploration of self-supervised learning in urban contexts.
Audiovisual data
We have compiled and manually annotated Urbansas from two publicly available datasets, plus the addition of unreleased material. The public datasets are the TAU Urban Audio-Visual Scenes 2021 Development dataset (street-traffic subset) and the Montevideo Audio-Visual Dataset (MAVD):

Wang, Shanshan, et al. "A curated dataset of urban scenes for audio-visual scene analysis." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

Zinemanas, Pablo, Pablo Cancela, and Martín Rocamora. "MAVD: A dataset for sound event detection in urban environments." Detection and Classification of Acoustic Scenes and Events (DCASE 2019), New York, NY, USA, pp. 263–267, 2019.

The TAU dataset consists of 10-second segments of audio and video from different scenes across European cities, traffic being one of the scenes; only the scenes labeled as traffic were included in Urbansas. MAVD is an audio-visual traffic dataset curated at different locations in Montevideo, Uruguay, with annotations of vehicle and vehicle-component sounds (e.g. engine, brakes) for sound event detection. Besides the published datasets, we include a total of 9.5 hours of unpublished material recorded in Montevideo with the same recording devices as MAVD, but covering new locations and scenes.

Recordings for TAU were acquired using a GoPro Hero 5 (30fps, 1280x720) and a Soundman OKM II Klassik/studio A3 electret binaural in-ear microphone with a Zoom F8 audio recorder (48kHz, 24 bits, stereo). Recordings for MAVD were collected using a GoPro Hero 3 (24fps, 1920x1080) and a SONY PCM-D50 recorder (48kHz, 24 bits, stereo). 

In total, Urbansas includes 15 hours of stereo audio and video, stored as separate 10-second MPEG4 (1280x720, 24fps) and WAV (48kHz, 24-bit, 2-channel) files. Both released video datasets were already anonymized to obscure people and license plates; the unpublished MAVD data was anonymized similarly using this anonymizer. We also distribute the 2fps video used for producing the annotations.

The audio and video files share the same filename stem, so they can be associated after removing the parent directory and extension. In both cases the filename includes a location_id, which comprises the city and an ID number.

      City    Places   Clips   Mins     Frames   Labeled mins
Montevideo         8    4085    681    980,400             92
 Stockholm         3      91     15     21,840              2
 Barcelona         4     144     24     34,560             24
  Helsinki         4     144     24     34,560             16
    Lisbon         4     144     24     34,560             19
      Lyon         4     144     24     34,560              6
     Paris         4     144     24     34,560              2
    Prague         4     144     24     34,560              2
    Vienna         4     144     24     34,560              6
    London         5     144     24     34,560              4
     Milan         6     144     24     34,560              6
     Total        50    5472    912  1,313,280            180


Of the 15 hours of audio and video, 3 hours (1.5 hours TAU, 1.5 hours MAVD) were manually annotated by our team in both audio and video; the remaining 12 hours (2.5 hours TAU, 9.5 hours of unpublished material) are left unlabeled for the benefit of unsupervised models. The distribution of clips across locations was selected to maximize variance across scenes. Annotations were collected at 2 frames per second (FPS), a rate that balances temporal granularity against clip coverage.

The annotation data is contained in video_annotations.csv and audio_annotations.csv. 

Video Annotations

Each row in the video annotations represents a single object in a single frame of the video. The annotation schema is as follows:

  • frame_id: The index of the frame within the clip the annotation is associated with. This index is 0-based and goes up to 19 (assuming 10-second clips with annotations at 2 FPS)
  • track_id: The ID of the detected instance that identifies the same object across different frames. These IDs are guaranteed to be unique within a clip.
  • x, y, w, h: The top-left corner and width and height of the object’s bounding box in the video. The values are given in absolute coordinates with respect to the image size (1280x720). 
  • class_id: The integer class index, one of [0, 1, 2, 3, -1] — see label for the index mapping. A value of -1 marks rows that carry no object event but still hold clip-level annotations, such as night and city. When operating on bounding boxes, rows with class_id of -1 should be filtered out.
  • label: The label text. This is equivalent to LABELS[class_id], where LABELS=[car, bus, motorbike, truck, -1]. The label -1 has the same role as above.
  • visibility: The visibility of the object: 1 while the object is visible, 0 while it is obstructed.
  • filename: The file ID of the associated file. This is the file’s path minus the parent directory and extension.
  • city: The city where the clip was collected.
  • location_id: The specific name of the location. This may include an integer ID following the city name for cases where there are multiple collection points.
  • time: The time (in seconds) of the annotation, relative to the start of the file. Equivalent to frame_id / fps .
  • night: Whether the clip takes place during the day or at night. This value is constant for the whole clip.
  • subset: Which data source the data originally belongs to (TAU or MAVD).
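As a sketch of how these columns might be consumed (the column names are as documented above; the CSV filename and fps default follow the description, but the helper itself is hypothetical), the following drops the clip-level -1 rows and recovers each annotation's time from its frame index:

```python
import pandas as pd

def load_video_boxes(csv_path, fps=2.0):
    """Load Urbansas-style video annotations, keeping only object rows.

    Rows with class_id == -1 carry only clip-level info (e.g. night, city)
    and hold no bounding box, so they are dropped. The annotation time is
    reconstructed as frame_id / fps, matching the documented `time` column.
    """
    df = pd.read_csv(csv_path)
    boxes = df[df["class_id"] != -1].copy()
    boxes["time"] = boxes["frame_id"] / fps
    return boxes
```

For example, `load_video_boxes("video_annotations.csv")` would yield one row per object per annotated frame, with x, y, w, h in absolute 1280x720 pixel coordinates.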

Audio Annotations

Each row represents a single object instance, along with the time range that it exists within the clip. The annotation schema is as follows:

  • filename: The file ID of the associated audio file. See filename above. 
  • class_id, label: See above. Audio has an additional class_id of 4 (label=offscreen) which indicates an off-screen vehicle - meaning a vehicle that is heard but not seen. A class_id of -1 indicates a clip-level annotation for a clip that has no object annotations (an empty scene).
  • non_identifiable_vehicle_sound: True if the region contains the sound of vehicles where individual instances cannot be uniquely identified. 
  • start, end: The start and end times (in seconds) of the annotation relative to the file. 
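Analogously, the audio events could be read and filtered; this hypothetical helper (column names as documented above, CSV filename from the description) keeps only on-screen vehicle events whose instances are individually identifiable:

```python
import pandas as pd

OFFSCREEN_ID = 4   # label "offscreen": a vehicle heard but not seen
EMPTY_ID = -1      # clip-level row for a clip with no object annotations

def load_onscreen_events(csv_path):
    """Return labeled (start, end) events for vehicles both heard and seen."""
    df = pd.read_csv(csv_path)
    mask = (
        ~df["class_id"].isin([OFFSCREEN_ID, EMPTY_ID])
        & ~df["non_identifiable_vehicle_sound"].astype(bool)
    )
    return df[mask][["filename", "label", "start", "end"]]
```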

Conditions of use

Dataset created by Magdalena Fuentes, Bea Steers, Pablo Zinemanas, Martín Rocamora, Luca Bondi, Julia Wilkins, Qianyi Shi, Yao Hou, Samarjit Das, Xavier Serra, and Juan Pablo Bello.

The Urbansas dataset is offered free of charge under the following terms:

  • Urbansas annotations are released under the CC BY 4.0 license
  • Urbansas video and audio retain the licenses of their original sources:
    •    The MAVD subset is released under CC BY 4.0
    •    The TAU subset is released under a non-commercial license


Please help us improve Urbansas by sending your feedback to:

  • Magdalena Fuentes:
  • Bea Steers: 

In case of a problem, please include as many details as possible.


This work was partially supported by the National Science Foundation award 1955357 and Bosch RTC.


Files (9.7 GB)
