Songdo Vision: Vehicle Annotations from High-Altitude BeV Drone Imagery in a Smart City

Fonod, Robert; Cho, Haechan; Yeo, Hwasoo; Geroliminis, Nikolas

doi:10.5281/zenodo.13828408

Published March 17, 2025 | Version v1

Dataset Open

1. École Polytechnique Fédérale de Lausanne
2. Korea Advanced Institute of Science and Technology

Overview

The Songdo Vision dataset provides high-resolution (4K, 3840×2160 pixels) RGB images annotated with axis-aligned bounding boxes for vehicle detection from a high-altitude bird's-eye view (BeV) perspective.

It comprises 5,419 annotated video frames (4,335 training / 1,084 test; 80/20 split) containing 272,435 vehicle instances across four categories:

Car (including vans and light-duty vehicles)
Bus
Truck
Motorcycle

Frames were sampled from footage collected during a multi-drone experiment in Songdo International Business District, South Korea (October 4–7, 2022). A fleet of 10 DJI Mavic 3 drones flew at 140–150 m altitude and recorded at 29.97 FPS, primarily covering 20 busy intersections and some of the roads between them.

Annotations are provided in three widely supported formats: COCO JSON, YOLO TXT, and Pascal VOC XML. A full dataset card with usage instructions and download guidance is available on Hugging Face.

📌 Citation: If you use this dataset in your work, kindly acknowledge it by citing the following article:

Robert Fonod, Haechan Cho, Hwasoo Yeo, Nikolas Geroliminis (2025). Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery, Transportation Research Part C: Emerging Technologies, vol. 178, 105205. DOI: 10.1016/j.trc.2025.105205.

🔗 Related dataset: For precisely georeferenced vehicle trajectories extracted from the same large-scale multi-drone experiment, see Songdo Traffic: 10.5281/zenodo.13828384.

Motivation

Publicly available datasets for aerial vehicle detection often exhibit limitations such as:

Non-BeV perspectives with varying angles and distortions
Inconsistent annotation quality, with loose or missing bounding boxes
Lower-resolution imagery, reducing detection accuracy, particularly for smaller vehicles
Lack of annotation detail, especially for motorcycles in dense urban scenes with complex backgrounds

To address these challenges, Songdo Vision provides high-quality human-annotated bounding boxes, with machine learning assistance used to enhance efficiency and consistency. This ensures accurate and reliable ground truth for training and evaluating detection models.

Dataset Composition

The dataset is randomly split into training (80%) and test (20%) subsets:

Subset	Images	Car	Bus	Truck	Motorcycle	Total Vehicles
Train	4,335	195,539	7,030	11,779	2,963	217,311
Test	1,084	49,508	1,759	3,052	805	55,124
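
As a quick consistency check on the table above, the following sketch recomputes the subset totals and the share of each class from the listed counts. The dictionary layout and the function names are illustrative assumptions and are not part of the dataset tooling.

```python
# Per-class instance counts copied from the composition table above.
COUNTS = {
    "train": {"car": 195_539, "bus": 7_030, "truck": 11_779, "motorcycle": 2_963},
    "test": {"car": 49_508, "bus": 1_759, "truck": 3_052, "motorcycle": 805},
}
IMAGES = {"train": 4_335, "test": 1_084}


def subset_total(subset):
    """Total number of vehicle instances in one subset."""
    return sum(COUNTS[subset].values())


def class_share(cls):
    """Fraction of all annotated instances (train + test) belonging to cls."""
    total = sum(subset_total(s) for s in COUNTS)
    return sum(COUNTS[s][cls] for s in COUNTS) / total


if __name__ == "__main__":
    for subset in COUNTS:
        per_image = subset_total(subset) / IMAGES[subset]
        print(f"{subset}: {subset_total(subset)} vehicles, {per_image:.1f} per image")
    for cls in COUNTS["train"]:
        print(f"{cls}: {class_share(cls):.2%} of all instances")
```

The shares make the class imbalance explicit: motorcycles account for well under 2% of all instances, which may be worth keeping in mind when choosing evaluation metrics.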

A subset of 5,274 frames was randomly sampled from drone video sequences, while an additional 145 frames were carefully selected to represent challenging cases, such as motorcycles at pedestrian crossings, in bicycle lanes, near traffic light poles, and around other distinctive road markers where they may blend into the urban environment.

Data Collection

The dataset was collected as part of a collaborative multi-drone experiment conducted by KAIST and EPFL in Songdo, South Korea, from October 4–7, 2022.

A fleet of 10 drones monitored 20 busy intersections, executing advanced flight plans to optimize coverage.
4K (3840×2160) RGB video footage was recorded at 29.97 FPS from altitudes of 140–150 meters.
Each drone flew 10 sessions per day, covering peak morning and afternoon periods.
The experiment resulted in 12 TB of 4K raw video data.

More details on the experimental setup and data processing pipeline are available in [1].

Bounding Box Annotations & Formats

Annotations were generated using a semi-automated object detection annotation process in Azure ML Studio, leveraging machine learning-assisted bounding box detection with human verification to ensure precision.

Each annotated frame includes categorized, axis-aligned bounding boxes, stored in three widely used formats:

1. COCO JSON format

Single annotation file per dataset subset (i.e., one for training, one for testing).
Contains metadata such as image dimensions, bounding box coordinates, and class labels.
Example snippet:

{
  "images": [{"id": 1, "file_name": "0001.jpg", "width": 3840, "height": 2160}],
  "annotations": [{"id": 1, "image_id": 1, "category_id": 2, "bbox": [500, 600, 200, 50], "area": 10000, "iscrowd": 0}],
  "categories": [
    {"id": 1, "name": "car"}, {"id": 2, "name": "bus"},
    {"id": 3, "name": "truck"}, {"id": 4, "name": "motorcycle"}
  ]
}
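
A minimal sketch of reading this structure with the Python standard library is shown below. It relies on the standard COCO convention that `bbox` is `[x_min, y_min, width, height]` in pixels; the function names `load_coco` and `boxes_by_file` are hypothetical helpers, and the example snippet is inlined so that the code runs without the dataset.

```python
import json
from collections import defaultdict


def load_coco(path):
    """Read a COCO annotation file, e.g. train/coco_annotations.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def boxes_by_file(coco):
    """Return {file_name: [(class_name, xmin, ymin, xmax, ymax), ...]}."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    files = {im["id"]: im["file_name"] for im in coco["images"]}
    boxes = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO convention: top-left corner, then size
        boxes[files[ann["image_id"]]].append(
            (names[ann["category_id"]], x, y, x + w, y + h)
        )
    return dict(boxes)


# The example snippet from above, inlined for illustration.
EXAMPLE = {
    "images": [{"id": 1, "file_name": "0001.jpg", "width": 3840, "height": 2160}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 2,
                     "bbox": [500, 600, 200, 50], "area": 10000, "iscrowd": 0}],
    "categories": [
        {"id": 1, "name": "car"}, {"id": 2, "name": "bus"},
        {"id": 3, "name": "truck"}, {"id": 4, "name": "motorcycle"},
    ],
}

if __name__ == "__main__":
    print(boxes_by_file(EXAMPLE))
```

For the real files, `boxes_by_file(load_coco("train/coco_annotations.json"))` would be the corresponding call, assuming the archive was extracted into the working directory.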

2. YOLO TXT format

One annotation file per image, following the format:

<class_id> <x_center> <y_center> <width> <height>

Bounding box values are normalized to [0, 1] relative to the image width and height, with the origin at the top-left corner.
Example snippet:

0 0.52 0.63 0.10 0.05  # Car bounding box
2 0.25 0.40 0.15 0.08  # Truck bounding box
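
To recover pixel coordinates from a YOLO line, the normalized center and size are scaled by the image dimensions. The sketch below assumes the 3840×2160 frame size stated in this description and a zero-based class order of car, bus, truck, motorcycle (consistent with the example comments above, but `names.txt` remains the authoritative source). The function name `yolo_to_pixels` is a hypothetical helper.

```python
IMG_W, IMG_H = 3840, 2160  # frame size stated in the dataset description

# Assumed class order; check names.txt in the dataset before relying on it.
CLASS_NAMES = ["car", "bus", "truck", "motorcycle"]


def yolo_to_pixels(line, img_w=IMG_W, img_h=IMG_H):
    """Convert '<class_id> <xc> <yc> <w> <h>' to (class_id, xmin, ymin, xmax, ymax)."""
    fields = line.split("#", 1)[0].split()  # tolerate trailing comments
    class_id = int(fields[0])
    xc, yc, w, h = (float(v) for v in fields[1:5])
    xmin = (xc - w / 2) * img_w
    ymin = (yc - h / 2) * img_h
    xmax = (xc + w / 2) * img_w
    ymax = (yc + h / 2) * img_h
    return class_id, xmin, ymin, xmax, ymax


if __name__ == "__main__":
    for text in ("0 0.52 0.63 0.10 0.05", "2 0.25 0.40 0.15 0.08"):
        class_id, *box = yolo_to_pixels(text)
        print(CLASS_NAMES[class_id], [round(v, 1) for v in box])
```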

3. Pascal VOC XML format

One annotation file per image, structured in XML.
Contains image properties and absolute pixel coordinates for each bounding box.
Example snippet:

<annotation>
  <filename>0001.jpg</filename>
  <size><width>3840</width><height>2160</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>500</xmin><ymin>600</ymin><xmax>600</xmax><ymax>650</ymax></bndbox>
  </object>
</annotation>
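
A Pascal VOC file of this shape can be read with Python's built-in `xml.etree.ElementTree`. The sketch below parses the example snippet from above; the function name `parse_voc` is a hypothetical helper, and only the tags shown in the example are assumed to be present.

```python
import xml.etree.ElementTree as ET

# The example snippet from above, inlined for illustration.
EXAMPLE = """
<annotation>
  <filename>0001.jpg</filename>
  <size><width>3840</width><height>2160</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>500</xmin><ymin>600</ymin><xmax>600</xmax><ymax>650</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc(xml_text):
    """Return (file_name, (width, height), [(name, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    dims = (int(size.findtext("width")), int(size.findtext("height")))
    objects = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        corners = [int(float(bndbox.findtext(tag)))
                   for tag in ("xmin", "ymin", "xmax", "ymax")]
        objects.append((obj.findtext("name"), *corners))
    return root.findtext("filename"), dims, objects


if __name__ == "__main__":
    print(parse_voc(EXAMPLE))
```

For a file on disk, `parse_voc(open(path, encoding="utf-8").read())` would be the equivalent call.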

File Structure

The dataset is provided as two compressed archives:

1. Training Data (train.zip, 12.91 GB)

train/
│── coco_annotations.json  # COCO format
│── images/
│   ├── 0001.jpg
│   ├── ...
│── labels/
│   ├── 0001.txt  # YOLO format
│   ├── 0001.xml  # Pascal VOC format
│   ├── ...

2. Testing Data (test.zip, 3.22 GB)

test/
│── coco_annotations.json
│── images/
│   ├── 00027.jpg
│   ├── ...
│── labels/
│   ├── 00027.txt
│   ├── 00027.xml
│   ├── ...
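
Given the layout above, each image can be paired with its YOLO and Pascal VOC label files by matching file stems. The following sketch assumes both archives were extracted into the current directory; `iter_samples` and `missing_labels` are hypothetical helper names. Whether every image is guaranteed to have both label files is not stated here, so the check is only a sanity test.

```python
from pathlib import Path


def iter_samples(subset_dir):
    """Yield (image, yolo_label, voc_label) paths for one extracted subset."""
    subset_dir = Path(subset_dir)
    for image in sorted((subset_dir / "images").glob("*.jpg")):
        yolo = subset_dir / "labels" / f"{image.stem}.txt"
        voc = subset_dir / "labels" / f"{image.stem}.xml"
        yield image, yolo, voc


def missing_labels(subset_dir):
    """Return names of images whose YOLO or VOC label file is absent."""
    return [
        image.name
        for image, yolo, voc in iter_samples(subset_dir)
        if not (yolo.is_file() and voc.is_file())
    ]


if __name__ == "__main__":
    for subset in ("train", "test"):  # assumed extraction directories
        if Path(subset).is_dir():
            print(subset, "images without both labels:", missing_labels(subset))
```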

Additional Files

README.md – Dataset documentation (this description)
LICENSE.txt – Creative Commons Attribution 4.0 License
names.txt – Class names (one per line)
data.yaml – Example YOLO configuration file for training/testing

Acknowledgments

In addition to the funding sources listed in the metadata, the creators express their gratitude to Artem Vasilev for his dedicated efforts in data annotation. We also thank the research teams of Prof. Simon Oh (Korea University) and Prof. Minju Park (Hannam University) for their assistance during the data collection campaign, including the provision of drone equipment and student support.

Citation & Attribution

Preferred Citation: If you use Songdo Vision for any purpose, whether academic research, commercial applications, open-source projects, or benchmarking efforts, please cite our accompanying article [1]:

Robert Fonod, Haechan Cho, Hwasoo Yeo, Nikolas Geroliminis (2025). Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery, Transportation Research Part C: Emerging Technologies, vol. 178, 105205. DOI: 10.1016/j.trc.2025.105205

BibTeX entry:
@article{fonod2025advanced,
  title = {Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery}, 
  author = {Fonod, Robert and Cho, Haechan and Yeo, Hwasoo and Geroliminis, Nikolas},
  journal = {Transportation Research Part C: Emerging Technologies},
  volume = {178},
  pages = {105205},
  year = {2025},
  publisher = {Elsevier},
  doi = {10.1016/j.trc.2025.105205},
  url = {https://doi.org/10.1016/j.trc.2025.105205}
}

Dataset Citation (for archival purposes): Although Zenodo automatically provides a formal citation for this dataset (see below), including citation export in various formats such as BibTeX, we kindly request that you reference the above article as the primary source of this work.

Robert Fonod, Haechan Cho, Hwasoo Yeo, Nikolas Geroliminis (2025). Songdo Vision: Vehicle Annotations from High-Altitude BeV Drone Imagery in a Smart City (v1). Zenodo. DOI: 10.5281/zenodo.13828408.

Files (16.1 GB)

Name	MD5	Size
data.yaml	7f7a0bafde49260308ee25f6898f78fe	1.7 kB
LICENSE.txt	527dc6cad772ccb187d5bfe5af738204	18.7 kB
names.txt	5c0695704b8c476390bc730f969c8644	24 Bytes
README.md	c45a4e28094d8b2f9fc9002df41d5550	8.0 kB
test.zip	815776743546439785d3aba36a0c74f6	3.2 GB
train.zip	ae3a8762e728a988593ee51055f7c49d	12.9 GB
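
The MD5 checksums listed above can be used to confirm that the downloads, in particular the two large archives, arrived intact. The sketch below hashes files in chunks with Python's `hashlib` so that the archives need not fit in memory. The function names and the assumption that all files sit in a single download directory are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "data.yaml": "7f7a0bafde49260308ee25f6898f78fe",
    "LICENSE.txt": "527dc6cad772ccb187d5bfe5af738204",
    "names.txt": "5c0695704b8c476390bc730f969c8644",
    "README.md": "c45a4e28094d8b2f9fc9002df41d5550",
    "test.zip": "815776743546439785d3aba36a0c74f6",
    "train.zip": "ae3a8762e728a988593ee51055f7c49d",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(download_dir="."):
    """Return {file_name: True, False, or None}; None means the file is absent."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(download_dir) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in verify().items():
        print(f"{name}: {labels[status]}")
```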

Additional details

Funding

Board of the Swiss Federal Institutes of Technology: Open Research Data (ORD) Program of the ETH Board
Swiss National Science Foundation: NCCR Automation (phase I), 180545
Innosuisse – Swiss Innovation Agency: CityDronics, 101.645 IP-ENG
National Research Foundation of Korea: Grant funded by the Korean government (MSIT), 2022R1A2C1012380

Dates

Collected: 2022-10-04/2022-10-07 (collected by a fleet of 10 DJI Mavic 3 drones)
