Multi-Modal Remote Sensing Object Detection Data

Zhou, Meilun

doi:10.5281/zenodo.19040867

Published March 16, 2026 | Version v1

Dataset Open

Zhou, Meilun (Contact person)¹

1. University of Florida

This dataset accompanies the paper "Learning Annotation-Driven Latent Structure for Multi-Modal Fusion" and supports experiments on continuous latent space learning for multimodal, multitask representation learning. It contains labeled examples designed to evaluate how different sensing modalities and annotation structures influence learned representations and downstream task performance.

The data are organized into two benchmark datasets:

AWIR dataset – a multimodal remote sensing dataset with RGB and thermal imagery of different animal species, used for multimodal representation learning experiments.
NEON dataset – a multimodal ecological remote sensing dataset derived from the National Ecological Observatory Network (NEON), used for multimodal tree analysis tasks.

Both datasets include annotations designed to support classification, geometric regression, and positional prediction tasks, enabling evaluation of latent spaces under both discrete and continuous supervision.

AWIR Dataset

The AWIR dataset contains paired RGB and thermal observations of animals extracted from aerial imagery. Each image contains a single animal instance annotated with semantic and geometric information derived from the Aerial Wildlife Image Repository (https://scholarsjunction.msstate.edu/gri-publications/2/; see also https://academic.oup.com/database/article/doi/10.1093/database/baae070/7718812?login=false). The dataset was designed to evaluate multimodal representation learning methods that combine appearance and thermal signatures.

Modalities

Each sample includes two sensing modalities:

RGB imagery – standard visible spectrum imagery capturing texture, color, and shape information.
Thermal imagery – long-wave infrared observations capturing heat signatures of animals.

Both modalities are spatially aligned and represent the same scene.

Processing

To create training samples suitable for representation learning, image patches were extracted such that each patch contains exactly one labeled object. For each annotated instance, a crop region was generated around the bounding box corresponding to the animal location. The crop size was fixed to maintain a consistent spatial resolution across samples.

To avoid introducing positional bias, the object was not always centered within the crop. Instead, the bounding box was randomly offset within the crop window while ensuring that the full object remained visible inside the patch. This procedure produces image samples where the object appears at different spatial locations within the image, preventing models from relying on fixed object positioning during training.

Each cropped patch retains the original annotation information, including the class label and bounding box coordinates relative to the crop region. The resulting dataset therefore contains paired RGB and thermal patches, each containing a single object at varying spatial positions, enabling evaluation of both semantic classification and geometric prediction tasks.
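The cropping procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names sample_crop_origin and crop_pair, the (x_min, y_min, x_max, y_max) box convention, and the uniform sampling of the offset are all assumptions made for the example. The same crop origin is applied to both modalities, which relies on the RGB and thermal images being spatially aligned.

```python
import numpy as np


def sample_crop_origin(box, image_shape, crop_size, rng):
    """Pick a random top-left corner so the crop contains the whole box.

    box is (x_min, y_min, x_max, y_max) in integer pixel coordinates
    (an assumed convention, not confirmed by the dataset description).
    """
    height, width = image_shape[:2]
    x_min, y_min, x_max, y_max = box
    if crop_size > height or crop_size > width:
        raise ValueError("crop window is larger than the image")
    if x_max - x_min > crop_size or y_max - y_min > crop_size:
        raise ValueError("box does not fit inside the crop window")
    # The corner must not pass the box's near edge, must keep the box's far
    # edge inside the window, and must keep the window inside the image.
    left_lo, left_hi = max(0, x_max - crop_size), min(x_min, width - crop_size)
    top_lo, top_hi = max(0, y_max - crop_size), min(y_min, height - crop_size)
    left = int(rng.integers(left_lo, left_hi + 1))  # upper bound is exclusive
    top = int(rng.integers(top_lo, top_hi + 1))
    return left, top


def crop_pair(rgb, thermal, box, crop_size, rng):
    """Crop aligned RGB and thermal images with one shared random offset."""
    left, top = sample_crop_origin(box, rgb.shape, crop_size, rng)
    window = (slice(top, top + crop_size), slice(left, left + crop_size))
    x_min, y_min, x_max, y_max = box
    # Re-express the box relative to the crop's top-left corner.
    crop_box = (x_min - left, y_min - top, x_max - left, y_max - top)
    return rgb[window], thermal[window], crop_box


# Toy inputs standing in for one aligned RGB/thermal pair.
rng = np.random.default_rng(0)
rgb = np.zeros((100, 120, 3), dtype=np.uint8)
thermal = np.zeros((100, 120), dtype=np.float32)
rgb_patch, thermal_patch, crop_box = crop_pair(rgb, thermal, (50, 40, 70, 60), 64, rng)
print(rgb_patch.shape, thermal_patch.shape, crop_box)
```

Because the offset is drawn independently for each instance, the box lands at a different position in each patch while its width and height are unchanged.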

This processed version of the AWIR dataset was first introduced in "Multi-Task Learning with Multi-Annotation Triplet Loss for Improved Object Detection" (https://ieeexplore.ieee.org/abstract/document/11242980).

Annotations

Each sample contains the following annotations:

Class label – semantic category of the observed animal (e.g., cow, deer, horse).
Bounding box coordinates – pixel coordinates describing the spatial extent of the animal.
Derived box features – geometric attributes derived from the bounding boxes, used for regression tasks.
Position coordinates – normalized spatial coordinates representing the object center.
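As a rough illustration of how the derived box features and position coordinates relate to a bounding box, the sketch below computes a few common quantities. The dataset description does not say which geometric attributes are stored, so the particular feature set here (width, height, area, aspect ratio), the (x_min, y_min, x_max, y_max) box convention, the normalization by patch size, and the function name box_features are all assumptions.

```python
def box_features(box, patch_size):
    """Compute illustrative geometric features for one bounding box.

    box is (x_min, y_min, x_max, y_max) in pixels within a square patch of
    side patch_size. Both conventions are assumptions for this example.
    """
    x_min, y_min, x_max, y_max = box
    width = x_max - x_min
    height = y_max - y_min
    return {
        "width": width,
        "height": height,
        "area": width * height,
        "aspect_ratio": width / height,
        # Box center scaled to [0, 1] by the patch size.
        "center_x": (x_min + x_max) / 2 / patch_size,
        "center_y": (y_min + y_max) / 2 / patch_size,
    }


features = box_features((10, 20, 30, 60), patch_size=100)
print(features)
```

Normalizing the center by the patch size keeps the position target independent of image resolution, which is one plausible reading of "normalized spatial coordinates".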

NEON Dataset

The NEON dataset is derived from airborne remote sensing data collected by the National Ecological Observatory Network (NEON) and curated in the NeonTreeEvaluation Benchmark (https://zenodo.org/records/5914554). The dataset focuses on tree-level observations extracted from large remote sensing mosaics. Each sample corresponds to a spatial patch centered on a tree stem location.

Modalities

Each tree instance contains three remote sensing modalities:

RGB imagery – high-resolution visible imagery capturing crown structure and texture.
Hyperspectral imagery (HSI) – multi-band spectral measurements capturing spectral signatures.
LiDAR data – canopy height model (CHM) derived from three-dimensional structural measurements.

These modalities provide complementary information about vegetation structure and species composition.

Annotations

Each sample contains several annotations:

Tree species label – discrete label representing the species of the tree.
Crown bounding box – geometric information describing the spatial extent of the tree canopy.
Tree height measurements – continuous attributes derived from LiDAR that summarize canopy height: the minimum, maximum, mean, and standard deviation of the CHM within the patch.

These annotations enable both discrete and continuous prediction tasks.
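The height attributes listed above are simple summary statistics of the CHM patch. A minimal sketch of that computation is shown below. The function name chm_height_stats is hypothetical, and two details are assumptions rather than documented behavior: missing pixels are taken to be NaN and are ignored, and the standard deviation is the population form (NumPy's default, ddof=0).

```python
import numpy as np


def chm_height_stats(chm_patch):
    """Summarize a canopy height model patch with four scalar statistics.

    NaN values are treated as missing data and skipped (an assumption).
    """
    values = np.asarray(chm_patch, dtype=float)
    return {
        "min": float(np.nanmin(values)),
        "max": float(np.nanmax(values)),
        "mean": float(np.nanmean(values)),
        "std": float(np.nanstd(values)),  # population standard deviation
    }


# Toy 2 x 2 patch of canopy heights in metres.
stats = chm_height_stats([[1.0, 2.0], [3.0, 4.0]])
print(stats)
```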

Files (238.7 MB)

Name	MD5	Size
awir_data.npz	1fd674be738110465e83e8adf93982ae	129.7 MB
neon_data.npz	09a9b2bd3d43e81438e71a83264b2eed	109.1 MB
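After downloading, the published MD5 checksums can be used to confirm that the files are intact, and the archives can be inspected with NumPy. The sketch below assumes the two files sit in the current working directory. The array names stored inside each archive are not documented on this page, so the code lists whatever keys are present instead of assuming any. The helper names md5_of_file and summarize_npz are hypothetical.

```python
import hashlib
import os

import numpy as np

# Checksums as published on the dataset record.
EXPECTED_MD5 = {
    "awir_data.npz": "1fd674be738110465e83e8adf93982ae",
    "neon_data.npz": "09a9b2bd3d43e81438e71a83264b2eed",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def summarize_npz(path):
    """Map each array name in an .npz archive to its (shape, dtype)."""
    # np.load defaults to allow_pickle=False; if an archive holds object
    # arrays this raises, and pickling should only be enabled for trusted files.
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }


if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        if not os.path.exists(name):
            print(f"{name}: not found, skipping")
            continue
        status = "OK" if md5_of_file(name) == expected else "MISMATCH"
        print(f"{name}: checksum {status}")
        for key, (shape, dtype) in summarize_npz(name).items():
            print(f"  {key}: shape={shape}, dtype={dtype}")
```

Reading in chunks keeps memory use flat for the larger archive, and listing the keys first is a safe way to discover the layout before writing any loading code around it.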

Additional details

Repository URL: https://github.com/GatorSense/MMCTL
Programming language: Python
Development Status: Active

