Multi-Modal Remote Sensing Object Detection Data
Description
This dataset accompanies the paper "Learning Annotation-Driven Latent Structure for Multi-Modal Fusion" and supports experiments on continuous latent space learning for multimodal and multitask representation learning. The dataset contains labeled examples designed to evaluate how different sensing modalities and annotation structures influence learned representations and downstream task performance.
The data are organized into two benchmark datasets:
- AWIR dataset: a multimodal remote sensing dataset with RGB and thermal imagery of different animal species, used for multimodal representation learning experiments.
- NEON dataset: a multimodal ecological remote sensing dataset derived from the National Ecological Observatory Network (NEON), used for multimodal tree analysis tasks.
Both datasets include annotations designed to support classification, geometric regression, and positional prediction tasks, enabling evaluation of latent spaces under both discrete and continuous supervision.
AWIR Dataset
The AWIR dataset contains paired RGB and thermal image observations of animals extracted from aerial imagery. Each image contains a single animal instance annotated with semantic and geometric information derived from the Aerial Wildlife Image Repository (https://scholarsjunction.msstate.edu/gri-publications/2/; see also https://academic.oup.com/database/article/doi/10.1093/database/baae070/7718812?login=false). The dataset was designed to evaluate multimodal representation learning methods that combine appearance and thermal signatures.
Modalities
Each sample includes two sensing modalities:
- RGB imagery: standard visible-spectrum imagery capturing texture, color, and shape information.
- Thermal imagery: long-wave infrared observations capturing heat signatures of animals.
Both modalities are spatially aligned and represent the same scene.
Processing
To create training samples suitable for representation learning, image patches were extracted such that each patch contains exactly one labeled object. For each annotated instance, a crop region was generated around the bounding box corresponding to the animal location. The crop size was fixed to maintain a consistent spatial resolution across samples.
To avoid introducing positional bias, the object was not always centered within the crop. Instead, the bounding box was randomly offset within the crop window while ensuring that the full object remained visible inside the patch. This procedure produces image samples where the object appears at different spatial locations within the image, preventing models from relying on fixed object positioning during training.
Each cropped patch retains the original annotation information, including the class label and bounding box coordinates relative to the crop region. The resulting dataset therefore contains paired RGB and thermal patches, each containing a single object at varying spatial positions, enabling evaluation of both semantic classification and geometric prediction tasks.
This processed version of the AWIR dataset was first introduced in "Multi-Task Learning with Multi-Annotation Triplet Loss for Improved Object Detection" (https://ieeexplore.ieee.org/abstract/document/11242980).
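The cropping procedure described above can be sketched as follows. This is a minimal illustration, not the authors' processing code: the function name `random_offset_crop`, the `(x_min, y_min, x_max, y_max)` box convention, and the use of integer pixel coordinates are all assumptions made for the example.

```python
import random


def random_offset_crop(bbox, crop_size, image_size, rng=random):
    """Choose a fixed-size crop that fully contains `bbox` at a random offset.

    Assumed conventions (not taken from the dataset itself):
      bbox       = (x_min, y_min, x_max, y_max) in integer image pixels
      crop_size  = (crop_w, crop_h)
      image_size = (img_w, img_h)
    Returns the crop origin (x0, y0) and the bbox relative to that origin.
    """
    x_min, y_min, x_max, y_max = bbox
    crop_w, crop_h = crop_size
    img_w, img_h = image_size

    if crop_w > img_w or crop_h > img_h:
        raise ValueError("crop window is larger than the image")
    if x_max - x_min > crop_w or y_max - y_min > crop_h:
        raise ValueError("object does not fit inside the crop window")

    # Range of crop origins that keep the whole box inside the crop
    # and the whole crop inside the image.
    lo_x, hi_x = max(0, x_max - crop_w), min(x_min, img_w - crop_w)
    lo_y, hi_y = max(0, y_max - crop_h), min(y_min, img_h - crop_h)

    # A uniform draw over the valid origins places the object at a
    # different position in each patch instead of always centering it.
    x0 = rng.randint(lo_x, hi_x)
    y0 = rng.randint(lo_y, hi_y)

    relative_bbox = (x_min - x0, y_min - y0, x_max - x0, y_max - y0)
    return (x0, y0), relative_bbox
```

Because the RGB and thermal images are spatially aligned, the same crop origin would be applied to both modalities so that the paired patches stay registered.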
Annotations
Each sample contains the following annotations:
- Class label: semantic category of the observed animal (e.g., cow, deer, horse).
- Bounding box coordinates: pixel coordinates describing the spatial extent of the animal.
- Derived box features: geometric attributes derived from bounding boxes, used for regression tasks.
- Position coordinates: normalized spatial coordinates representing the object center.
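As an illustration of how the geometric annotations relate to the bounding box, the sketch below computes a normalized center and a few box features from crop-relative coordinates. The description does not list which derived features the dataset stores, so width, height, area, and aspect ratio are assumed here purely as examples; the function name `box_targets` is also hypothetical.

```python
def box_targets(bbox, patch_size):
    """Compute example regression targets from a crop-relative bounding box.

    Assumed conventions (not taken from the dataset itself):
      bbox       = (x_min, y_min, x_max, y_max) in patch pixels
      patch_size = (patch_w, patch_h)
    """
    x_min, y_min, x_max, y_max = bbox
    patch_w, patch_h = patch_size

    width = x_max - x_min
    height = y_max - y_min
    if width <= 0 or height <= 0:
        raise ValueError("bounding box must have positive width and height")

    # Object center, normalized to [0, 1] by the patch dimensions.
    center_x = (x_min + x_max) / 2 / patch_w
    center_y = (y_min + y_max) / 2 / patch_h

    return {
        "center": (center_x, center_y),
        # Hypothetical derived features; the stored set may differ.
        "width": width,
        "height": height,
        "area": width * height,
        "aspect_ratio": width / height,
    }
```

For example, a box `(16, 8, 48, 24)` in a 64 x 64 patch has a normalized center of `(0.5, 0.25)`.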
NEON Dataset
The NEON dataset is derived from airborne remote sensing data collected by the National Ecological Observatory Network (NEON) and curated in the NeonTreeEvaluation Benchmark (https://zenodo.org/records/5914554). The dataset focuses on tree-level observations extracted from large remote sensing mosaics. Each sample corresponds to a spatial patch centered on a tree stem location.
Modalities
Each tree instance contains three remote sensing modalities:
- RGB imagery: high-resolution visible imagery capturing crown structure and texture.
- Hyperspectral imagery (HSI): multi-band spectral measurements capturing spectral signatures.
- LiDAR data: canopy height model (CHM) derived from three-dimensional structural measurements.
These modalities provide complementary information about vegetation structure and species composition.
Annotations
Each sample contains several annotations:
- Tree species label: discrete label representing the species of the tree.
- Crown bounding box: geometric information describing the spatial extent of the tree canopy.
- Tree height measurements: continuous attributes derived from LiDAR that describe canopy height, namely the minimum, maximum, mean, and standard deviation of the CHM within the patch.
These annotations enable both discrete and continuous prediction tasks.
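The tree height attributes are summary statistics of the CHM values inside a patch. A dependency-free sketch is shown below; the function name `chm_height_stats`, the nested-list input, the handling of NaN as a no-data value, and the choice of population (rather than sample) standard deviation are assumptions, not details confirmed by the dataset description.

```python
import math
import statistics


def chm_height_stats(chm_patch):
    """Summarize canopy heights in a 2D CHM patch (rows of floats, in meters).

    NaN entries are treated as no-data and skipped (an assumption).
    """
    values = [v for row in chm_patch for v in row if not math.isnan(v)]
    if not values:
        raise ValueError("CHM patch contains no valid height values")

    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        # Population standard deviation; the dataset may use the sample form.
        "std": statistics.pstdev(values),
    }
```

With NumPy arrays, the equivalent calls would be `np.nanmin`, `np.nanmax`, `np.nanmean`, and `np.nanstd`.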
Files
Total size: 238.7 MB

| MD5 checksum | Size |
|---|---|
| 1fd674be738110465e83e8adf93982ae | 129.7 MB |
| 09a9b2bd3d43e81438e71a83264b2eed | 109.1 MB |
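After downloading, each file can be checked against the MD5 checksums listed above. The sketch below uses only Python's standard `hashlib`; the helper name `md5_of_file` and the example path in the usage note are hypothetical, since the file names are not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download is intact if, for example, `md5_of_file("downloaded_file")` equals `1fd674be738110465e83e8adf93982ae` or `09a9b2bd3d43e81438e71a83264b2eed`, depending on which file was fetched (the path here is a placeholder).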
Additional details
Software
- Repository URL: https://github.com/GatorSense/MMCTL
- Programming language: Python
- Development status: Active