CloudSEN12 - a global dataset for semantic understanding of cloud and cloud shadow in Sentinel-2
Creators
- 1. ZGIS Salzburg University
- 2. National University of San Marcos
- 3. Research Group on Artificial Intelligence, Pontifical Catholic University of Peru
- 4. Sub-directorate of Atmospheric and Hydrospheric Sciences, Geophysical Institute of Peru
- 5. Remote Sensing Centre for Earth Systems Research (RSC4Earth)
- 6. Image Processing Laboratory, University of Valencia
Description
CloudSEN12 is a large dataset for cloud semantic understanding that consists of 9880 regions of interest (ROIs). Each ROI has five 5090 x 5090 m image patches (IPs) collected on different dates; we manually selected the images so that each IP within an ROI matches one of the following cloud cover groups:
- clear (0%)
- low-cloudy (1% - 25%)
- almost clear (25% - 45%)
- mid-cloudy (45% - 65%)
- cloudy (> 65%)
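As a minimal sketch, the grouping above can be written as a lookup on the cloud cover percentage of an IP. The function name `cloud_cover_group` is hypothetical and not part of the dataset tooling. Because the listed ranges share their endpoints (25%, 45%, 65%), the sketch assumes that each upper bound is inclusive and that any non-zero cover up to 25% counts as low-cloudy; the dataset authors may have resolved these boundary cases differently.

```python
def cloud_cover_group(percent: float) -> str:
    """Map a cloud cover percentage to a CloudSEN12 cloud cover group.

    Assumption: upper bounds are inclusive, since the ranges in the
    description share their endpoints (25, 45, 65).
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"cloud cover must be within [0, 100], got {percent}")
    if percent == 0:
        return "clear"
    if percent <= 25:
        return "low-cloudy"
    if percent <= 45:
        return "almost clear"
    if percent <= 65:
        return "mid-cloudy"
    return "cloudy"


# Example: label a handful of hypothetical cloud cover values.
groups = [cloud_cover_group(p) for p in (0, 10, 30, 50, 90)]
```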
An IP is the core unit in CloudSEN12. Each IP contains data from Sentinel-2 optical levels 1C and 2A, Sentinel-1 Synthetic Aperture Radar (SAR), a digital elevation model, surface water occurrence, land cover classes, and cloud mask results from eight cutting-edge cloud detection algorithms. To support standard, weakly, and self-/semi-supervised learning procedures, CloudSEN12 also includes three distinct forms of hand-crafted labelling data: high-quality, scribble, and no annotation. Accordingly, each ROI is randomly assigned to one of three annotation groups:
- 2000 ROIs with pixel-level annotation, where the average annotation time is 150 minutes (high-quality group).
- 2000 ROIs with scribble-level annotation, where the annotation time is 15 minutes (scribble group).
- 5880 ROIs with annotation only in the cloud-free (0%) image (no annotation group).
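The group sizes above add up to the 9880 ROIs in the dataset, and with five IPs per ROI this gives 49400 IPs in total. The following sketch checks that arithmetic and illustrates one possible way to assign ROIs to groups at random. The helper `assign_groups`, the group keys, and the fixed seed are assumptions made for illustration; this is not the procedure the authors used.

```python
import random

# Group sizes as stated in the description.
GROUP_SIZES = {"high-quality": 2000, "scribble": 2000, "no-annotation": 5880}
N_ROIS = 9880
IPS_PER_ROI = 5


def assign_groups(n_rois: int, sizes: dict[str, int], seed: int = 0) -> list[str]:
    """Return one annotation group label per ROI, in shuffled order.

    Hypothetical helper: builds a list with exactly `sizes[name]` copies
    of each label and shuffles it with a seeded generator.
    """
    if sum(sizes.values()) != n_rois:
        raise ValueError("group sizes must add up to the number of ROIs")
    labels = [name for name, size in sizes.items() for _ in range(size)]
    random.Random(seed).shuffle(labels)
    return labels


assignment = assign_groups(N_ROIS, GROUP_SIZES)
total_ips = len(assignment) * IPS_PER_ROI  # five image patches per ROI
```

Shuffling a list with fixed label counts keeps the group sizes exact, which independent per-ROI sampling would not guarantee.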
For high-quality labels, we use the Intelligence foR Image Segmentation (IRIS) active learning technology, a system that combines human photo-interpretation and machine learning. For scribble labels, ground truth pixels were drawn using IRIS but without ML support. Finally, the no annotation dataset is generated automatically, with manual annotation only in the clear image patch. The dataset is available at https://shorturl.at/cgjtz. See our website, https://cloudsen12.github.io/, for examples of how to download the dataset via STAC.