Dataset Open Access

Sentinel-2 reference cloud masks generated by an active learning method

Louis Baetens; Olivier Hagolle

 Reference classifications generated with Active Learning for Cloud Detection (ALCD)

This data set provides reference cloud masks for 38 Sentinel-2 scenes. The masks were created with the ALCD tool, developed by Louis Baetens under the direction of Olivier Hagolle at CESBIO/CNES [1]. Their purpose is to validate the cloud masks generated by the MAJA software [2].

- The `Reference_dataset` directory contains 31 scenes selected in 2017 or 2018.
- The `Hollstein` directory contains 7 scenes that were used to validate the ALCD tool by comparison with manually generated reference images kindly provided by Hollstein et al. [3].

One of these scenes is present in both directories. The "Hollstein" scenes were not used to validate MAJA because they were acquired before Sentinel-2 was operational, at a time when the revisit frequency of observations was degraded.

# Description of the data structure
The name of each scene directory is the name of the corresponding Sentinel-2 L1C product.
In the scene directory, three sub-directories can be found.
- `Classification`
- `Samples`
- `Statistics`
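
A quick way to check that an extracted scene directory follows this layout is sketched below. It assumes the archive has been unpacked locally. The helper name `missing_files` and the constant `EXPECTED` are hypothetical, and the file names are the ones documented in the file descriptions of this record.

```python
from pathlib import Path

# Entries documented for each scene directory. The Samples directory holds
# one shapefile per class, so only the directory itself is checked here.
EXPECTED = [
    "Classification/classification_map.tif",
    "Classification/confidence_enhanced.tif",
    "Classification/contours.png",
    "Classification/used_parameters.json",
    "Samples",
    "Statistics/k_fold_summary.json",
]


def missing_files(scene_dir):
    """Return the documented entries that are absent from scene_dir."""
    scene = Path(scene_dir)
    return [rel for rel in EXPECTED if not (scene / rel).exists()]
```

An empty list means every documented entry was found. Anything else is the list of relative paths to investigate.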

# Description of the files
- `Classification/classification_map.tif` --- the main product, which is the classified scene. 7 classes are available, each represented by a different integer:
  - 0: no_data
  - 1: not used
  - 2: low clouds
  - 3: high clouds
  - 4: cloud shadows
  - 5: land
  - 6: water
  - 7: snow
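
The integer codes above can be turned into class statistics or a binary cloud mask. The following is a minimal sketch in plain Python, assuming the raster has already been read into nested lists of integers (for example with a GeoTIFF reader, which is not shown). The names `CLASS_NAMES`, `CLOUD_CODES`, `class_fractions` and `cloud_mask` are hypothetical, and counting only codes 2 and 3 as cloud (leaving shadows out) is an assumption that depends on the intended use.

```python
from collections import Counter

# Integer codes as documented for classification_map.tif.
CLASS_NAMES = {
    0: "no_data",
    1: "not used",
    2: "low clouds",
    3: "high clouds",
    4: "cloud shadows",
    5: "land",
    6: "water",
    7: "snow",
}

# Assumption: "cloud" means low or high clouds; shadows are excluded.
CLOUD_CODES = {2, 3}


def class_fractions(pixels):
    """Fraction of valid pixels (code != 0) per class name."""
    counts = Counter(p for p in pixels if p != 0)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        CLASS_NAMES.get(code, f"unknown ({code})"): n / total
        for code, n in sorted(counts.items())
    }


def cloud_mask(rows):
    """Boolean mask (same shape as rows), True where the pixel is cloud."""
    return [[p in CLOUD_CODES for p in row] for row in rows]
```

For instance, `class_fractions([0, 2, 2, 5, 6])` ignores the no_data pixel and reports half of the remaining pixels as low clouds.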

- `Classification/confidence_enhanced.tif` --- enhanced confidence map of the classification. The values are between 0 and 255 (coded on 1 byte).
Because the classification map was produced by a Random Forest algorithm, the original confidence of each pixel is the proportion of votes for the majority class.
A median filter has been applied to this confidence map. Finally, the value was saved on 1 byte, which is why it ranges from 0 to 255.
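
The relation between vote proportions and stored byte values can be sketched as follows. This is an illustration only: it assumes a linear scaling of the proportion to the 0 to 255 range with rounding, which the description does not state, and it does not reproduce the median filter. The function names are hypothetical.

```python
from collections import Counter


def vote_confidence(votes):
    """Proportion of trees that voted for the majority class (0.0 to 1.0)."""
    counts = Counter(votes)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(votes)


def to_byte(proportion):
    """Scale a proportion to an integer in 0..255 (assumed linear scaling)."""
    return max(0, min(255, round(proportion * 255)))


def from_byte(value):
    """Convert a stored byte value back to an approximate proportion."""
    return value / 255.0
```

Under this assumption, a pixel for which three of four trees agree has a confidence of 0.75 before filtering, and dividing a stored value by 255 gives back an approximate proportion.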

- `Classification/contours.png` --- the contours of the classes from the classification map, overlaid on the scene. Each class has its own color.
Green: low and high clouds. Yellow: cloud shadows. Blue: water. Purple: snow.

- `Classification/used_parameters.json` --- the parameters that were used to classify the scene. It includes the tile code, the cloudy and clear dates, along with their product reference.

- `Samples/` --- this directory contains all the shapefiles, one per class.

- `Statistics/k_fold_summary.json` --- results of the 10-fold cross-validation on the scene.
5 metrics are computed, in the order given in "metrics_names". "all_metrics" is a list with one entry per fold (10 in total), each entry holding the 5 metrics in that same order.
"means" and "stds" are the means and standard deviations of each metric over the 10 folds.

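The per-metric statistics in `k_fold_summary.json` can be recomputed from "all_metrics" as sketched below. The keys "metrics_names" and "all_metrics" are the ones described above. The function names are hypothetical, and using the population standard deviation (`statistics.pstdev`) is an assumption, since the description does not say which estimator produced "stds".

```python
import json
import statistics


def load_summary(path):
    """Read a k_fold_summary.json file into a dictionary."""
    with open(path) as f:
        return json.load(f)


def summarize(summary):
    """Return {metric name: (mean, std)} computed over the folds."""
    names = summary["metrics_names"]
    folds = summary["all_metrics"]  # one list of metric values per fold
    columns = zip(*folds)           # regroup the values by metric
    return {
        name: (statistics.mean(col), statistics.pstdev(col))
        for name, col in zip(names, columns)
    }
```

Comparing the output of `summarize(load_summary(path))` with the stored "means" and "stds" is a simple consistency check. A small, systematic difference in the standard deviations would suggest that the sample estimator was used instead.
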
# References

[1] Baetens, L.; Desjardins, C.; Hagolle, O. Validation of Copernicus Sentinel-2 Cloud Masks Obtained from MAJA, Sen2Cor, and FMask Processors Using Reference Cloud Masks Generated with a Supervised Active Learning Procedure. Remote Sens. 2019, 11, 433.

[2] Hagolle, O.; Huc, M.; Villa Pascual, D.; Dedieu, G. A Multi-Temporal Method for Cloud Detection, Applied to FORMOSAT-2, VENµS, LANDSAT and SENTINEL-2 Images. Remote Sens. Environ. 2010, 114 (8), 1747-1755.

[3] Hollstein, A.; Segl, K.; Guanter, L.; Brell, M.; Enesco, M. Ready-to-Use Methods for the Detection of Clouds, Cirrus, Snow, Shadow, Water and Clear Sky Pixels in Sentinel-2 MSI Images. Remote Sens. 2016, 8, 666.
