Dataset Open Access

Sentinel-2 Cloud Mask Catalogue

Francis, Alistair; Mrziglod, John; Sidiropoulos, Panagiotis; Muller, Jan-Peter

Overview

This dataset comprises cloud masks for 513 subscenes, each 1022-by-1022 pixels at 20 m resolution, sampled at random from the 2018 Level-1C Sentinel-2 archive. The design of this dataset follows from three observations about cloud masking: (i) performance within a single product is highly correlated, so subscenes provide more value per pixel than full scenes; (ii) current cloud masking datasets often focus on specific regions, or hand-select the products used, which introduces a bias that makes them unrepresentative of real-world data; (iii) cloud mask performance appears to be highly correlated with surface type and cloud structure, so testing should include analysis of failure modes in relation to these variables.
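To give a sense of scale, the figures quoted above can be turned into a ground footprint and a total pixel count. This is plain arithmetic on the numbers in the overview (513 subscenes, 1022 pixels per side, 20 m per pixel); the variable names are illustrative only.

```python
# Ground footprint and pixel counts implied by the overview figures.
PIXELS_PER_SIDE = 1022  # subscene width and height, in pixels
RESOLUTION_M = 20       # ground sampling distance, in metres
N_SUBSCENES = 513       # number of annotated subscenes

side_km = PIXELS_PER_SIDE * RESOLUTION_M / 1000   # length of one side
area_km2 = side_km ** 2                           # area of one subscene
pixels_per_subscene = PIXELS_PER_SIDE ** 2
total_pixels = pixels_per_subscene * N_SUBSCENES  # annotated pixels overall

print(f"side of one subscene: {side_km:.2f} km")
print(f"area of one subscene: {area_km2:.1f} km^2")
print(f"pixels per subscene:  {pixels_per_subscene:,}")
print(f"pixels in catalogue:  {total_pixels:,}")
```

Each subscene therefore covers roughly 20 km by 20 km, and the catalogue as a whole contains a little over half a billion labelled pixels.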

The data were annotated semi-automatically using the IRIS toolkit, which allows users to dynamically train a Random Forest (implemented using LightGBM). This speeds up annotation by iteratively improving its predictions, while preserving the annotator's ability to make final manual changes when needed. This hybrid approach allowed us to process many more masks than would have been possible manually, which we felt was vital in creating a large enough dataset to approximate the statistics of the whole Sentinel-2 archive.

In addition to the pixel-wise, 3-class (CLEAR, CLOUD, CLOUD_SHADOW) segmentation masks, we also provide users with binary classification "tags" for each subscene that can be used in testing to determine performance in specific circumstances. These include:

  • SURFACE TYPE: 11 categories
  • CLOUD TYPE: 7 categories
  • CLOUD HEIGHT: low, high
  • CLOUD THICKNESS: thin, thick
  • CLOUD EXTENT: isolated, extended
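As a rough illustration of how binary tags like these could be used to pick out a test subset, the sketch below filters rows of a tag table by required tag values. The column names (`scene`, `forest`, `snow_ice`, `thin`, `thick`) and the sample rows are hypothetical stand-ins; the actual layout of classification_tags.csv is described in the README and may differ.

```python
import csv
import io

# Hypothetical excerpt: the real column names and values in
# classification_tags.csv may differ from this illustration.
SAMPLE_CSV = """scene,forest,snow_ice,thin,thick
subscene_a,1,0,1,0
subscene_b,0,1,0,1
subscene_c,1,0,0,1
"""


def select_subscenes(rows, **required_tags):
    """Return the scene ids whose binary tags match every required value."""
    selected = []
    for row in rows:
        if all(int(row[tag]) == int(value) for tag, value in required_tags.items()):
            selected.append(row["scene"])
    return selected


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Subscenes tagged as forest AND containing thick cloud.
forest_thick = select_subscenes(rows, forest=1, thick=1)
print(forest_thick)
```

Evaluating a cloud mask separately on each such subset is one way to expose the failure modes tied to surface type and cloud structure.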

Wherever practical, cloud shadows were also annotated; however, this was sometimes not possible because of high-relief terrain or large ambiguities. In total, 424 subscenes were marked with shadows (if present), and 89 have shadows that could not be annotated because of very ambiguous shadow boundaries or terrain that cast significant shadows. If users wish to train an algorithm specifically for cloud shadow masks, we advise them to remove the 89 subscenes for which shadow annotation was not possible. Bear in mind, however, that this will systematically reduce the difficulty of the shadow class compared to real-world use, as these subscenes contain the most difficult shadow examples.
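A minimal sketch of the suggested filtering step follows. It assumes each subscene record carries a boolean flag saying whether shadows were annotated; the `Subscene` class and the `shadows_marked` field are hypothetical names used for illustration, not part of the published files. The synthetic catalogue simply reproduces the 424/89 split stated above.

```python
from dataclasses import dataclass


@dataclass
class Subscene:
    name: str
    shadows_marked: bool  # hypothetical flag: False if shadows could not be annotated


def shadow_training_subset(subscenes):
    """Keep only the subscenes whose cloud shadows were annotated."""
    return [s for s in subscenes if s.shadows_marked]


# Synthetic catalogue matching the stated counts: 424 with shadow
# annotation, 89 without.
catalogue = [Subscene(f"scene_{i:03d}", shadows_marked=(i < 424)) for i in range(513)]

subset = shadow_training_subset(catalogue)
retained = len(subset) / len(catalogue)
print(f"{len(subset)} of {len(catalogue)} subscenes retained ({retained:.1%})")
```

Because the excluded subscenes are the hardest shadow cases, scores obtained on the retained subset are likely to be optimistic for the shadow class.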

In addition to the 20 m subscenes and masks, we also provide users with shapefiles that define the boundary of each mask on the original Sentinel-2 scene. Users who wish to retrieve the L1C bands at their original resolutions can use these to do so.

Please see the README for further details on the dataset structure and more.

Contributions & Acknowledgements

The data were collected, annotated, checked, formatted and published by Alistair Francis and John Mrziglod.

Support and advice were provided by Prof. Jan-Peter Muller and Dr. Panagiotis Sidiropoulos, for which we are grateful.

We would like to extend our thanks to Dr. Pierre-Philippe Mathieu and the rest of the team at ESA PhiLab, who provided the environment in which this project was conceived, and continued to give technical support throughout.

Finally, we thank the ESA Network of Resources for sponsoring this project by providing ICT resources.

Files (15.4 GB)

Name                      Size       MD5
alt_masks.zip             1.1 MB     0140a8b500d85cc8553ec8ba0a304bde
classification_tags.csv   77.9 kB    6911e5a8915daf9a98638eb21ba4afd3
masks.zip                 7.2 MB     c955efe74c52d07f8e8bb02d5143e182
README.pdf                1.9 MB     48fb6afa0195a3736d4ce122d007be36
shapefiles.zip            1.5 MB     3ed79b74eb84431e68f764457b1f00ac
subscenes.zip             15.2 GB    0ad1de0ebeaff529782f456cad2e966f
thumbnails.zip            145.9 MB   ac054a7940e0680e768bdd824e0ee8af
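
The MD5 checksums listed for each file can be used to confirm that a download completed intact, which is worth doing for the 15.2 GB subscenes archive in particular. The sketch below uses only Python's standard `hashlib`; the helper names `md5_of_file` and `verify_downloads`, and the download directory passed to them, are illustrative assumptions.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "alt_masks.zip": "0140a8b500d85cc8553ec8ba0a304bde",
    "classification_tags.csv": "6911e5a8915daf9a98638eb21ba4afd3",
    "masks.zip": "c955efe74c52d07f8e8bb02d5143e182",
    "README.pdf": "48fb6afa0195a3736d4ce122d007be36",
    "shapefiles.zip": "3ed79b74eb84431e68f764457b1f00ac",
    "subscenes.zip": "0ad1de0ebeaff529782f456cad2e966f",
    "thumbnails.zip": "ac054a7940e0680e768bdd824e0ee8af",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch) or None (absent)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == checksum) if path.is_file() else None
    return results
```

Calling `verify_downloads("path/to/downloads")` returns a dictionary keyed by file name; any `False` entry indicates a corrupted or incomplete download that should be fetched again.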