Published July 13, 2021 | Version 1
Dataset Open

Sentinel-2 KappaZeta Cloud and Cloud Shadow Masks

  • 1. KappaZeta Ltd, Tartu, Estonia; Institute of Computer Science, University of Tartu, Tartu, Estonia
  • 2. KappaZeta Ltd, Tartu, Estonia; Tartu Observatory
  • 3. KappaZeta Ltd, Tartu, Estonia


General information

The dataset consists of 4403 labelled subscenes from 155 Sentinel-2 (S2) Level-1C (L1C) products distributed over the Northern European terrestrial area. Each S2 product was oversampled to 10 m resolution and divided into subscenes of 512 x 512 pixels. Six L1C S2 products were labelled in full; from each of the other 149 S2 products, the ~10 most challenging subscenes were selected for labelling. The dataset contains around 30 S2 products per month from April to August and 3 S2 products per month for September and October. The selected L1C S2 products cover different cloud types, such as cumulus, stratus, and cirrus, spread over various geographical locations in Northern Europe.
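The subscene counts above can be sanity-checked with a little arithmetic. The sketch below is only an estimate: it assumes a full L1C tile is 10980 x 10980 pixels at 10 m (the standard Sentinel-2 tile size) and that a fully labelled product is covered by a regular grid of 512-pixel subscenes with the last row and column padded or overlapped. The actual tiling scheme is not described here, so treat the grid assumption as hypothetical.

```python
import math

SUBSCENE_PX = 512    # subscene edge length in pixels (from the description)
PIXEL_SIZE_M = 10    # pixel size in metres (from the description)
S2_TILE_PX = 10980   # edge of a Sentinel-2 L1C tile at 10 m (assumed full tile)

# Ground footprint of one subscene edge, in kilometres.
subscene_km = SUBSCENE_PX * PIXEL_SIZE_M / 1000

# Assumed regular grid: subscenes needed to cover one tile edge.
per_axis = math.ceil(S2_TILE_PX / SUBSCENE_PX)
per_product = per_axis ** 2

# 6 fully labelled products plus roughly 10 subscenes from each of the other 149.
fully_labelled = 6 * per_product
partially_labelled = 149 * 10
estimate = fully_labelled + partially_labelled

print(f"subscene footprint: {subscene_km} km x {subscene_km} km")
print(f"subscenes per fully labelled product (assumed grid): {per_product}")
print(f"estimated total: {estimate} (reported: 4403)")
```

Under these assumptions the estimate lands within a few subscenes of the reported 4403, which is consistent with "~10" being an approximate per-product count.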

The pixel-wise classification map consists of the following categories:

  • 0 – MISSING: missing or invalid pixels;
  • 1 – CLEAR: pixels without clouds or cloud shadows;
  • 2 – CLOUD SHADOW: pixels with cloud shadows;
  • 3 – SEMI TRANSPARENT CLOUD: pixels with thin clouds through which the land is visible; includes cirrus clouds at the high cloud level (5-15 km);
  • 4 – CLOUD: pixels with cloud; includes stratus and cumulus clouds at the low cloud level (from 0-0.2 km to 2 km);
  • 5 – UNDEFINED: pixels whose class the labeller could not determine with confidence.
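The class codes listed above can be captured in a small lookup table. The following is a minimal sketch, assuming a mask has already been loaded as a 2D integer NumPy array (the on-disk format is described in the README, not here). The helper name `class_fractions` and the choice to ignore MISSING and UNDEFINED pixels are illustrative assumptions, not part of the dataset.

```python
import numpy as np

# Class codes as listed in the dataset description.
CLASS_NAMES = {
    0: "MISSING",
    1: "CLEAR",
    2: "CLOUD SHADOW",
    3: "SEMI TRANSPARENT CLOUD",
    4: "CLOUD",
    5: "UNDEFINED",
}

def class_fractions(mask, ignore=(0, 5)):
    """Return the share of each non-ignored class among the non-ignored pixels.

    `ignore` defaults to MISSING and UNDEFINED (an assumption made for this
    sketch; adjust it to suit your evaluation protocol).
    """
    mask = np.asarray(mask)
    valid = ~np.isin(mask, ignore)
    n_valid = int(valid.sum())
    # Boolean indexing flattens the mask to the valid pixels only.
    counts = np.bincount(mask[valid], minlength=len(CLASS_NAMES))
    return {
        name: (float(counts[code]) / n_valid if n_valid else 0.0)
        for code, name in CLASS_NAMES.items()
        if code not in ignore
    }

# Tiny synthetic mask standing in for a 512 x 512 subscene label.
demo_mask = np.array([[1, 1, 4, 4],
                      [2, 3, 0, 5]], dtype=np.uint8)
fractions = class_fractions(demo_mask)
```

Excluding codes 0 and 5 keeps pixels with no usable label out of per-class statistics; whether that is appropriate depends on the use case.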

The dataset was labelled using the Computer Vision Annotation Tool (CVAT). The labelling was performed semi-automatically, with the possibility of integrating an active learning process.

The limitations of the dataset must be considered: the data cover only terrestrial regions and do not include water areas; winter conditions are not represented; the dataset represents summer conditions, so September and October contain only test products used for validation. The current subscenes are not georeferenced; however, we are working towards including georeferencing in the next version.
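Because September and October hold only test products, a train/test split can be derived from the sensing month. The sketch below assumes that the original Sentinel-2 L1C product name is available for each subscene and follows the standard ESA naming convention (`..._MSIL1C_YYYYMMDDThhmmss_...`). How product names map to files in this dataset is documented in the README, so the helper `split_for_product` and the example names are hypothetical.

```python
import re

# Sensing start time in a standard Sentinel-2 L1C product name.
_SENSING_DATE = re.compile(r"_MSIL1C_(\d{4})(\d{2})(\d{2})T\d{6}_")

TEST_MONTHS = (9, 10)  # September and October, per the dataset description

def split_for_product(product_name):
    """Return "test" for September/October products, "train" otherwise."""
    match = _SENSING_DATE.search(product_name)
    if match is None:
        raise ValueError(f"no L1C sensing date found in {product_name!r}")
    month = int(match.group(2))
    return "test" if month in TEST_MONTHS else "train"

# Hypothetical product names, used only to illustrate the convention.
may_product = "S2A_MSIL1C_20200529T094041_N0209_R036_T35VLF_20200529T120303"
sep_product = "S2B_MSIL1C_20200915T095029_N0209_R079_T35VLF_20200915T115915"

print(split_for_product(may_product))
print(split_for_product(sep_product))
```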

More details about the dataset structure can be found in the README.

Contributions and Acknowledgements

The data were annotated by Fariha Harun and Olga Wold. Data verification and software development were performed by Indrek Sünter, Heido Trofimov, Anton Kostiukhin, Marharyta Domnich, Mihkel Järveoja, and Olga Wold. The methodology was developed by Kaupo Voormansik, Indrek Sünter, and Marharyta Domnich.
We would like to thank the annotation tool team for their prompt and individual customer support. We are grateful to the European Space Agency for reviews and suggestions. We would like to extend our thanks to Prof. Gholamreza Anbarjafari for the feedback and directions.
The project was funded by the European Space Agency, Contract No. 4000132124/20/I-DT.



Files (27.1 GB)

Size
347.2 kB
27.1 GB