Dataset Open Access

Francis, Alistair; Mrziglod, John; Sidiropoulos, Panagiotis; Muller, Jan-Peter

### Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:creator>Francis, Alistair</dc:creator>
<dc:creator>Mrziglod, John</dc:creator>
<dc:creator>Sidiropoulos, Panagiotis</dc:creator>
<dc:creator>Muller, Jan-Peter</dc:creator>
<dc:date>2020-11-01</dc:date>
<dc:description>Overview

This dataset comprises cloud masks for 513 subscenes, each 1022-by-1022 pixels at 20 m resolution, sampled randomly from the 2018 Level-1C Sentinel-2 archive. The design of this dataset follows from three observations about cloud masking: (i) performance over an entire product is highly correlated, so subscenes provide more value per pixel than full scenes; (ii) current cloud masking datasets often focus on specific regions, or hand-select the products used, which introduces a bias and makes the dataset unrepresentative of real-world data; (iii) cloud mask performance appears to be highly correlated with surface type and cloud structure, so testing should include analysis of failure modes in relation to these variables.

The data were annotated semi-automatically using the IRIS toolkit, which allows users to dynamically train a Random Forest (implemented using LightGBM). This speeds up annotation by iteratively improving its predictions, while preserving the annotator's ability to make final manual changes when needed. This hybrid approach allowed us to process many more masks than would have been possible manually, which we felt was vital in creating a large enough dataset to approximate the statistics of the whole Sentinel-2 archive.

In addition to the pixel-wise, 3-class (CLEAR, CLOUD, CLOUD_SHADOW) segmentation masks, we also provide users with binary classification "tags" for each subscene that can be used in testing to determine performance in specific circumstances. These include:

SURFACE TYPE: 11 categories
CLOUD TYPE: 7 categories
CLOUD HEIGHT: low, high
CLOUD THICKNESS: thin, thick
CLOUD EXTENT: isolated, extended

In addition to the 20 m subscenes and masks, we also provide users with shapefiles that define the boundary of each mask on the original Sentinel-2 scene. Users who wish to retrieve the L1C bands at their original resolutions can use these shapefiles to do so.

Please see the README for further details on the dataset structure and more.

Contributions &amp; Acknowledgements

The data were collected, annotated, checked, formatted and published by Alistair Francis and John Mrziglod.

Support and advice were provided by Prof. Jan-Peter Muller and Dr. Panagiotis Sidiropoulos, for which we are grateful.

We would like to extend our thanks to Dr. Pierre-Philippe Mathieu and the rest of the team at ESA PhiLab, who provided the environment in which this project was conceived, and continued to give technical support throughout.

Finally, we thank the ESA Network of Resources for sponsoring this project by providing ICT resources.

</dc:description>
<dc:identifier>https://zenodo.org/record/4172871</dc:identifier>
<dc:identifier>10.5281/zenodo.4172871</dc:identifier>
<dc:identifier>oai:zenodo.org:4172871</dc:identifier>
<dc:language>eng</dc:language>
<dc:relation>info:eu-repo/grantAgreement/RCUK/STFC/1912521/</dc:relation>
<dc:relation>doi:10.5281/zenodo.4172870</dc:relation>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:subject>Earth Observation</dc:subject>
<dc:subject>Remote sensing</dc:subject>
<dc:subject>AI4EO</dc:subject>
<dc:subject>Machine Learning</dc:subject>
<dc:subject>Deep Learning</dc:subject>
<dc:subject>satellite</dc:subject>
<dc:subject>meteorology</dc:subject>
<dc:subject>atmosphere</dc:subject>
<dc:subject>atmospheric</dc:subject>
<dc:subject>optical</dc:subject>
<dc:subject>computer vision</dc:subject>
<dc:subject>image segmentation</dc:subject>
<dc:subject>validation</dc:subject>
<dc:subject>training</dc:subject>
<dc:subject>copernicus</dc:subject>
<dc:subject>infrared</dc:subject>
<dc:subject>multispectral</dc:subject>
<dc:subject>spaceborne</dc:subject>
<dc:subject>Sentinel-2</dc:subject>
<dc:subject>iris</dc:subject>
<dc:type>info:eu-repo/semantics/other</dc:type>
<dc:type>dataset</dc:type>
</oai_dc:dc>
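
A record in this format can be read with a standard XML parser. The following is a minimal sketch using Python's built-in `xml.etree.ElementTree`, not part of the dataset's own tooling. It assumes an abbreviated copy of the record above (description, XML declaration, schema attributes and most subjects omitted); the names `RECORD` and `summarize_record`, and the rule that a DOI is the identifier starting with `10.`, are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the root element of the record above.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Abbreviated copy of the export above (hypothetical variable name).
RECORD = """<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
<dc:creator>Francis, Alistair</dc:creator>
<dc:creator>Mrziglod, John</dc:creator>
<dc:creator>Sidiropoulos, Panagiotis</dc:creator>
<dc:creator>Muller, Jan-Peter</dc:creator>
<dc:date>2020-11-01</dc:date>
<dc:identifier>https://zenodo.org/record/4172871</dc:identifier>
<dc:identifier>10.5281/zenodo.4172871</dc:identifier>
<dc:identifier>oai:zenodo.org:4172871</dc:identifier>
<dc:subject>Earth Observation</dc:subject>
<dc:subject>Sentinel-2</dc:subject>
<dc:type>dataset</dc:type>
</oai_dc:dc>"""


def summarize_record(xml_text):
    """Collect a few Dublin Core fields from an oai_dc record into a dict."""
    root = ET.fromstring(xml_text)

    def texts(tag):
        # All non-empty text values of the repeated dc:<tag> elements.
        return [el.text.strip() for el in root.findall(f"dc:{tag}", NS) if el.text]

    identifiers = texts("identifier")
    dates = texts("date")
    # Assumption: the bare DOI is the identifier that starts with "10.".
    doi = next((i for i in identifiers if i.startswith("10.")), None)
    return {
        "creators": texts("creator"),
        "date": dates[0] if dates else None,
        "doi": doi,
        "subjects": texts("subject"),
    }


summary = summarize_record(RECORD)
print(summary)
```

Dublin Core elements such as `dc:creator`, `dc:identifier` and `dc:subject` may repeat, so the sketch returns lists for them rather than single values.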
