AI4SNOW SnowGalileo Datasets and Model Checkpoints
Authors/Creators
Description
This repository contains the training and evaluation datasets, as well as the trained model checkpoints for the ESA AI4Snow (AI4Science 4000143295/23/I-DT) model "SnowGalileo". SnowGalileo is a pre-trained transformer model, fine-tuned for daily fractional snow cover (FSC) mapping at 100 m resolution.
SnowGalileo's processing comprises three stages: (1) pre-training, where the model is trained in a self-supervised manner using unlabeled multi-source Earth observation (EO) data; (2) fine-tuning, where the model is trained and validated in a supervised manner using pairs of EO data and FSC labels; (3) evaluation, where the model is tested using pairs of EO data and FSC labels. Pre-training and fine-tuning data points were collected between 2020 and 2024 and are distributed globally across mountain ranges in the Northern Hemisphere. Evaluation data were collected between 2020 and 2023 and are concentrated geographically on the Canadian Rockies and the Swiss Alps. The fine-tuning data are further split into "train" and "test" subfolders, corresponding to a random 80/20 machine-learning validation split. To provide independent evaluation conditions that the model has not seen during training, the Canadian Rockies and Swiss Alps regions are excluded from the fine-tuning dataset.
Input Data
Comprises the following files:
Pre-training:
- "pretrain_inputs_h5pys.tar.xz"
Fine-tuning:
Evaluation:
- "evaluate_rockies_inputs_h5pys.tar.xz"
- "evaluate_rockies_inputs_with_clouds_h5pys.tar.xz"
- "evaluate_switzerland_inputs_h5pys.tar.xz"
- "evaluate_switzerland_inputs_with_clouds_h5pys.tar.xz"
All input data originate from Google Earth Engine (Gorelick et al., 2017).
SnowGalileo uses stacked multi-sensor Earth observation imagery and auxiliary data collected from a 1 km x 1 km ground area over an 8-day period to generate FSC predictions. The products are partly upsampled/downsampled to match the shape of the modality groups described below, which enables processing with SnowGalileo.
The naming convention for all pre-training files is as follows:
'min_lat=[MIN_LAT]_min_lon=[MIN_LON]_max_lat=[MAX_LAT]_max_lon=[MAX_LON]_season=[SEASON]_dates=[DATE_RANGE].h5'
where the latitudes and longitudes specify the spatial bounds of the file, and the date range gives the period covered by the time series.
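The pre-training naming convention above can be split into its fields with a regular expression. This is a minimal sketch: the function name `parse_pretrain_name` is hypothetical, and because the exact formats of the season and date-range values are not specified here, all fields are returned as raw strings.

```python
import re

# Pattern mirroring the documented template. Values are kept as raw strings
# because their exact formats (e.g. of the date range) are not specified.
PRETRAIN_NAME = re.compile(
    r"^min_lat=(?P<min_lat>.+?)_min_lon=(?P<min_lon>.+?)"
    r"_max_lat=(?P<max_lat>.+?)_max_lon=(?P<max_lon>.+?)"
    r"_season=(?P<season>.+?)_dates=(?P<dates>.+)\.h5$"
)


def parse_pretrain_name(filename):
    """Return the fields of a pre-training file name as a dict of strings."""
    match = PRETRAIN_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected pre-training file name: {filename!r}")
    return match.groupdict()


# Made-up file name for illustration only; real values may look different.
example = (
    "min_lat=46.50_min_lon=7.90_max_lat=46.51_max_lon=7.91"
    "_season=winter_dates=2021-01-01_2021-01-08.h5"
)
fields = parse_pretrain_name(example)
print(fields["min_lat"], fields["season"], fields["dates"])
```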
The naming convention for all fine-tuning and evaluation files is as follows:
'[ORIG_SENSOR]_[DATE]_[FSC]_[LATITUDE]_[LONGITUDE].h5'
where 'ORIG_SENSOR' refers to the sensor from which the FSC values are derived (e.g. LC08 for Landsat-8 (OLI/TIRS)) and 'FSC' indicates the mean FSC value of the corresponding label file.
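As a sketch of how this convention could be decoded, the snippet below splits a file name on underscores. It assumes that none of the five fields contains an underscore and that FSC, latitude, and longitude are written as plain decimal numbers; neither is confirmed here. `SampleName` and `parse_sample_name` are hypothetical names, and the example file name is invented.

```python
from pathlib import Path
from typing import NamedTuple


class SampleName(NamedTuple):
    orig_sensor: str   # e.g. "LC08" for Landsat-8 (OLI/TIRS)
    date: str          # kept as a string; the date format is not specified
    mean_fsc: float    # mean FSC of the corresponding label file
    latitude: float
    longitude: float


def parse_sample_name(filename):
    """Split '[ORIG_SENSOR]_[DATE]_[FSC]_[LATITUDE]_[LONGITUDE].<ext>'."""
    parts = Path(filename).stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields: {filename!r}")
    sensor, date, fsc, lat, lon = parts
    return SampleName(sensor, date, float(fsc), float(lat), float(lon))


# Invented example name, for illustration only.
sample = parse_sample_name("LC08_20210315_0.42_46.5_7.9.h5")
print(sample.orig_sensor, sample.mean_fsc)
```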
Each h5py file stores different datasets distinguished by their key. In the following, we refer to top-of-atmosphere (TOA), surface reflectance (SR), height (H), width (W), timestep (T), and channels (C). Time series cover the 8 days leading up to and including the prediction date, ordered earliest to latest.
The datasets in each h5py file are:
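To make the time-axis ordering concrete, the following sketch lists the calendar dates that would correspond to the 8 timesteps of a sample. It assumes one timestep per calendar day, which is consistent with 8 timesteps over 8 days but is not stated explicitly; `input_window` is a hypothetical helper.

```python
from datetime import date, timedelta


def input_window(prediction_date, n_days=8):
    """Dates covered by a sample, earliest first, ending on the prediction date."""
    return [prediction_date - timedelta(days=n_days - 1 - i) for i in range(n_days)]


window = input_window(date(2021, 3, 15))
print(window[0], "to", window[-1])
```

Under this assumption, index 0 along T is seven days before the prediction date and index 7 is the prediction date itself.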
- s_t_h_x = sensor data of approx. 10 - 30 m spatial resolution. Data points have a shape of (100, 100, 8, 15) in the format (H, W, T, C).
  - Channels include: Sentinel-1 VV, VH, incidence angle + Sentinel-2 (TOA) B2, B3, B4, B8, B11, B12 + Landsat (TOA) B2-B7.
- s_t_m_x = sensor data of 300 m spatial resolution. Data points have a shape of (5, 5, 8, 2) in the format (H, W, T, C).
  - Channels include: Sentinel-3 (TOA) Oa17, Oa21.
- s_t_l_x = sensor data of approx. 500 m spatial resolution. Data points have a shape of (2, 2, 8, 11) in the format (H, W, T, C).
  - Channels include: MODIS (SR) B1-B7, VIIRS (SR) I1, I3, NDSI, NDVI.
- sp_x = data of approx. 10 m spatial resolution that are constant across time. Data points have a shape of (100, 100, 14) in the format (H, W, C).
  - Channels include: Copernicus DEM elevation, slope, aspect, one-hot-encoded ESA Worldcover landcover.
- t_x = Data points have a shape of (8, 9) in the format (T, C).
  - Channels include: VIIRS (SR) M5, M7, M10, M11, ERA5 skin temperature, temperature 2m, total precipitation sum, u component of wind, v component of wind.
- st_x = Data points have a shape of (3,) in the format (C,).
  - Channels include: x, y, z (cartesian coordinates).
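A quick sanity check after extraction is to compare the shapes stored in a file with the shapes listed above. The sketch below assumes the six datasets sit at the root of each HDF5 file under exactly these keys; `shape_mismatches` and `read_shapes` are hypothetical helpers, and `read_shapes` needs the third-party `h5py` package.

```python
# Shapes as listed above, keyed by dataset name.
EXPECTED_SHAPES = {
    "s_t_h_x": (100, 100, 8, 15),  # (H, W, T, C)
    "s_t_m_x": (5, 5, 8, 2),       # (H, W, T, C)
    "s_t_l_x": (2, 2, 8, 11),      # (H, W, T, C)
    "sp_x": (100, 100, 14),        # (H, W, C)
    "t_x": (8, 9),                 # (T, C)
    "st_x": (3,),                  # (C,)
}


def shape_mismatches(shapes):
    """Return human-readable problems for a {key: shape} mapping."""
    problems = []
    for key, expected in EXPECTED_SHAPES.items():
        actual = shapes.get(key)
        if actual is None:
            problems.append(f"{key}: missing")
        elif tuple(actual) != expected:
            problems.append(f"{key}: expected {expected}, got {tuple(actual)}")
    return problems


def read_shapes(path):
    """Collect the shapes of all root-level datasets in an HDF5 file."""
    import h5py  # third-party; imported lazily so the check above runs without it

    with h5py.File(path, "r") as f:
        return {k: f[k].shape for k in f.keys() if isinstance(f[k], h5py.Dataset)}
```

Calling `shape_mismatches(read_shapes(path))` on an extracted file would return an empty list if the file matches the description above.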
Missing data (e.g. due to missing coverage) are marked with a placeholder value of -9999 or -10000. However, these placeholders may not flag every missing value. To mask missing data reliably, use the validity masks described below.
Additionally, validity masks for each of the datasets are provided that state whether a pixel is valid (value of 1) or should be masked out in the training process due to missing data (flagged with value of 0). More information on how these masks were created can be found in our code repository.
- valid_data_mask_s_t_h = validity mask for s_t_h_x
- valid_data_mask_s_t_m = validity mask for s_t_m_x
- valid_data_mask_s_t_l = validity mask for s_t_l_x
- valid_data_mask_sp = validity mask for sp_x
- valid_data_mask_t = validity mask for t_x
- valid_data_mask_st = validity mask for st_x
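The mask keys follow directly from the dataset keys, so a mask can be looked up programmatically. The sketch below also shows one possible way to apply a mask with NumPy, under the unconfirmed assumption that each mask has a shape that broadcasts against its dataset; if the masks are stored at a different granularity, the code repository is the place to check. Both function names are hypothetical, and `load_masked` needs the third-party `h5py` and `numpy` packages.

```python
def mask_key_for(data_key):
    """Map a dataset key such as 's_t_h_x' to 'valid_data_mask_s_t_h'."""
    if not data_key.endswith("_x"):
        raise ValueError(f"not a dataset key: {data_key!r}")
    return "valid_data_mask_" + data_key[: -len("_x")]


def load_masked(path, data_key):
    """Read a dataset and replace invalid entries (mask == 0) with NaN.

    Assumption: the mask broadcasts against the data; np.broadcast_to
    raises ValueError if it does not.
    """
    import h5py          # third-party, imported lazily
    import numpy as np   # third-party, imported lazily

    with h5py.File(path, "r") as f:
        data = f[data_key][()].astype("float32")
        mask = f[mask_key_for(data_key)][()].astype(bool)
    mask = np.broadcast_to(mask, data.shape)
    return np.where(mask, data, np.nan)
```

Relying on the mask rather than on the -9999 / -10000 placeholders follows the recommendation above.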
As the spatiotemporal location of the fine-tuning input data is limited by the availability of labels, which are only provided on cloud-free days, we have created a cloudy version of the fine-tuning inputs using the method described by Czerkawski et al. (2023). These inputs are complementary; either the normal version can be used to train the model under clear conditions or the cloudy version can be used under cloudy conditions.
The h5py files provided are pre-processed versions of the GeoTIFF files originally exported from Google Earth Engine. We did not store the original files due to storage size limitations. Retrieving the original files is possible and documented, and can be done using our provided code.
Labels
Comprises the following files:
Fine-tuning:
-
"finetune_labels_tifs.tar.xz"
Evaluation:
SnowGalileo's FSC labels were created from Landsat-8 and Landsat-9 imagery using the SnowPEx algorithm (Ripper et al., 2019; as described in Koehler et al., 2022).
The naming convention for all label files is as follows:
'[ORIG_SENSOR]_[DATE]_[FSC]_[LATITUDE]_[LONGITUDE].tif'
where 'ORIG_SENSOR' refers to the sensor from which the FSC values are derived (e.g. LC08 for Landsat-8 (OLI/TIRS)) and 'FSC' indicates the mean FSC value of the respective file.
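Because the input (.h5) and label (.tif) files use the same naming pattern, one plausible way to build training pairs is to match them by file stem. This is a sketch that assumes an input and its label share an identical stem, which is not confirmed here; `pair_inputs_with_labels` and the directory arguments are hypothetical.

```python
from pathlib import Path


def pair_inputs_with_labels(input_dir, label_dir):
    """Return (input_path, label_path) tuples for files with matching stems."""
    labels = {p.stem: p for p in Path(label_dir).rglob("*.tif")}
    pairs = []
    for h5_path in sorted(Path(input_dir).rglob("*.h5")):
        label_path = labels.get(h5_path.stem)
        if label_path is not None:  # inputs without a label are skipped
            pairs.append((h5_path, label_path))
    return pairs
```

Using `rglob` means the "train" and "test" subfolders are searched recursively, so the function can be pointed at either a subfolder or the extracted root.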
Each label has a resolution of 100 m, so each image is 10 x 10 pixels (a 1 km x 1 km ground area captured at 100 m resolution). Values range from 0 to 1, where 0 corresponds to 0 % FSC and 1 corresponds to 100 % FSC.
All files are provided as compressed tar archives. Once you have downloaded and extracted the datasets, you will need about 837 GB of storage space for them (573 GB for pre-training, 251 GB for fine-tuning, and about 13 GB for evaluation data).
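Given the sizes above, it can be worth checking free disk space before extracting. The following sketch uses only the Python standard library; the dictionary keys and function names are hypothetical, and the sizes are the approximate figures quoted above.

```python
import shutil
import tarfile
from pathlib import Path

# Approximate extracted sizes in GB, as quoted above.
REQUIRED_GB = {"pretrain": 573, "finetune": 251, "evaluate": 13}


def has_room(dest, required_gb):
    """True if the filesystem holding `dest` has at least `required_gb` GB free."""
    return shutil.disk_usage(dest).free / 1e9 >= required_gb


def extract_archive(archive, dest):
    """Extract a .tar.xz archive into `dest`, creating the directory if needed."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:xz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)


print(sum(REQUIRED_GB.values()), "GB needed in total")
```

For example, `has_room(".", REQUIRED_GB["pretrain"])` could guard a call to `extract_archive("pretrain_inputs_h5pys.tar.xz", "data/pretrain")`, where the destination path is only an illustration.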
Further information about the creation of the labels, as well as the usage of the files can be found in our accompanying repository. Further information about data sampling, the exact data products used, and processing will be available in the respective paper upon publication.
While the data was created to train and evaluate SnowGalileo, its usage is not bound to the model and we highly encourage practitioners to extend its usage for training and evaluating future AI models.
Checkpoints
This repository also includes all final model checkpoints for SnowGalileo.
"checkpoints_snowgalileo_pretrain/" contains the checkpoints after pre-training on snow-covered mountains and "checkpoints_snowgalileo_finetune/" contains the checkpoints after fine-tuning on fractional snow cover. Our code repository will contain more information on how to integrate the checkpoints into the training and evaluation of models.
The SnowGalileo-finetune checkpoints use the following naming convention:
‘[CONDITION]_[INIT]_[SEED]_[ID].pth’
where
- CONDITION is one of “clear” (no clouds + HR in prediction date), “clouds” (clouds + HR in prediction date), “no_hr” (no clouds + no HR in prediction date), or “clouds_no_hr” (clouds + no HR in prediction date). This variable indicates the condition under which the model was fine-tuned (e.g., a “clouds” model was trained on data with generated clouds, while a “clear” model has only seen cloud-free images during training).
- INIT is either “pretrained” for a pre-trained model or “random” when the model was not pre-trained before the FSC training process.
- SEED is the random seed (we train all models using seeds 42, 20, and 90), and ID is a unique run id.
- For the ablation runs, “5000samples” or “10000samples” is added at the beginning of the file name. This means that the model was fine-tuned on either 5000 or 10000 random samples of the actual fine-tuning dataset (the others are fine-tuned on ~23000 samples).
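Since some CONDITION values contain underscores themselves ("no_hr", "clouds_no_hr"), splitting a checkpoint name on "_" is ambiguous. The sketch below uses a regular expression with the four documented conditions instead. It assumes the ablation prefix is joined to the rest of the name with an optional underscore and that the run id is everything after the seed; both are guesses. `CheckpointName`, `parse_checkpoint_name`, and the example file names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class CheckpointName(NamedTuple):
    condition: str             # "clear", "clouds", "no_hr" or "clouds_no_hr"
    init: str                  # "pretrained" or "random"
    seed: int
    run_id: str
    n_samples: Optional[int]   # 5000 / 10000 for ablation runs, else None


_CHECKPOINT = re.compile(
    r"^(?:(?P<n_samples>\d+)samples_?)?"                  # optional ablation prefix
    r"(?P<condition>clouds_no_hr|no_hr|clouds|clear)_"
    r"(?P<init>pretrained|random)_"
    r"(?P<seed>\d+)_"
    r"(?P<run_id>.+)\.pth$"
)


def parse_checkpoint_name(filename):
    """Decode '[CONDITION]_[INIT]_[SEED]_[ID].pth' with an optional samples prefix."""
    m = _CHECKPOINT.match(filename)
    if m is None:
        raise ValueError(f"unexpected checkpoint name: {filename!r}")
    n = m.group("n_samples")
    return CheckpointName(
        m.group("condition"), m.group("init"), int(m.group("seed")),
        m.group("run_id"), int(n) if n else None,
    )


# Invented names for illustration only.
full = parse_checkpoint_name("clouds_no_hr_pretrained_42_abc123.pth")
ablation = parse_checkpoint_name("5000samples_clear_random_90_run7.pth")
print(full.condition, full.seed, ablation.n_samples)
```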
Note: The checkpoints of the trained baseline models (random forest, support vector regressor, MLP) are not included in this repository due to file size limitations. However, if you are interested in them, please contact marlena1@gmx.de.
References
Czerkawski, M., Atkinson, R., Michie, C., & Tachtatzis, C. (2023). SatelliteCloudGenerator: Controllable cloud and shadow synthesis for multi-spectral optical satellite images. Remote Sensing, 15(17), 4138.
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18-27.
Koehler, J., Bauer, A., Dietz, A. J., & Kuenzer, C. (2022). Towards forecasting future snow cover dynamics in the European Alps—The potential of long optical remote-sensing time series. Remote Sensing, 14(18), 4461.
Ripper, E., Schwaizer, G., Nagler, T., Metsämäki, S., Törmä, M., Fernandes, R., Crawford, C. J., Painter, T. H., & Rittger, K. (2019). Guidelines for the generation of snow extent products from high resolution optical sensors (Final Deliverable D8). SnowPEx. https://snowpex.enveo.at/doc/D08_Guidelines_for_the_generation_of_snow_extent_products_from_HR_optical_sensors_FINAL_v2.1.pdf
Files
(44.7 GB)
| Checksum | Size |
|---|---|
| md5:439a17d7ec670e68be5807ad9066c106 | 777.3 MB |
| md5:301d25bafc0f41377015edf0684e6fd3 | 88.3 MB |
| md5:f730d87b20a4f6f993f16f0083421302 | 225.3 MB |
| md5:ca8a4e6e9c1e76305ab34dd055113d78 | 283.8 MB |
| md5:fad704d7a2fa6dfd967d88218211e70b | 147.3 kB |
| md5:30e111436d13eda27c7134088030cfee | 261.5 MB |
| md5:37e3f13dd3d2e14ff80a8f4efe967167 | 356.2 MB |
| md5:51ce724c02396484cc7fd1e31abf1712 | 184.0 kB |
| md5:4105a6330cb116a3196ce67915fc5e30 | 7.9 GB |
| md5:f1f9de2a42e6398a360d1f0778e7ee8b | 12.2 GB |
| md5:bc184a287cd1635627afecefd61222b1 | 6.0 MB |
| md5:360a6b06438ee8b18a15da2ec8e6160d | 22.6 GB |
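The md5 values listed above can be used to verify a download. The following is a minimal sketch using the standard library; `md5_of` and `matches_checksum` are hypothetical helper names, and the file is read in chunks so that multi-gigabyte archives do not need to fit in memory.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:360a6b06...'."""
    expected = listed.strip().lower().removeprefix("md5:")
    return md5_of(path) == expected
```

For example, `matches_checksum(local_path, "md5:360a6b06438ee8b18a15da2ec8e6160d")` would check the 22.6 GB file, where `local_path` is wherever that file was saved.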
Additional details
Funding
- European Space Agency
- AI4Science 4000143295/23/I-DT