AI4SNOW SnowGalileo Datasets and Model Checkpoints

Reil, Marlena; Kaltenborn, Julia; Pelletier, Francis; Rolnick, David; Dietz, Andreas; Baumhoer, Celia

doi:10.5281/zenodo.20735656

Published June 17, 2026 | Version v1

Dataset Open

1. McGill University
2. Mila - Quebec Artificial Intelligence Institute
3. Deutsches Zentrum für Luft- und Raumfahrt e. V. (DLR)

This repository contains the training and evaluation datasets, as well as the trained model checkpoints for the ESA AI4Snow (AI4Science 4000143295/23/I-DT) model "SnowGalileo". SnowGalileo is a pre-trained transformer model, fine-tuned for daily fractional snow cover (FSC) mapping at 100 m resolution.

SnowGalileo's processing comprises three stages: (1) pre-training, where the model is trained in a self-supervised manner using unlabeled multi-source Earth observation (EO) data; (2) fine-tuning, where the model is trained and validated using EO data and FSC label pairs in a supervised manner; (3) evaluation, where the model is tested using EO data and FSC label pairs in a supervised manner. Pre-training and fine-tuning data points were collected between 2020 and 2024 and are distributed globally across mountain ranges in the Northern Hemisphere. Evaluation data were collected between 2020 and 2023 and are concentrated geographically on the Canadian Rockies and the Swiss Alps. The fine-tuning data is further split into “train” and “test” subfolders, corresponding to an 80/20 random machine learning validation split. To provide independent evaluation conditions that the model has not seen during training, the Canadian Rockies and Swiss Alps regions are excluded from the fine-tuning dataset.

Input Data

Comprises the following files:

Pre-training:

"pretrain_inputs_h5pys.tar.xz"

Fine-tuning:

"finetune_inputs_h5pys.tar.xz"
"finetune_inputs_with_clouds_h5pys.tar.xz"

Evaluation:

"evaluate_rockies_inputs_h5pys.tar.xz"
"evaluate_rockies_inputs_with_clouds_h5pys.tar.xz"
"evaluate_switzerland_inputs_h5pys.tar.xz"
"evaluate_switzerland_inputs_with_clouds_h5pys.tar.xz"

All input data originate from Google Earth Engine (Gorelick et al., 2017).

SnowGalileo uses stacked multi-sensor Earth observation imagery and auxiliary data collected from a 1 km x 1 km ground area over an 8-day period to generate FSC predictions. The products are partly upsampled/downsampled to match the shape of the modality groups described below, which enables processing with SnowGalileo.

The naming convention for all pre-training files is as follows:

'min_lat=[MIN_LAT]_min_lon=[MIN_LON]_max_lat=[MAX_LAT]_max_lon=[MAX_LON]_season=[SEASON]_dates=[DATE_RANGE].h5'

, where the latitudes and longitudes specify the spatial bounds of the file and the date range specifies the period covered by the time series.
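The naming convention above can be parsed with a regular expression. The following is a minimal sketch, not part of the official code repository. It assumes that the bounds are written as decimal numbers and that only the season and date fields may contain underscores; the example file name is made up for illustration.

```python
import re

# Pattern mirroring the documented key=value layout. The value formats
# (decimal degrees, free-form season and date strings) are assumptions.
PRETRAIN_NAME = re.compile(
    r"^min_lat=(?P<min_lat>[^_]+)"
    r"_min_lon=(?P<min_lon>[^_]+)"
    r"_max_lat=(?P<max_lat>[^_]+)"
    r"_max_lon=(?P<max_lon>[^_]+)"
    r"_season=(?P<season>.+?)"
    r"_dates=(?P<dates>.+)\.h5$"
)


def parse_pretrain_name(filename):
    """Split a pre-training file name into its documented fields."""
    match = PRETRAIN_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected pre-training file name: {filename!r}")
    fields = match.groupdict()
    # Bounds are converted to float; season and dates stay as raw strings.
    for key in ("min_lat", "min_lon", "max_lat", "max_lon"):
        fields[key] = float(fields[key])
    return fields


# Hypothetical file name, made up for illustration only.
example = (
    "min_lat=46.50_min_lon=9.80_max_lat=46.51_max_lon=9.81"
    "_season=winter_dates=2021-01-01_2021-01-08.h5"
)
info = parse_pretrain_name(example)
print(info["min_lat"], info["season"], info["dates"])
```

The season and date range are kept as strings because their exact format is not specified here.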

The naming convention for all fine-tuning and evaluation files is as follows:

'[ORIG_SENSOR]_[DATE]_[FSC]_[LATITUDE]_[LONGITUDE].h5'

, where 'ORIG_SENSOR' refers to the sensor from which the FSC values are derived (e.g. LC08 for Landsat-8 (OLI/TIRS)) and 'FSC' indicates the mean FSC value of the corresponding label file.
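As a rough illustration, such a file name can be split on underscores. This sketch assumes that none of the five fields contains an underscore and that FSC, latitude, and longitude are written as decimal numbers; the example name and the helper `parse_sample_name` are hypothetical.

```python
from pathlib import Path


def parse_sample_name(filename):
    """Split '[ORIG_SENSOR]_[DATE]_[FSC]_[LATITUDE]_[LONGITUDE]' into fields."""
    stem = Path(filename).stem  # drops the final extension (.h5 or .tif)
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields: {filename!r}")
    sensor, date, fsc, lat, lon = parts
    return {
        "sensor": sensor,
        "date": date,  # kept as a string; the date format is not documented here
        "mean_fsc": float(fsc),
        "latitude": float(lat),
        "longitude": float(lon),
    }


# Hypothetical file name, made up for illustration only.
sample = parse_sample_name("LC08_20210315_0.42_51.1234_-116.5678.h5")
print(sample)
```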

Each h5py file stores different datasets distinguished by their key. In the following, we refer to top-of-atmosphere (TOA), surface reflectance (SR), height (H), width (W), timestep (T), and channels (C). Time series cover the 8 days leading up to and including the prediction date, ordered from earliest to latest.

The datasets in each h5py file are:

s_t_h_x = sensor data of approx. 10 - 30 m spatial resolution. Data points have a shape of (100, 100, 8, 15) in the format (H, W, T, C).

Channels include: Sentinel-1 VV, VH, incidence angle + Sentinel-2 (TOA) B2, B3, B4, B8, B11, B12 + Landsat (TOA) B2-B7.

s_t_m_x = sensor data of 300 m spatial resolution. Data points have a shape of (5, 5, 8, 2) in the format (H, W, T, C).

Channels include: Sentinel-3 (TOA) Oa17, Oa21.

s_t_l_x = sensor data of approx. 500 m spatial resolution. Data points have a shape of (2, 2, 8, 11) in the format (H, W, T, C).

Channels include: MODIS (SR) B1-B7, VIIRS (SR) I1, I3, NDSI, NDVI.

sp_x = data of approx. 10 m spatial resolution that are constant across time. Data points have a shape of (100, 100, 14) in the format (H, W, C).

Channels include: Copernicus DEM elevation, slope, aspect, one-hot-encoded ESA Worldcover landcover.

t_x = time-varying data without spatial dimensions. Data points have a shape of (8, 9) in the format (T, C).

Channels include: VIIRS (SR) M5, M7, M10, M11, ERA5 skin temperature, temperature 2m, total precipitation sum, u component of wind, v component of wind.

st_x = data without spatial or temporal dimensions. Data points have a shape of (3,) in the format (C,).

Channels include: x, y, z (cartesian coordinates).
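The shapes listed above can be checked programmatically after download. The sketch below is one possible way to do this and is not taken from the official code repository. The helper names `shape_mismatches` and `check_file` are hypothetical, and `check_file` requires the third-party `h5py` package.

```python
from types import SimpleNamespace

# Expected array shapes per dataset key, as listed above.
EXPECTED_SHAPES = {
    "s_t_h_x": (100, 100, 8, 15),  # (H, W, T, C)
    "s_t_m_x": (5, 5, 8, 2),       # (H, W, T, C)
    "s_t_l_x": (2, 2, 8, 11),      # (H, W, T, C)
    "sp_x": (100, 100, 14),        # (H, W, C)
    "t_x": (8, 9),                 # (T, C)
    "st_x": (3,),                  # (C,)
}


def shape_mismatches(sample):
    """Return {key: (expected, found)} for keys that are missing or misshapen.

    `sample` is any mapping from key to an object with a `.shape` attribute,
    for example an open h5py.File or a dict of NumPy arrays.
    """
    problems = {}
    for key, expected in EXPECTED_SHAPES.items():
        found = tuple(sample[key].shape) if key in sample else None
        if found != expected:
            problems[key] = (expected, found)
    return problems


def check_file(path):
    """Open one .h5 file read-only and report shape mismatches."""
    import h5py  # third-party; imported lazily so the checker above has no dependencies

    with h5py.File(path, "r") as handle:
        return shape_mismatches(handle)


# Stand-in sample with one deliberately wrong shape, to show the report format.
fake = {key: SimpleNamespace(shape=shape) for key, shape in EXPECTED_SHAPES.items()}
fake["t_x"] = SimpleNamespace(shape=(8, 8))
print(shape_mismatches(fake))
```

An empty result means that all six datasets are present with the documented shapes.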

Missing data (e.g., due to missing coverage) are marked with a placeholder value of -9999 or -10000. However, these placeholders may not flag every missing value. To mask missing data reliably, use the validity masks described below.

Additionally, validity masks for each of the datasets are provided that state whether a pixel is valid (value of 1) or should be masked out in the training process due to missing data (flagged with value of 0). More information on how these masks were created can be found in our code repository.

valid_data_mask_s_t_h = validity mask for s_t_h_x
valid_data_mask_s_t_m = validity mask for s_t_m_x
valid_data_mask_s_t_l = validity mask for s_t_l_x
valid_data_mask_sp = validity mask for sp_x
valid_data_mask_t = validity mask for t_x
valid_data_mask_st = validity mask for st_x
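A validity mask can be applied to its dataset as sketched below with NumPy. This is an illustration on synthetic arrays, not the masking used in the official training code. It assumes that each mask has the same shape as its dataset or can be broadcast against it, which should be verified against the files; the helper `apply_validity_mask` is hypothetical.

```python
import numpy as np


def apply_validity_mask(data, mask, fill_value=np.nan):
    """Replace entries flagged invalid (mask == 0) with `fill_value`.

    Assumes `mask` has the same shape as `data` or broadcasts against it.
    """
    data = np.asarray(data, dtype=np.float32)
    mask = np.asarray(mask)
    return np.where(mask == 1, data, fill_value)


# Synthetic stand-in for t_x with shape (T, C) = (8, 9); values are made up.
t_x = np.ones((8, 9), dtype=np.float32)
t_x[0, 0] = -9999.0  # placeholder for a missing value
valid_data_mask_t = np.ones((8, 9), dtype=np.uint8)
valid_data_mask_t[0, 0] = 0  # flagged as invalid

masked = apply_validity_mask(t_x, valid_data_mask_t)
# Mean over time per channel, ignoring masked entries.
channel_means = np.nanmean(masked, axis=0)
print(channel_means)
```

Filling invalid entries with NaN keeps the placeholder values out of any statistics computed with NaN-aware functions.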

As the spatiotemporal location of the fine-tuning input data is limited by the availability of labels, which are only provided on cloud-free days, we have created a cloudy version of the fine-tuning inputs using the method described by Czerkawski et al. (2023). These inputs are complementary; either the normal version can be used to train the model under clear conditions or the cloudy version can be used under cloudy conditions.

The h5py files provided are pre-processed versions of the GeoTIFF files originally exported from Google Earth Engine. We did not store the original files due to storage size limitations. Retrieving the original files is possible and documented, and can be done using our provided code.

Labels

Comprises the following files:

Fine-tuning:

"finetune_labels_tifs.tar.xz"

Evaluation:

"evaluate_rockies_labels_tifs.tar.xz"
"evaluate_switzerland_labels_tifs.tar.xz"

SnowGalileo's FSC labels were created from Landsat-8 and Landsat-9 imagery using the SnowPEx algorithm (Ripper et al., 2019; as described in Koehler et al., 2022).

The naming convention for all label files is as follows:

'[ORIG_SENSOR]_[DATE]_[FSC]_[LATITUDE]_[LONGITUDE].tif'

, where 'ORIG_SENSOR' refers to the sensor from which the FSC values are derived (e.g. LC08 for Landsat-8 (OLI/TIRS)) and 'FSC' indicates the mean FSC value of the respective file.

Each label has a resolution of 100 m, so each image measures 10 x 10 pixels (a 1 km x 1 km ground area captured at 100 m resolution). Values range from 0 to 1, where 0 corresponds to 0 % FSC and 1 corresponds to 100 % FSC.

All files are provided as compressed tar archives. Once you have downloaded and extracted the datasets, you will need about 837 GB of storage space for them (573 GB for pre-training, 251 GB for fine-tuning, and about 13 GB for evaluation data).
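The .tar.xz archives can be extracted with Python's standard library, for example as sketched below. The helper `extract_archive` is hypothetical, and the demonstration runs on a tiny throwaway archive rather than on one of the real files.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a .tar.xz archive into `target_dir` and return that directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:xz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that
            # reject unsafe members such as absolute paths.
            archive.extractall(target, filter="data")
        else:
            archive.extractall(target)
    return target


# Demonstration on a tiny throwaway archive (not one of the real files).
with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    (tmp / "sample.txt").write_text("demo")
    demo_archive = tmp / "demo.tar.xz"
    with tarfile.open(demo_archive, mode="w:xz") as archive:
        archive.add(tmp / "sample.txt", arcname="demo/sample.txt")
    out_dir = extract_archive(demo_archive, tmp / "extracted")
    extracted_text = (out_dir / "demo" / "sample.txt").read_text()
print(extracted_text)
```

For the real data, a call such as extract_archive("pretrain_inputs_h5pys.tar.xz", "data/pretrain") would be used, where "data/pretrain" is a placeholder target directory. Check the available disk space against the figures above before extracting.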

Further information about the creation of the labels, as well as the usage of the files, can be found in our accompanying repository. Further information about data sampling, the exact data products used, and processing will be available in the respective paper upon publication.

While the data was created to train and evaluate SnowGalileo, its usage is not bound to the model and we highly encourage practitioners to extend its usage for training and evaluating future AI models.

Checkpoints

This repository also includes all final model checkpoints for SnowGalileo.

"checkpoints_snowgalileo_pretrain/" contains the checkpoints after pre-training on snow-covered mountains and "checkpoints_snowgalileo_finetune/" contains the checkpoints after fine-tuning on fractional snow cover. Our code repository will contain more information on how to integrate the checkpoints into the training and evaluation of models.

The SnowGalileo-finetune checkpoints use the following naming convention:

‘[CONDITION]_[INIT]_[SEED]_[ID].pth’

, where

CONDITION is one of “clear” (no clouds + HR in prediction date), “clouds” (clouds + HR in prediction date), “no_hr” (no clouds + no HR in prediction date), or “clouds_no_hr” (clouds + no HR in prediction date). This variable indicates on which condition the model has been fine-tuned (e.g., a “clouds”-model will have been trained on data with generated clouds, while the “clear” model has only seen cloud-free images during training).
INIT is either “pretrained” for a pre-trained model or “random” when the model was not pre-trained before the FSC training process.
SEED is the random seed (we train all models using seeds 42, 20, and 90).
ID is a unique run id.
For the ablation runs, we added “5000samples” or “10000samples” at the beginning of the file name. This means that the model was fine-tuned on either 5000 or 10000 random samples of the actual fine-tuning dataset (all other models are fine-tuned on ~23000 samples).
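Because some CONDITION values contain underscores themselves, splitting checkpoint names on underscores is ambiguous; a regular expression that lists the four conditions explicitly avoids this. The sketch below assumes that the ablation prefix is separated from the rest of the name by an underscore, which is not stated above. The example names and the helper `parse_checkpoint_name` are hypothetical.

```python
import re

CHECKPOINT_NAME = re.compile(
    r"^(?:(?P<ablation>5000samples|10000samples)_)?"  # optional ablation prefix (separator assumed)
    r"(?P<condition>clouds_no_hr|no_hr|clouds|clear)_"
    r"(?P<init>pretrained|random)_"
    r"(?P<seed>\d+)_"
    r"(?P<run_id>.+)\.pth$"
)


def parse_checkpoint_name(filename):
    """Split a fine-tuned checkpoint file name into its documented fields."""
    match = CHECKPOINT_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected checkpoint file name: {filename!r}")
    fields = match.groupdict()  # 'ablation' is None for regular runs
    fields["seed"] = int(fields["seed"])
    return fields


# Hypothetical checkpoint names, made up for illustration only.
regular = parse_checkpoint_name("clouds_no_hr_pretrained_42_run01.pth")
ablation = parse_checkpoint_name("5000samples_clear_random_20_run02.pth")
print(regular)
print(ablation)
```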

Note: The checkpoints of the trained baseline models (random forest, support vector regressor, MLP) are not included in this repository due to file size limitations. However, if you are interested in them, please contact marlena1@gmx.de.

References

Czerkawski, M., Atkinson, R., Michie, C., & Tachtatzis, C. (2023). SatelliteCloudGenerator: Controllable cloud and shadow synthesis for multi-spectral optical satellite images. Remote Sensing, 15(17), 4138.

Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment.

Koehler, J., Bauer, A., Dietz, A. J., & Kuenzer, C. (2022). Towards forecasting future snow cover dynamics in the European Alps—The potential of long optical remote-sensing time series. Remote Sensing, 14(18), 4461.

Ripper, E., Schwaizer, G., Nagler, T., Metsämäki, S., Törmä, M., Fernandes, R., Crawford, C. J., Painter, T. H., & Rittger, K. (2019). Guidelines for the generation of snow extent products from high resolution optical sensors (Final Deliverable D8). SnowPEx. https://snowpex.enveo.at/doc/D08_Guidelines_for_the_generation_of_snow_extent_products_from_HR_optical_sensors_FINAL_v2.1.pdf

Files

Files (44.7 GB)

Name	MD5	Size
checkpoints_snowgalileo_finetune.zip	439a17d7ec670e68be5807ad9066c106	777.3 MB
checkpoints_snowgalileo_pretrain.zip	301d25bafc0f41377015edf0684e6fd3	88.3 MB
evaluate_rockies_inputs_h5pys.tar.xz	f730d87b20a4f6f993f16f0083421302	225.3 MB
evaluate_rockies_inputs_with_clouds_h5pys.tar.xz	ca8a4e6e9c1e76305ab34dd055113d78	283.8 MB
evaluate_rockies_labels_tifs.tar.xz	fad704d7a2fa6dfd967d88218211e70b	147.3 kB
evaluate_switzerland_inputs_h5pys.tar.xz	30e111436d13eda27c7134088030cfee	261.5 MB
evaluate_switzerland_inputs_with_clouds_h5pys.tar.xz	37e3f13dd3d2e14ff80a8f4efe967167	356.2 MB
evaluate_switzerland_labels_tifs.tar.xz	51ce724c02396484cc7fd1e31abf1712	184.0 kB
finetune_inputs_h5pys.xz	4105a6330cb116a3196ce67915fc5e30	7.9 GB
finetune_inputs_with_clouds_h5pys.tar.xz	f1f9de2a42e6398a360d1f0778e7ee8b	12.2 GB
finetune_labels_tifs.xz	bc184a287cd1635627afecefd61222b1	6.0 MB
pretrain_inputs_h5pys.tar.xz	360a6b06438ee8b18a15da2ec8e6160d	22.6 GB
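Given the size of the archives, it may be worth verifying each download against the MD5 checksum in the file list above before extracting it. A minimal sketch using Python's standard library follows; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the demonstration uses a small temporary file rather than a real archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_md5.lower()


# Demonstration on a small temporary file containing the bytes b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_ok = matches_checksum(tmp.name, "5d41402abc4b2a76b9719d911017c592")
os.remove(tmp.name)
print(demo_ok)
```

For a downloaded file, the call would look like matches_checksum("checkpoints_snowgalileo_pretrain.zip", "301d25bafc0f41377015edf0684e6fd3"). Reading in chunks keeps memory usage low for the multi-gigabyte archives.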

Additional details

European Space Agency
AI4Science 4000143295/23/I-DT
