Published May 16, 2025 | Version v1
Thesis | Open Access

Arctic Watershed Sentinel-2 RGB Images and Labels for Image Segmentation using the Segmentation Zoo program

Authors/Creators

  1. William & Mary

Contributors

  1. William & Mary

Description


## Overview
* Watershed dataset and files for testing the [segmentation gym](https://github.com/Doodleverse/segmentation_gym) program for image segmentation
* Watershed dataset collected by Noah Rupert, Aayla Kastning, and Addison Green
* Dataset consists of Sentinel-2 satellite imagery collected across the Arctic.
* Imagery spans May through July of each year from 2019 to 2023.
* Label-image pairs were created by Noah Rupert, W&M, using the labeling program [Doodler](https://github.com/Doodleverse/dash_doodler).

Download and unzip the file to replicate our results or adapt it for similar purposes. Scripts are available in the personal fork (https://github.com/ncrupert/segmentation_gym); see the relevant pages of the [segmentation gym wiki](https://github.com/Doodleverse/segmentation_gym/wiki) for further explanation.

This dataset and the associated models were made by Noah Rupert, with instruction and guidance from Joanmarie Del Vecchio, to determine whether machine learning models can identify water tracks as accurately as a human observer.

## File structure

```{sh}
/Users/Someone/my_segmentation_zoo_datasets
├── config
│   └── *.json
├── capehatteras_data
│   ├── fromDoodler
│   │   ├── images
│   │   └── labels
│   ├── npzForModel
│   └── toPredict
├── modelOut
│   └── *.png
└── weights
    └── *.h5
```

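After unzipping, the layout can be sanity-checked programmatically. A minimal sketch, with the root path supplied by the user and the expected subfolders taken from the tree above:

```python
from pathlib import Path

# Subfolders expected under the dataset root, per the tree above.
EXPECTED = [
    "config",
    "capehatteras_data/fromDoodler/images",
    "capehatteras_data/fromDoodler/labels",
    "capehatteras_data/npzForModel",
    "capehatteras_data/toPredict",
    "modelOut",
    "weights",
]

def missing_folders(root):
    """Return the expected subfolders that are absent under root."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).is_dir()]
```

An empty return value means the unzipped dataset matches the documented layout.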
There are 3 config files:
1. `/watersheds_test_resunet.json`
2. `/watersheds_test_segformer.json`
3. `/watersheds_test_vanilla_unet.json`

The first and third files are for the res-unet and vanilla U-Net models, respectively; they differ only in kernel size specification. The second file is for the SegFormer model, which was the only model used for our purposes.

The SegFormer config file contains the following settings (with explanatory comments):

```{sh}
{
  "TARGET_SIZE": [256,256], # size of the imagery used to train the model
  "MODEL": "segformer", # model name
  "NCLASSES": 2, # number of classes
  "BATCH_SIZE": 8, # number of images/labels per batch
  "N_DATA_BANDS": 3, # number of image bands
  "DO_TRAIN": true, # if false, the model will not train; the program will load the model weights and test the model on the validation subset
  "PATIENCE": 10, # number of epochs with no model improvement before training is aborted
  "MAX_EPOCHS": 100, # maximum number of training epochs
  "VALIDATION_SPLIT": 0.5, # proportion of images to use for validation
  "RAMPUP_EPOCHS": 20, # [LR-scheduler] epochs to ramp up to the maximum learning rate
  "SUSTAIN_EPOCHS": 0.0, # [LR-scheduler] epochs to sustain at the maximum
  "EXP_DECAY": 0.9, # [LR-scheduler] decay rate
  "START_LR": 1e-7, # [LR-scheduler] starting learning rate
  "MIN_LR": 1e-7, # [LR-scheduler] minimum learning rate; will not go below this rate
  "MAX_LR": 1e-4, # [LR-scheduler] maximum learning rate; will not exceed this rate
  "FILTER_VALUE": 0, # if >0, the size of a median filter to apply to outputs (not recommended unless outputs are noisy)
  "DOPLOT": true, # if true, makes plots
  "ROOT_STRING": "test", # data file (npz) prefix string
  "USEMASK": false, # use the convention 'mask' in label image file names, instead of the preferred 'label'
  "AUG_ROT": 5, # [augmentation] rotation in degrees
  "AUG_ZOOM": 0.05, # [augmentation] zoom as a proportion
  "AUG_WIDTHSHIFT": 0.05, # [augmentation] random width shift as a proportion
  "AUG_HEIGHTSHIFT": 0.05, # [augmentation] random height shift as a proportion
  "AUG_HFLIP": true, # [augmentation] if true, randomly apply horizontal flips
  "AUG_VFLIP": false, # [augmentation] if true, randomly apply vertical flips
  "AUG_LOOPS": 10, # [augmentation] number of portions to split the data into (>2 recommended to save memory)
  "AUG_COPIES": 5, # [augmentation] number of augmented copies to make
  "SET_GPU": "0", # which GPU to use; if multiple, comma-separate, e.g. '0,1,2'; use "-1" for CPU
  "LOSS_WEIGHTS": false, # if true, apply per-class weights to the loss function
  "SET_PCI_BUS_ID": true, # if true, make Keras aware of the PCI bus ID (advanced or nonstandard GPU usage)
  "TESTTIMEAUG": true, # if true, apply test-time augmentation in inference mode
  "WRITE_MODELMETADATA": true, # if true, write model metadata per image in inference mode
  "OTSU_THRESHOLD": true # if true (NCLASSES=2 only), use a per-image Otsu threshold rather than a 0.5 decision boundary on softmax scores
}
```
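Because the config mixes JSON with `#` comments, it is not strictly valid JSON. A minimal sketch of reading such a file (the comment-stripping approach here is an assumption for illustration, not the program's own loader, and assumes no `#` appears inside a quoted string):

```python
import json
import re

def load_commented_config(path):
    """Read a JSON config whose lines may carry trailing '#' comments."""
    with open(path) as f:
        raw = f.read()
    # Drop everything from a '#' to the end of each line, then parse as JSON.
    stripped = re.sub(r"#.*$", "", raw, flags=re.MULTILINE)
    return json.loads(stripped)
```

For example, `load_commented_config("watersheds_test_segformer.json")["MODEL"]` would return `"segformer"` for the file above.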

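The `OTSU_THRESHOLD` option swaps the fixed 0.5 decision boundary for a per-image Otsu threshold on the softmax scores. A self-contained illustration of the idea (not Segmentation Gym's own code):

```python
import numpy as np

def otsu_threshold(scores, nbins=256):
    """Return the score cutoff that maximizes between-class variance (Otsu's method)."""
    hist, edges = np.histogram(scores, bins=nbins, range=(0.0, 1.0))
    p = hist.astype(float) / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)            # cumulative weight of the low-score class
    w1 = 1.0 - w0                # weight of the high-score class
    cum_mean = np.cumsum(p * centers)
    mu0 = cum_mean / np.where(w0 > 0, w0, 1.0)
    mu1 = (cum_mean[-1] - cum_mean) / np.where(w1 > 0, w1, 1.0)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]

# Fixed 0.5 cutoff vs. a per-image Otsu cutoff on synthetic bimodal scores:
rng = np.random.default_rng(0)
scores = np.clip(np.concatenate([rng.normal(0.2, 0.05, 500),
                                 rng.normal(0.8, 0.05, 500)]), 0, 1)
mask_fixed = scores > 0.5
mask_otsu = scores > otsu_threshold(scores)
```

For well-separated score distributions the two masks agree; the Otsu cutoff helps when an image's scores are skewed away from 0.5.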
## watersheds_test_run data
Folder containing all the model input data


```{sh}
/modelOut
├── train_data
│   ├── train_images
│   ├── train_labels
│   └── Train_npzs
└── val_data
    ├── val_images
    ├── val_labels
    └── val_npzs
```
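The npz files bundle each image with its label arrays. A quick way to inspect one (the specific array key names inside the files are not documented here, so this prints whatever keys are present):

```python
import numpy as np

def inspect_npz(path):
    """Print the name, shape, and dtype of every array stored in an npz file."""
    with np.load(path) as data:
        for key in data.files:
            arr = data[key]
            print(f"{key}: shape={arr.shape}, dtype={arr.dtype}")
```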

## modelOut
PNG files containing example model outputs from the train (`_train_` in filename) and validation (`_val_` in filename) subsets, as well as an image of training loss and accuracy curves (`trainhist` in the filename). There are two sets of these files: those from the residual U-Net trained with Dice loss contain `resunet` in their names, and those from the vanilla U-Net contain `vanilla_unet`.

## weights
There is a model weights file (`.h5`) associated with each config file.

Files

watersheds_test_run.zip (209.8 MB)
md5:761469ef1dc323cb037b78bde6771880

Additional details

Funding

U.S. National Science Foundation
Elements: A workflow for efficient and reproducible permafrost geomorphology analysis #2311319

Dates

Submitted
2025-05-14

Software

Repository URL
https://github.com/ncrupert/segmentation_gym
Programming language
Python