Published May 16, 2025 | Version v1
Thesis | Open Access

Arctic Watershed Sentinel-2 RGB Images and Labels for Image Segmentation using the Segmentation Zoo program

Authors/Creators

  1. William & Mary

Contributors

  1. William & Mary

Description


## Overview
* Watershed dataset and files for testing the [segmentation gym](https://github.com/Doodleverse/segmentation_gym) program for image segmentation
* Watershed dataset collected by Noah Rupert, Aayla Kastning, and Addison Green
* Dataset consists of Sentinel-2 satellite imagery collected across the Arctic.
* Imagery spans May through July of each year from 2019 to 2023.
* Label-image pairs were created by Noah Rupert, W&M, using the labeling program [Doodler](https://github.com/Doodleverse/dash_doodler).

Download and unzip the file to replicate our results or adapt it for similar purposes. Scripts are available in the personal fork (https://github.com/ncrupert/segmentation_gym); see the relevant pages of the [segmentation gym wiki](https://github.com/Doodleverse/segmentation_gym/wiki) for further explanation.

This dataset and the associated models were made by Noah Rupert, with instruction and guidance from Joanmarie Del Vecchio, to determine whether machine learning models can identify water tracks as accurately as a human observer.

## File structure

```{sh}
/Users/Someone/my_segmentation_zoo_datasets
├── config
│   └── *.json
├── capehatteras_data
│   ├── fromDoodler
│   │   ├── images
│   │   └── labels
│   ├── npzForModel
│   └── toPredict
├── modelOut
│   └── *.png
└── weights
    └── *.h5
```

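After unzipping, the layout can be sanity-checked programmatically. A minimal sketch, with the root path supplied by the user and the expected subfolders taken from the tree above:

```python
from pathlib import Path

# Subfolders expected under the dataset root, per the tree above.
EXPECTED = [
    "config",
    "capehatteras_data/fromDoodler/images",
    "capehatteras_data/fromDoodler/labels",
    "capehatteras_data/npzForModel",
    "capehatteras_data/toPredict",
    "modelOut",
    "weights",
]

def missing_folders(root):
    """Return the expected subfolders that are absent under root."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).is_dir()]
```

An empty return value means the unzipped dataset matches the documented layout.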
There are 3 config files:
1. `/watersheds_test_resunet.json`
2. `/watersheds_test_segformer.json`
3. `/watersheds_test_vanilla_unet.json`

The first and third files are for the res-unet and vanilla U-Net models, respectively; they differ only in kernel size specification. The second file is for the SegFormer model, which was the only model used for our purposes.

The SegFormer config file contains the following settings (with explanatory comments):

```{sh}
{
  "TARGET_SIZE": [256,256], # size of the imagery used to train the model
  "MODEL": "segformer", # model name
  "NCLASSES": 2, # number of classes
  "BATCH_SIZE": 8, # number of images/labels per batch
  "N_DATA_BANDS": 3, # number of image bands
  "DO_TRAIN": true, # if false, the model will not train; the program will load the model weights and test the model on the validation subset
  "PATIENCE": 10, # number of epochs with no model improvement before training is aborted
  "MAX_EPOCHS": 100, # maximum number of training epochs
  "VALIDATION_SPLIT": 0.5, # proportion of images to use for validation
  "RAMPUP_EPOCHS": 20, # [LR-scheduler] epochs to ramp up to the maximum learning rate
  "SUSTAIN_EPOCHS": 0.0, # [LR-scheduler] epochs to sustain at the maximum
  "EXP_DECAY": 0.9, # [LR-scheduler] decay rate
  "START_LR": 1e-7, # [LR-scheduler] starting learning rate
  "MIN_LR": 1e-7, # [LR-scheduler] minimum learning rate; will not go below this rate
  "MAX_LR": 1e-4, # [LR-scheduler] maximum learning rate; will not exceed this rate
  "FILTER_VALUE": 0, # if >0, the size of a median filter to apply to outputs (not recommended unless outputs are noisy)
  "DOPLOT": true, # if true, makes plots
  "ROOT_STRING": "test", # data file (npz) prefix string
  "USEMASK": false, # use the convention 'mask' in label image file names, instead of the preferred 'label'
  "AUG_ROT": 5, # [augmentation] rotation in degrees
  "AUG_ZOOM": 0.05, # [augmentation] zoom as a proportion
  "AUG_WIDTHSHIFT": 0.05, # [augmentation] random width shift as a proportion
  "AUG_HEIGHTSHIFT": 0.05, # [augmentation] random height shift as a proportion
  "AUG_HFLIP": true, # [augmentation] if true, randomly apply horizontal flips
  "AUG_VFLIP": false, # [augmentation] if true, randomly apply vertical flips
  "AUG_LOOPS": 10, # [augmentation] number of portions to split the data into (>2 recommended to save memory)
  "AUG_COPIES": 5, # [augmentation] number of augmented copies to make
  "SET_GPU": "0", # which GPU to use; if multiple, comma-separate, e.g. '0,1,2'; use "-1" for CPU
  "LOSS_WEIGHTS": false, # if true, apply per-class weights to the loss function
  "SET_PCI_BUS_ID": true, # if true, make Keras aware of the PCI bus ID (advanced or nonstandard GPU usage)
  "TESTTIMEAUG": true, # if true, apply test-time augmentation in inference mode
  "WRITE_MODELMETADATA": true, # if true, write model metadata per image in inference mode
  "OTSU_THRESHOLD": true # if true (NCLASSES=2 only), use a per-image Otsu threshold rather than a 0.5 decision boundary on softmax scores
}
```
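Because the config mixes JSON with `#` comments, it is not strictly valid JSON. A minimal sketch of reading such a file (the comment-stripping approach here is an assumption for illustration, not the program's own loader, and assumes no `#` appears inside a quoted string):

```python
import json
import re

def load_commented_config(path):
    """Read a JSON config whose lines may carry trailing '#' comments."""
    with open(path) as f:
        raw = f.read()
    # Drop everything from a '#' to the end of each line, then parse as JSON.
    stripped = re.sub(r"#.*$", "", raw, flags=re.MULTILINE)
    return json.loads(stripped)
```

For example, `load_commented_config("watersheds_test_segformer.json")["MODEL"]` would return `"segformer"` for the file above.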

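The `OTSU_THRESHOLD` option swaps the fixed 0.5 decision boundary for a per-image Otsu threshold on the softmax scores. A self-contained illustration of the idea (not Segmentation Gym's own code):

```python
import numpy as np

def otsu_threshold(scores, nbins=256):
    """Return the score cutoff that maximizes between-class variance (Otsu's method)."""
    hist, edges = np.histogram(scores, bins=nbins, range=(0.0, 1.0))
    p = hist.astype(float) / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)            # cumulative weight of the low-score class
    w1 = 1.0 - w0                # weight of the high-score class
    cum_mean = np.cumsum(p * centers)
    mu0 = cum_mean / np.where(w0 > 0, w0, 1.0)
    mu1 = (cum_mean[-1] - cum_mean) / np.where(w1 > 0, w1, 1.0)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]

# Fixed 0.5 cutoff vs. a per-image Otsu cutoff on synthetic bimodal scores:
rng = np.random.default_rng(0)
scores = np.clip(np.concatenate([rng.normal(0.2, 0.05, 500),
                                 rng.normal(0.8, 0.05, 500)]), 0, 1)
mask_fixed = scores > 0.5
mask_otsu = scores > otsu_threshold(scores)
```

For well-separated score distributions the two masks agree; the Otsu cutoff helps when an image's scores are skewed away from 0.5.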
## watersheds_test_run data
Folder containing all the model input data


```{sh}
/modelOut
├── train_data
│   ├── train_images
│   ├── train_labels
│   └── Train_npzs
└── val_data
    ├── val_images
    ├── val_labels
    └── val_npzs
```
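The npz files bundle each image with its label arrays. A quick way to inspect one (the specific array key names inside the files are not documented here, so this prints whatever keys are present):

```python
import numpy as np

def inspect_npz(path):
    """Print the name, shape, and dtype of every array stored in an npz file."""
    with np.load(path) as data:
        for key in data.files:
            arr = data[key]
            print(f"{key}: shape={arr.shape}, dtype={arr.dtype}")
```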

## modelOut
PNG files containing example model outputs from the train (`_train_` in filename) and validation (`_val_` in filename) subsets, as well as an image of training loss and accuracy curves (`trainhist` in the filename). There are two sets of these files: those from the residual U-Net trained with Dice loss contain `resunet` in their names, and those from the vanilla U-Net contain `vanilla_unet`.

## weights
There is a model weights file (`.h5`) associated with each config file.

Files

watersheds_test_run.zip (209.8 MB)
md5:761469ef1dc323cb037b78bde6771880

Additional details

Funding

U.S. National Science Foundation
Elements: A workflow for efficient and reproducible permafrost geomorphology analysis #2311319

Dates

Submitted
2025-05-14

Software

Repository URL
https://github.com/ncrupert/segmentation_gym
Programming language
Python