Dataset for marine vessel detection from Sentinel 2 images in the Finnish coast

Mäyrä, Janne; Jokinen, Ari-Pekka

doi:10.5281/zenodo.10046342

Published May 13, 2024 | Version v1

Dataset Open

1. Finnish Environment Institute

Contributors

Rights holder:

Finnish Environment Institute

This dataset contains annotated marine vessels from 15 different Sentinel-2 products, used for training object detection models for marine vessel detection. The vessels are annotated as bounding boxes that also cover some of the wake, if present.

Source data

Individual products used to generate annotations are shown in the following table:

Location	Product name
Archipelago sea	S2B_MSIL1C_20220619T100029_N0400_R122_T34VEM_20220619T104419
	S2A_MSIL1C_20220721T095041_N0400_R079_T34VEM_20220721T115325
	S2A_MSIL1C_20220813T095601_N0400_R122_T34VEM_20220813T120233
Gulf of Finland	S2B_MSIL1C_20220606T095029_N0400_R079_T35VLG_20220606T105944
	S2B_MSIL1C_20220626T095039_N0400_R079_T35VLG_20220626T104321
	S2A_MSIL1C_20220721T095041_N0400_R079_T35VLG_20220721T115325
Bothnian Bay	S2A_MSIL1C_20220627T100611_N0400_R022_T34WFT_20220627T134958
	S2B_MSIL1C_20220712T100559_N0400_R022_T34WFT_20220712T121613
	S2B_MSIL1C_20220828T095549_N0400_R122_T34WFT_20220828T104748
Bothnian Sea	S2B_MSIL1C_20210714T100029_N0301_R122_T34VEN_20210714T121056
	S2A_MSIL1C_20220624T100041_N0400_R122_T34VEN_20220624T120211
	S2A_MSIL1C_20220813T095601_N0400_R122_T34VEN_20220813T120233
Kvarken	S2A_MSIL1C_20220617T100611_N0400_R022_T34VER_20220617T135008
	S2B_MSIL1C_20220712T100559_N0400_R022_T34VER_20220712T121613
	S2A_MSIL1C_20220826T100611_N0400_R022_T34VER_20220826T135136

Even though the reference data IDs are for L1C products, L2A products from the same acquisition dates can be used along with the annotations. However, Sen2Cor has been known to produce incorrect reflectance values for water bodies.

The raw products can be acquired from the Copernicus Data Space Ecosystem.
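
The product names in the table above follow the standard Sentinel-2 naming convention, so the tile and acquisition date can be read directly from the name. The following is a minimal sketch of such a parser; the `S2Product` type and `parse_product_name` function are illustrative names, not part of the dataset or any library.

```python
import re
from datetime import datetime
from typing import NamedTuple


class S2Product(NamedTuple):
    """Fields of a Sentinel-2 product name (illustrative helper type)."""
    platform: str           # e.g. "S2A" or "S2B"
    level: str              # e.g. "L1C"
    sensing_time: datetime  # acquisition start time
    baseline: str           # processing baseline, e.g. "N0400"
    relative_orbit: str     # e.g. "R122"
    tile: str               # tile ID without the leading "T", e.g. "34VEM"
    discriminator: str      # product discriminator timestamp


_PATTERN = re.compile(
    r"^(S2[A-Z])_MSI(L1C|L2A)_(\d{8}T\d{6})_(N\d{4})_(R\d{3})"
    r"_T(\d{2}[A-Z]{3})_(\d{8}T\d{6})$"
)


def parse_product_name(name: str) -> S2Product:
    """Split a Sentinel-2 product name into its components."""
    match = _PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"Not a recognised Sentinel-2 product name: {name!r}")
    platform, level, sensed, baseline, orbit, tile, discriminator = match.groups()
    sensing_time = datetime.strptime(sensed, "%Y%m%dT%H%M%S")
    return S2Product(platform, level, sensing_time, baseline, orbit, tile, discriminator)


product = parse_product_name(
    "S2B_MSIL1C_20220619T100029_N0400_R122_T34VEM_20220619T104419"
)
gpkg_name = f"{product.tile}.gpkg"                     # annotation file for the tile
layer_name = product.sensing_time.strftime("%Y%m%d")   # layer named by acquisition date
```

The tile and date obtained this way correspond to the annotation file and layer names used in this dataset.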

Annotations

The annotations are bounding boxes drawn around marine vessels so that some of the wake, if present, is also contained within the box. The data are distributed as GeoPackage files: one GeoPackage corresponds to a single Sentinel-2 tile, and each package has a separate layer for each product, as shown below:

T34VEM
|-20220619
|-20220721
|-20220813

All layers have a column `id`, which has the value `boat` for all annotations.

The CRS is EPSG:32634 for all products except those for the Gulf of Finland (35VLG), which are in EPSG:32635. This keeps the bounding boxes aligned with the pixels in the imagery.
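
Because a GeoPackage is an SQLite database, the layers and their spatial reference IDs can be inspected with the Python standard library alone. The sketch below reads the `gpkg_contents` table defined by the GeoPackage specification; `list_gpkg_layers` is an illustrative name. It assumes that `srs_id` matches the EPSG code, which is usually the case for files written by GDAL. The authoritative code is stored in `gpkg_spatial_ref_sys`.

```python
import sqlite3
from typing import List, Tuple


def list_gpkg_layers(path: str) -> List[Tuple[str, int]]:
    """Return (layer name, srs_id) pairs for the feature layers of a GeoPackage."""
    con = sqlite3.connect(path)
    try:
        rows = con.execute(
            "SELECT table_name, srs_id FROM gpkg_contents "
            "WHERE data_type = 'features' ORDER BY table_name"
        ).fetchall()
    finally:
        con.close()
    return [(name, srs_id) for name, srs_id in rows]


# Usage (path is a placeholder):
# for layer, srs_id in list_gpkg_layers("<path_to_gpkg>"):
#     print(layer, srs_id)
```

For reading the geometries themselves, a GIS library such as GeoPandas is the more practical choice. This sketch only lists what a file contains.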

As tiles 34VEM and 34VEN have an overlap of 9.5x100 km, 34VEN is not annotated from the overlapping part to prevent data leakage between splits.

Annotation process

The minimum size for an object to be considered a potential marine vessel was set to 2x2 pixels. Three separate acquisitions for each location were used to screen the smallest objects: if an object was located at the same place in all three images, it was left unannotated. The data were annotated by two experts.

Product name	Number of annotations
S2B_MSIL1C_20220619T100029_N0400_R122_T34VEM_20220619T104419	591
S2A_MSIL1C_20220721T095041_N0400_R079_T34VEM_20220721T115325	1518
S2A_MSIL1C_20220813T095601_N0400_R122_T34VEM_20220813T120233	1368
S2B_MSIL1C_20220606T095029_N0400_R079_T35VLG_20220606T105944	248
S2B_MSIL1C_20220626T095039_N0400_R079_T35VLG_20220626T104321	1206
S2A_MSIL1C_20220721T095041_N0400_R079_T35VLG_20220721T115325	971
S2A_MSIL1C_20220627T100611_N0400_R022_T34WFT_20220627T134958	122
S2B_MSIL1C_20220712T100559_N0400_R022_T34WFT_20220712T121613	162
S2B_MSIL1C_20220828T095549_N0400_R122_T34WFT_20220828T104748	98
S2B_MSIL1C_20210714T100029_N0301_R122_T34VEN_20210714T121056	450
S2A_MSIL1C_20220624T100041_N0400_R122_T34VEN_20220624T120211	424
S2A_MSIL1C_20220813T095601_N0400_R122_T34VEN_20220813T120233	399
S2A_MSIL1C_20220617T100611_N0400_R022_T34VER_20220617T135008	83
S2B_MSIL1C_20220712T100559_N0400_R022_T34VER_20220712T121613	183
S2A_MSIL1C_20220826T100611_N0400_R022_T34VER_20220826T135136	88

Annotation statistics

Sentinel-2 images have a spatial resolution of 10 m, so the statistics below can be converted to pixel sizes by dividing them by 10 (diameter) or 100 (area).

	mean	min	25%	50%	75%	max
Area (m²)	5305.7	567.9	1629.9	2328.2	5176.3	414795.7
Diameter (m)	92.5	33.9	57.9	69.4	108.3	913.9
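
As a small worked example of the conversion described above, the sketch below turns the median values from the table into pixel units. The function names are illustrative.

```python
PIXEL_SIZE_M = 10.0  # Sentinel-2 spatial resolution used in this dataset


def length_m_to_px(length_m: float, pixel_size_m: float = PIXEL_SIZE_M) -> float:
    """Convert a length in metres to a length in pixels."""
    return length_m / pixel_size_m


def area_m2_to_px(area_m2: float, pixel_size_m: float = PIXEL_SIZE_M) -> float:
    """Convert an area in square metres to an area in pixels."""
    return area_m2 / pixel_size_m ** 2


median_diameter_px = length_m_to_px(69.4)  # about 6.9 pixels
median_area_px = area_m2_to_px(2328.2)     # about 23.3 pixels
```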

Because most annotations also cover most of the vessel's wake, the bounding boxes are significantly larger than a typical boat. A few annotations are larger than 100 000 m². These are cruise or cargo ships, rather than e.g. smaller leisure boats, travelling along ordinal directions instead of cardinal directions, so their axis-aligned bounding boxes enclose much more area than the vessel itself.
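
To illustrate why heading affects box size, the sketch below computes the axis-aligned bounding box of a rectangle standing in for a vessel. The 300 m by 40 m dimensions are hypothetical and chosen only for illustration. Wakes are ignored, and would enlarge the boxes further.

```python
import math
from typing import Tuple


def aabb_size(length_m: float, width_m: float, heading_deg: float) -> Tuple[float, float]:
    """Return (east-west, north-south) extent of the axis-aligned bounding box
    of a rectangle whose long axis points along heading_deg (clockwise from north)."""
    theta = math.radians(heading_deg)
    dx = length_m * abs(math.sin(theta)) + width_m * abs(math.cos(theta))
    dy = length_m * abs(math.cos(theta)) + width_m * abs(math.sin(theta))
    return dx, dy


# Hypothetical 300 m x 40 m vessel
dx_n, dy_n = aabb_size(300.0, 40.0, 0.0)     # heading north (cardinal)
dx_ne, dy_ne = aabb_size(300.0, 40.0, 45.0)  # heading north-east (ordinal)

area_north = dx_n * dy_n         # 12 000 m²
area_northeast = dx_ne * dy_ne   # about 57 800 m²
```

In this example the same vessel produces a box almost five times larger when heading north-east than when heading north.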

Annotations typically have a diameter of less than 100 meters, and the largest diameters belong to the same kinds of instances as the largest bounding box areas.

Train-test-split

We used tiles 34VEN and 34VER as the test dataset. The results acquired using RGB mosaics generated from L1C images are shown in the table below.

Model	Fold	Precision	Recall	mAP50	mAP
yolov8n	1	0.820806	0.838353	0.842	0.403
yolov8s	4	0.843822	0.860479	0.865	0.422
yolov8m	4	0.858263	0.874616	0.880	0.453
yolov8l	1	0.840311	0.863553	0.862	0.443
yolov8x	1	0.855134	0.859865	0.876	0.450
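
The tile-level split described above can be expressed as a simple lookup. This is a sketch, not the authors' code: the function name is illustrative, and labelling the remaining tiles `train` assumes they form the pool used for training and validation folds.

```python
TEST_TILES = {"34VEN", "34VER"}  # test tiles named in this section


def split_for_tile(tile_id: str) -> str:
    """Return "test" for the held-out tiles and "train" for all others.

    Accepts tile IDs with or without the leading "T" (e.g. "T34VEN" or "34VEN").
    """
    tile = tile_id.upper()
    if len(tile) == 6 and tile.startswith("T"):
        tile = tile[1:]
    return "test" if tile in TEST_TILES else "train"


splits = {
    tile: split_for_tile(tile)
    for tile in ["34VEM", "34VEN", "34VER", "34WFT", "35VLG"]
}
```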

Before evaluating, the predictions for the test set are cleaned using the following steps:

1. All predictions whose centroid points are not located on water are discarded. The water mask used contains layers `jarvi` (Lakes), `meri` (Sea) and `virtavesialue` (Rivers as polygon geometry) from the Topographical database by the National Land Survey of Finland. Unfortunately, this also discards all points outside the Finnish borders.

2. All predictions whose centroid points are located on water rock areas are discarded. The mask is the layer `vesikivikko` (Water rock areas) from the Topographical database.

3. All predictions that contain an above water rock within the bounding box are discarded. The mask contains classes `38511`, `38512`, `38513` from the layer `vesikivi` in the Topographical database.

4. All predictions that contain a lighthouse or a sector light within the bounding box are discarded. Lighthouses and sector lights come from Väylävirasto data, where the `ty_njr` class ids are 1, 2, 3, 4, 5 and 8.

5. All predictions that are wind turbines, found in the Topographical database layer `tuulivoimalat`, are discarded.

6. All predictions that are obviously too large are discarded. The prediction is defined to be "too large" if either of its edges is longer than 750 meters.
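
The steps above can be sketched as a small filter. This is not the authors' implementation. It covers only the centroid checks (steps 1 and 2, through a caller-supplied predicate) and the size check (step 6). The `centroid_ok` callable is a hypothetical stand-in for a lookup against the water and water rock masks, and boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples in a metric CRS.

```python
from typing import Callable, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax) in metres

MAX_EDGE_M = 750.0  # size threshold from step 6


def centroid(box: Box) -> Tuple[float, float]:
    """Return the centre point of an axis-aligned box."""
    xmin, ymin, xmax, ymax = box
    return (xmin + xmax) / 2.0, (ymin + ymax) / 2.0


def is_too_large(box: Box, max_edge_m: float = MAX_EDGE_M) -> bool:
    """True if either edge of the box is longer than max_edge_m."""
    xmin, ymin, xmax, ymax = box
    return (xmax - xmin) > max_edge_m or (ymax - ymin) > max_edge_m


def clean_predictions(
    boxes: Iterable[Box],
    centroid_ok: Callable[[float, float], bool],
) -> List[Box]:
    """Keep boxes whose centroid passes centroid_ok and that are not too large.

    centroid_ok is a hypothetical mask lookup, e.g. "point is on water and
    not on a water rock area".
    """
    kept = []
    for box in boxes:
        x, y = centroid(box)
        if centroid_ok(x, y) and not is_too_large(box):
            kept.append(box)
    return kept
```

Steps 3 to 5 need polygon and point data from the Topographical database and Väylävirasto, so they are left out of this sketch.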

Model checkpoints are available on the Hugging Face platform: https://huggingface.co/mayrajeo/marine-vessel-detection-yolov8

Usage

The simplest way to chip the rasters into a suitable format and convert the data to COCO or YOLO format is to use geo2ml. First download the raw mosaics and convert them into GeoTIFF files, then use the following to generate the datasets.

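To generate a COCO format dataset, run:

from geo2ml.scripts import create_coco_dataset

raster_path = '<path_to_raster>'        # GeoTIFF converted from the mosaic
outpath = '<path_to_save_the_dataset>'  # output directory
poly_path = '<path_to_gpkg>'            # annotation GeoPackage for the tile
layer = '<date_of_raster>'              # layer matching the raster's acquisition date

# gridsize_x and gridsize_y set the chip size; 'id' is the annotation class column
create_coco_dataset(raster_path=raster_path, polygon_path=poly_path, target_column='id',
                    gpkg_layer=layer, outpath=outpath, save_grid=False,
                    dataset_name='<name_of_dataset>', gridsize_x=320, gridsize_y=320,
                    ann_format='box', min_bbox_area=0)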

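To generate a YOLO format dataset, run:

from geo2ml.scripts import create_yolo_dataset

raster_path = '<path_to_raster>'        # GeoTIFF converted from the mosaic
outpath = '<path_to_save_the_dataset>'  # output directory
poly_path = '<path_to_gpkg>'            # annotation GeoPackage for the tile
layer = '<date_of_raster>'              # layer matching the raster's acquisition date

# gridsize_x and gridsize_y set the chip size; 'id' is the annotation class column
create_yolo_dataset(raster_path=raster_path, polygon_path=poly_path, target_column='id',
                    gpkg_layer=layer, outpath=outpath, save_grid=False,
                    gridsize_x=320, gridsize_y=320, ann_format='box', min_bbox_area=0)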

Files

Files (2.4 MB)

Name	MD5	Size
34VEM.gpkg	1cf9efb4b471e7531ff63ab62572712a	852.0 kB
34VEN.gpkg	d7e3cff950d40951a7cfe47853030dda	409.6 kB
34VER.gpkg	31746315aeace4e523894e85ad2ce9ef	217.1 kB
34WFT.gpkg	b331d5a4a6dc4936d0dcc217acdd4d5e	217.1 kB
35VLG.gpkg	71d8dbd02c3e5032f1a244e705f4b13b	663.6 kB

Additional details

Funding: LIFE-IP BIODIVERSEA LIFE20 IPE/FI/000020, European Commission
