Histo-Miner: NucSeg and TumSeg datasets

Sancéré, Lucas; Lorenz, Carina; Brägelmann, Johannes; Bozek, Katarzyna; Helbig, Doris

doi:10.5281/zenodo.15973142

Published July 16, 2025 | Version v2

Dataset | Open

1. University of Cologne
2. Department for Dermatology, University Hospital Cologne

I. General

Training datasets used for the Histo-Miner paper.

Two datasets were used to train SCC-Hovernet:

UncuratedSCC
NucSeg

One dataset was used to train SCC-Segmenter:

TumSeg

II. NucSeg Datasets

The dataset is available here: NucSeg.zip.

The dataset consists of annotated H&E patches in which the cell nuclei are segmented and classified. In total, 47,392 nuclei were labeled (3,135 granulocytes, 12,263 lymphocytes, 3,271 plasma cells, 11,526 stromal cells, 17,197 tumor cells). The dataset is composed of 6,816 patches of 560x560 pixels with 70% overlap, stored as a 5D numpy array according to the Hovernet data format requirements. The patches come from 24 WSIs of 20 cSCC patients. The image resolutions are a mix of 40x and 20x (see IV. Patient IDs for more information). The channels of the arrays are [RGB, inst, type], where:

'RGB' is the 3-channel raw image
'inst' is the instance segmentation ground truth: every pixel ranges from 0 to N, where 0 is background and N is the number of nuclear instances
'type' is the nuclear type ground truth: every pixel ranges from 0 to K, where 0 is background and K is the number of classes
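
To illustrate the channel layout described above, here is a minimal sketch that splits one patch into its three components. It assumes that a single patch is available as an array of shape (H, W, 5) with channels ordered [R, G, B, inst, type]. The function name `split_patch` and the synthetic patch are hypothetical and are not part of Histo-Miner.

```python
import numpy as np


def split_patch(patch):
    """Split one (H, W, 5) patch into RGB image, instance map and type map."""
    if patch.ndim != 3 or patch.shape[-1] != 5:
        raise ValueError(f"expected shape (H, W, 5), got {patch.shape}")
    rgb = patch[..., :3].astype(np.uint8)    # raw H&E image
    inst = patch[..., 3].astype(np.int32)    # 0 = background, 1..N = nucleus ids
    types = patch[..., 4].astype(np.int32)   # 0 = background, 1..K = class ids
    return rgb, inst, types


# Synthetic stand-in for a real patch (assumption: one patch is 560x560x5).
demo_patch = np.zeros((560, 560, 5), dtype=np.float64)
demo_patch[10:20, 10:20, 3] = 1   # one nucleus with instance id 1
demo_patch[10:20, 10:20, 4] = 5   # arbitrary class id chosen for this demo

rgb, inst, types = split_patch(demo_patch)
n_nuclei = len(np.unique(inst)) - 1   # subtract the background label 0
```

The shape check is there because the actual array layout inside NucSeg.zip should be confirmed on the downloaded files before relying on this indexing.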

This format suits the training of Hovernet-like architectures but is not convenient for visualization or for training other models. For this reason, the dataset is also available in a more conventional format: NucSeg_OriginalFormat.zip. In this version, the 'RGB', 'inst' and 'type' data are saved as numpy files in separate folders (RawImages, InstanceMaps, ClassMaps). For instance, the functions save2dnpy_2png and save3dnpy_2png from histo_miner.utils.filemanagement can be applied to generate PNG images from these files. This version contains 1,707 non-overlapping H&E patches of 256x256 pixels.
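
With the maps stored separately, per-class nucleus counts can be computed for a patch. The sketch below assumes that an instance map and a class map are 2D integer arrays of the same shape, and that a nucleus takes the majority class of its pixels. The function `count_nuclei_per_class` is hypothetical and the arrays are synthetic.

```python
from collections import Counter

import numpy as np


def count_nuclei_per_class(inst_map, class_map):
    """Count nuclei per class id, using the majority class inside each nucleus."""
    counts = Counter()
    for inst_id in np.unique(inst_map):
        if inst_id == 0:              # skip background
            continue
        labels = class_map[inst_map == inst_id].astype(np.int64)
        labels = labels[labels > 0]   # ignore unlabeled pixels
        if labels.size == 0:          # nucleus without any class label
            continue
        majority = int(np.bincount(labels).argmax())
        counts[majority] += 1
    return dict(counts)


# Synthetic 256x256 maps with three nuclei (class ids are arbitrary here).
inst_demo = np.zeros((256, 256), dtype=np.int32)
class_demo = np.zeros((256, 256), dtype=np.int32)
inst_demo[0:10, 0:10] = 1
class_demo[0:10, 0:10] = 2
inst_demo[20:30, 20:30] = 2
class_demo[20:30, 20:30] = 2
inst_demo[40:50, 40:50] = 3
class_demo[40:50, 40:50] = 4

counts_demo = count_nuclei_per_class(inst_demo, class_demo)
```

Real maps could be loaded with `np.load` from the InstanceMaps and ClassMaps folders. The file naming inside these folders is not described here, so it has to be checked in the archive.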

As described in the paper, the SCC Hovernet model was first pretrained on an uncurated dataset, meaning that its segmentations and cell classifications contain several errors that have not been quantified. It is not recommended to use this dataset for training on its own, only for pre-training as a first step before training on another dataset. This uncurated dataset is available here: UncuratedSCC.zip. Its file organization follows that of NucSeg.

III. TumSeg Dataset

The dataset is available here: TumSeg.zip.

The dataset consists of pairs of raw WSI images and binary segmentation images in which the tumor region is annotated. 144 WSIs of 125 cSCC patients were collected for this dataset. The WSIs are downsampled to a resolution of 1.25x.
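
As a small example of working with the binary segmentation images, the sketch below computes the fraction of pixels annotated as tumor. It assumes that a mask has already been loaded as a 2D array in which nonzero pixels mark the tumor region. The image file format is not specified here, and `tumor_fraction` is a hypothetical helper.

```python
import numpy as np


def tumor_fraction(mask):
    """Return the fraction of pixels marked as tumor (nonzero) in a 2D mask."""
    mask = np.asarray(mask)
    if mask.ndim != 2:
        raise ValueError(f"expected a 2D mask, got shape {mask.shape}")
    return float(np.count_nonzero(mask)) / mask.size


# Synthetic mask: the top-left quarter is annotated as tumor.
demo_mask = np.zeros((100, 200), dtype=np.uint8)
demo_mask[:50, :100] = 255

fraction = tumor_fraction(demo_mask)
```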

IV. Patient IDs

For both datasets, a csv file is available that associates each file with its corresponding (anonymised) patient. For the NucSeg dataset, the resolutions of the WSIs from which the patches were extracted are also listed. In version 2 of the dataset, we changed the patient IDs of TumSeg to remove misleading names. The image-patient correspondence is unchanged; only the names are updated.
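
The patient csv files can be used to group files by patient, for example to keep all images of one patient in the same split. The sketch below is illustrative only: the column names `filename` and `patient_id` and the rows are placeholders, not the actual content of NucSeg_PatientID.csv or TumSeg_PatientID.csv, so the real headers must be checked first.

```python
import csv
import io
from collections import defaultdict


def group_files_by_patient(csv_text, file_col, patient_col):
    """Map each patient id to the list of files that belong to that patient."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[patient_col]].append(row[file_col])
    return dict(groups)


# Placeholder csv content (hypothetical column names and values).
demo_csv = "filename,patient_id\nimage_a,P1\nimage_b,P1\nimage_c,P2\n"

groups = group_files_by_patient(demo_csv, "filename", "patient_id")
```

For a real file, the text can be obtained with `open(path, newline="").read()` and the two column names replaced by those found in the csv header.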

V. Funding Notes

Lucas Sancéré and Kasia Bozek were supported by the North Rhine-Westphalia return program (311-8.03.03.02-147635) and hosted by the Center for Molecular Medicine Cologne. Johannes Brägelmann and Carina Lorenz received funding from a Mildred Scheel Nachwuchszentrum Grant (70113307) by the German Cancer Aid (Deutsche Krebshilfe).

Files

Files (14.5 GB)

Name	MD5	Size
NucSeg.zip	e7c716a918ac36716cfc6ae063188bd4	6.9 GB
NucSeg_OriginalFormat.zip	8cd5ab08d0b6a14e3e2f1ca95f41e857	312.1 MB
NucSeg_PatientID.csv	fec9565682e5a762ccbbd07c738f1bb6	475 Bytes
TumSeg.zip	e2d3f7480e647799a372aafa23e7e491	1.3 GB
TumSeg_PatientID.csv	fabbf76be961107c0d1b4f27e0af8660	3.8 kB
UncuratedSCC.zip	74cfce7a95f5338bf6f8664f328e22b0	6.0 GB
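
The MD5 checksums listed above can be used to check that a download is complete. The following minimal sketch uses only the Python standard library. The helper names `md5_of_file` and `verify_md5` are hypothetical, and the self-check runs on a temporary file rather than on a real archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected.strip().lower()


# Self-check on a small temporary file, so the sketch runs without a download.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    tmp_path = tmp.name
self_check = verify_md5(tmp_path, hashlib.md5(b"hello").hexdigest())
os.remove(tmp_path)
```

After downloading, a call such as `verify_md5("NucSeg_PatientID.csv", "fec9565682e5a762ccbbd07c738f1bb6")` should return True if the file is intact. Chunked reading keeps memory use low for the multi-gigabyte archives.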

Additional details

Repository URL: https://github.com/bozeklab/histo-miner
Programming language: Python
Development Status: Active

	All versions	This version
Views	315	179
Downloads	394	224
Data volume	1.3 TB	626.2 GB
