Published July 16, 2025 | Version v2
Dataset Open

Histo-Miner: NucSeg and TumSeg datasets

  • 1. University of Cologne
  • 2. Department for Dermatology, University Hospital Cologne

Description

I. General

Training datasets used for the Histo-Miner paper.

2 datasets were used to train SCC-Hovernet: 

  • UncuratedSCC
  • NucSeg

1 dataset was used to train SCC-Segmenter:

  • TumSeg

II. NucSeg Datasets

The dataset is available here: NucSeg.zip.

The dataset consists of annotated H&E patches in which the cell nuclei are segmented and classified. 47,392 nuclei were labeled in total (3,135 granulocytes, 12,263 lymphocytes, 3,271 plasma cells, 11,526 stromal cells, 17,197 tumor cells). The dataset is composed of 6,816 patches of 560x560 pixels with 70% overlap in a 5D numpy array according to the Hovernet data format requirements. The patches come from 24 WSIs of 20 cSCC patients. The image resolutions are a mix of 40x and 20x (see IV. Patient IDs for more information). The channels of the arrays are [RGB, inst, type] where:

  • 'RGB' is the 3-channel raw image
  • 'inst' is the instance segmentation ground truth: every pixel value ranges from 0 to N, where 0 is background and N is the number of nuclear instances
  • 'type' is the nuclear type ground truth: every pixel value ranges from 0 to K, where 0 is background and K is the number of classes.
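As a minimal sketch of how the channels described above could be separated, assume each patch is a single array of shape (H, W, 5) with the channel axis last. That shape, the helpers `split_patch` and `nuclei_per_class`, and the synthetic patch are assumptions made for illustration; they are not taken from the dataset files or from the Histo-Miner code.

```python
import numpy as np


def split_patch(patch):
    """Split an (H, W, 5) array into RGB image, instance map and type map."""
    if patch.ndim != 3 or patch.shape[-1] != 5:
        raise ValueError(f"expected shape (H, W, 5), got {patch.shape}")
    rgb = patch[..., :3].astype(np.uint8)    # raw image, 3 channels
    inst = patch[..., 3].astype(np.int32)    # 0 = background, 1..N = nuclei
    types = patch[..., 4].astype(np.int32)   # 0 = background, 1..K = classes
    return rgb, inst, types


def nuclei_per_class(inst, types):
    """Count nuclear instances per class id, assigning one class per instance."""
    counts = {}
    for inst_id in np.unique(inst):
        if inst_id == 0:  # skip background
            continue
        # Majority class among the pixels of this instance.
        class_id = int(np.bincount(types[inst == inst_id]).argmax())
        counts[class_id] = counts.get(class_id, 0) + 1
    return counts


# Synthetic stand-in for one patch (a real one would be read with np.load).
patch = np.zeros((560, 560, 5), dtype=np.float64)
patch[10:20, 10:20, 3], patch[10:20, 10:20, 4] = 1, 2   # nucleus 1, class 2
patch[40:50, 40:50, 3], patch[40:50, 40:50, 4] = 2, 5   # nucleus 2, class 5

rgb, inst, types = split_patch(patch)
class_counts = nuclei_per_class(inst, types)
print(rgb.shape, int(inst.max()), class_counts)
```

Counting instances through the 'inst' channel rather than counting pixels in the 'type' channel gives per-class nucleus counts that do not depend on nucleus size.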

 

This format fits the training of Hovernet-like architectures but is not convenient for visualization or for training other models. For this reason, a more conventional format is also available for this dataset: NucSeg_OriginalFormat.zip. In this case the 'RGB', 'inst' and 'type' data are saved in numpy format in separate folders (RawImages, InstanceMaps, ClassMaps). For instance, the user can apply the functions save2dnpy_2png and save3dnpy_2png from histo_miner.utils.filemanagement to generate PNG files from these arrays. This version of the dataset contains 1,707 non-overlapping H&E patches of 256x256 pixels.
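A minimal sketch of how the three folders could be paired per patch before visualization or training. It assumes that corresponding files share the same file name across RawImages, InstanceMaps and ClassMaps and use the .npy extension; both points are assumptions, and `pair_patch_files` is a hypothetical helper, not part of histo_miner.

```python
import tempfile
from pathlib import Path

FOLDERS = ("RawImages", "InstanceMaps", "ClassMaps")


def pair_patch_files(root):
    """Return {stem: {folder: path}} for stems present in all three folders."""
    by_folder = {
        name: {p.stem: p for p in (Path(root) / name).glob("*.npy")}
        for name in FOLDERS
    }
    # Keep only patches that have an image, an instance map and a class map.
    common = set.intersection(*(set(files) for files in by_folder.values()))
    return {
        stem: {name: by_folder[name][stem] for name in FOLDERS}
        for stem in sorted(common)
    }


# Demonstration on an empty mock directory tree (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for folder in FOLDERS:
        (Path(tmp) / folder).mkdir()
        (Path(tmp) / folder / "patch_0001.npy").touch()
    (Path(tmp) / "RawImages" / "patch_0002.npy").touch()  # image without maps
    pairs = pair_patch_files(tmp)
    paired_stems = list(pairs)
    paired_folders = sorted(pairs["patch_0001"])

print(paired_stems)
```

Each returned path can then be read with numpy.load, so incomplete triplets are filtered out before any array is loaded.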

 

As described in the paper, the SCC-Hovernet model was first pretrained on an uncurated dataset, meaning that its segmentation and cell classification contain several errors that have not been quantified. It is not recommended to use this dataset for training on its own, only for pre-training as a first step before training on another dataset. This uncurated dataset is available here: UncuratedSCC.zip. The file organization follows that of NucSeg.

III. TumSeg Dataset

The dataset is available here: TumSeg.zip.

The dataset consists of pairs of raw WSI images and binary segmentation images in which the tumor region is annotated. 144 WSIs of 125 cSCC patients were collected for this dataset. The WSIs are downsampled to a resolution of 1.25x.
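To illustrate what a 1.25x resolution implies, here is a small sketch of the downsampling arithmetic. The native magnifications (40x, 20x) and the slide dimensions are hypothetical examples, since this record does not state the native resolution of the TumSeg WSIs; the two helper functions are likewise illustrative.

```python
def downsample_factor(native_mag, target_mag=1.25):
    """Linear factor by which each image side shrinks."""
    if native_mag < target_mag:
        raise ValueError("target magnification exceeds native magnification")
    return native_mag / target_mag


def downsampled_size(width, height, factor):
    """Pixel dimensions after downsampling, rounded down."""
    return int(width // factor), int(height // factor)


factor_40x = downsample_factor(40)   # 40 / 1.25
factor_20x = downsample_factor(20)   # 20 / 1.25
# Hypothetical 80,000 x 60,000 pixel slide scanned at 40x.
size = downsampled_size(80_000, 60_000, factor_40x)
print(factor_40x, factor_20x, size)
```

Under these assumptions, a slide scanned at 40x shrinks by a factor of 32 per side, which is what makes whole-slide tumor segmentation tractable on a single image.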

IV. Patient IDs

For both datasets, a csv file is available to associate each file with its corresponding (anonymised) patient. For the NucSeg dataset, the resolutions of the WSIs from which the patches were extracted are also given. In version 2 of the dataset, we changed the patient IDs of TumSeg to remove misleading names. The image-to-patient correspondence is unchanged; only the names are updated.
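One common use of such a mapping is to split data at the patient level, so that images from the same patient never appear in both training and validation sets. The sketch below assumes CSV columns named `filename` and `patient_id`; these column names, the demo rows and both helper functions are hypothetical, and the actual CSV header may differ.

```python
import csv
import io
import random
from collections import defaultdict


def files_by_patient(csv_file, file_col="filename", patient_col="patient_id"):
    """Group file names by patient ID from an open CSV file object."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[patient_col]].append(row[file_col])
    return dict(groups)


def split_by_patient(groups, val_fraction=0.2, seed=0):
    """Split files so that each patient lands entirely in one subset."""
    patients = sorted(groups)
    random.Random(seed).shuffle(patients)
    n_val = max(1, round(len(patients) * val_fraction))
    val_patients = set(patients[:n_val])
    train = [f for p in patients if p not in val_patients for f in groups[p]]
    val = [f for p in patients if p in val_patients for f in groups[p]]
    return train, val


# Made-up rows standing in for the real CSV file.
demo_csv = io.StringIO(
    "filename,patient_id\n"
    "wsi_a.png,P01\nwsi_b.png,P01\nwsi_c.png,P02\nwsi_d.png,P03\nwsi_e.png,P04\n"
)
groups = files_by_patient(demo_csv)
train_files, val_files = split_by_patient(groups)
print(len(train_files), len(val_files))
```

With the real file, `demo_csv` would be replaced by `open(path, newline="")` and the column arguments adjusted to the actual header.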

V. Funding Notes

Lucas Sancéré and Kasia Bozek were supported by the North Rhine-Westphalia return program (311-8.03.03.02-147635) and hosted by the Center for Molecular Medicine Cologne. Johannes Brägelmann and Carina Lorenz received funding from a Mildred Scheel Nachwuchszentrum Grant 70113307 by the German Cancer Aid (Deutsche Krebshilfe).

Files

NucSeg.zip

Files (14.5 GB)

Name Size
md5:e7c716a918ac36716cfc6ae063188bd4  6.9 GB
md5:8cd5ab08d0b6a14e3e2f1ca95f41e857  312.1 MB
md5:fec9565682e5a762ccbbd07c738f1bb6  475 Bytes
md5:e2d3f7480e647799a372aafa23e7e491  1.3 GB
md5:fabbf76be961107c0d1b4f27e0af8660  3.8 kB
md5:74cfce7a95f5338bf6f8664f328e22b0  6.0 GB
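The md5 values listed above can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library; `md5_of_file` and `matches_checksum` are illustrative helpers, and the demo runs on a temporary file because this listing does not show which checksum belongs to which archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Demonstration on a small temporary file instead of a multi-GB archive.
payload = b"hello"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    good = matches_checksum(tmp.name, "md5:" + hashlib.md5(payload).hexdigest())
    bad = matches_checksum(tmp.name, "md5:" + "0" * 32)
finally:
    os.remove(tmp.name)

print(good, bad)
```

Reading in chunks keeps memory use constant, which matters for archives of several gigabytes.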

Additional details

Software

Repository URL
https://github.com/bozeklab/histo-miner
Programming language
Python
Development Status
Active