There is a newer version of the record available.

Published June 20, 2025 | Version v1
Dataset Open

IGNITE data toolkit: a tissue and cell-level annotated H&E and PD-L1 histopathology image dataset in non-small cell lung cancer

Description

Please refer to the newest version of this Zenodo repository by following the link under 'Versions' on the right-hand side of this page.

We introduce the IGNITE data toolkit, a multi-stain, multi-centric, and multi-scanner dataset of annotated non-small cell lung cancer (NSCLC) whole-slide images (WSIs). We publicly release 887 fully annotated regions of interest (ROI) from 155 unique patients across three complementary tasks:

  1. Multi-class semantic segmentation of tissue compartments in H&E-stained slides, with 16 classes spanning primary and metastatic NSCLC
  2. Nuclei detection in PD-L1 stained immunohistochemistry (IHC)
  3. Positive tumor cell detection in PD-L1 IHC

Technical info

Repository content

This repository contains five zip files, each of which has the following directory structure when unpacked:

.
└── {images,annotations,models,inference,figures}/
    ├── he/               # Files pertaining to the H&E tissue compartment segmentation dataset...
    └── pdl1/
        ├── nuclei/    # ... the PD-L1 IHC nuclei detection dataset...
        └── pdl1/       # ... and the PD-L1 positive tumor cell detection dataset
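The layout above places each dataset in one subdirectory per unpacked archive. A minimal sketch of that mapping follows. The helper name `dataset_dir` and the `root` argument are hypothetical, and the short dataset keys (`he`, `nuclei`, `pdl1`) are borrowed from the file naming scheme rather than confirmed by the toolkit's code.

```python
from pathlib import PurePosixPath

# Top-level directories, one per unpacked archive (taken from the tree above).
ARCHIVES = ("images", "annotations", "models", "inference", "figures")

# Subdirectory of each dataset inside an archive (taken from the tree above).
# The keys are an assumption: they reuse the <dataset> token of the file IDs.
DATASET_SUBDIRS = {
    "he": "he",
    "nuclei": "pdl1/nuclei",
    "pdl1": "pdl1/pdl1",
}


def dataset_dir(root, archive, dataset):
    """Return the directory holding `dataset` files inside an unpacked archive."""
    if archive not in ARCHIVES:
        raise ValueError(f"unknown archive: {archive!r}")
    if dataset not in DATASET_SUBDIRS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return PurePosixPath(root) / archive / DATASET_SUBDIRS[dataset]


# 'data' as the root directory is an assumption for illustration.
print(dataset_dir("data", "images", "nuclei"))  # data/images/pdl1/nuclei
```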

The five zip files contain the following:

  • 'annotations.zip' contains single-channel PNG masks for the H&E tissue compartment segmentation dataset (the label map is under 'he_label_map.json'); the zip file also contains MS COCO-formatted JSON files for the nuclei/PD-L1 positive tumor cell detection datasets
  • 'figures.zip' contains neatly visualized inference and evaluation metric figures from our paper
  • 'images.zip' contains PNG images of the ROIs released in the toolkit
  • 'inference.zip' contains the raw inference outputs of the models for the respective datasets
  • 'models.zip' contains the weights for our final models used for the technical validation of the toolkit
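As a rough illustration of working with the detection annotations, the sketch below groups MS COCO annotations by image. It assumes the JSON files follow the standard COCO layout, with top-level `images` and `annotations` lists linked through `image_id`. The function names and the example dictionary (including its file names and boxes) are made up for illustration and are not taken from the released files.

```python
import json


def load_coco(path):
    """Read a COCO-style JSON file from disk (the path is supplied by the caller)."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def group_coco_annotations(coco):
    """Map each image file name to the list of annotations that reference it."""
    id_to_name = {img["id"]: img["file_name"] for img in coco["images"]}
    # Start every image with an empty list so unannotated images are kept.
    grouped = {name: [] for name in id_to_name.values()}
    for ann in coco["annotations"]:
        grouped[id_to_name[ann["image_id"]]].append(ann)
    return grouped


# Made-up example in the standard COCO layout, for illustration only.
EXAMPLE = {
    "images": [
        {"id": 1, "file_name": "patient1_nuclei_roi1.png"},
        {"id": 2, "file_name": "patient2_nuclei_roi1.png"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [5, 5, 8, 8]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [20, 14, 8, 8]},
    ],
    "categories": [{"id": 1, "name": "nucleus"}],
}

grouped = group_coco_annotations(EXAMPLE)
for name, anns in grouped.items():
    print(name, len(anns))
```

Keeping images without annotations in the result makes empty ROIs visible rather than silently dropping them.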

File ID nomenclature

All patients were assigned a unique anonymous patient ID incrementing from 1. Images/masks are named according to the patient/dataset/ROI they originate from, following the naming scheme <anonymous_patient_id>_<dataset>_<roi_id>.<suffix>, e.g. 'patient1_he_roi1.png'. Note that some patients occur in multiple datasets, but they always keep the same anonymous patient ID. However, their ROIs are always different across datasets, e.g. 'patient1_he_roi1.png', 'patient1_nuclei_roi1.png' and 'patient1_pdl1_roi1.png' all refer to separate, non-overlapping regions.
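The naming scheme can be split back into its parts with a regular expression. The sketch below is one possible parser, not part of the toolkit. It assumes that patient and ROI IDs always look like `patient<N>` and `roi<N>`, and that the dataset token is one of `he`, `nuclei` or `pdl1`, as in the examples above. The names `FILE_ID`, `FileId` and `parse_file_id` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern for <anonymous_patient_id>_<dataset>_<roi_id>[.<suffix>].
# The exact ID shapes are an assumption based on the examples in the text.
FILE_ID = re.compile(
    r"^(?P<patient_id>patient\d+)_"
    r"(?P<dataset>he|nuclei|pdl1)_"
    r"(?P<roi_id>roi\d+)"
    r"(?:\.(?P<suffix>\w+))?$"
)


class FileId(NamedTuple):
    patient_id: str
    dataset: str
    roi_id: str
    suffix: Optional[str]  # None when the name has no file extension


def parse_file_id(name):
    """Split a file name such as 'patient1_he_roi1.png' into its components."""
    match = FILE_ID.match(name)
    if match is None:
        raise ValueError(f"not a valid file ID: {name!r}")
    return FileId(**match.groupdict())


print(parse_file_id("patient1_he_roi1.png"))
print(parse_file_id("patient1_pdl1_roi1"))
```

The suffix is optional so that the same parser also accepts extension-free ROI names.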

Dataset overview

Lastly, we include a 'data_overview.csv' file that documents metadata per ROI. We provide a table below that lists what metadata each column contains.

Column | Contents
'patient_id' | Unique anonymous patient ID; see 'File ID nomenclature'.
'roi_id' | ROI ID; see 'File ID nomenclature'.
'name' | Full name of the ROI, e.g. 'patient1_he_roi1'.
'task' | Dataset label: 'he_tissue_segmentation', 'nuclei_detection' or 'pdl1_detection'.
'source' | (Hospital) data source: 'rumc', 'scdc' or 'tcga'.
'specimen_type' | WSI specimen type: 'biopsy', 'resection' or 'tissue_micro_array'.
'organ' | Organ the tissue originated from.
'histological_subtype' | NSCLC subtype of the parent WSI (not necessarily of the ROI, as it may not contain tumor cells).
'stain' | 'H&E' or 'PDL1_{monoclone}'.
'scanner' | Scanner used to digitize the image.
'image_path' | Image path relative to 'data/'.
'annotation_path' | Annotation path relative to 'data/'.
'shape' | (width,height) shape of the ROI. Important caveat: for ROIs released with non-annotated context borders, this shape refers only to the annotated part of the image.
'area_mm2' | Annotated ROI area in mm^2.
'split' | Dataset split: train/validation/test.
'validation_fold' | For the H&E tissue compartment segmentation dataset, the validation fold in 5-fold cross-validation.
'original_tcga_id' | For cases originating from the TCGA dataset, their original TCGA ID.
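To show how these columns might be used, the sketch below reads the overview with Python's `csv` module and selects ROIs by task and split. It is an illustration under assumptions: the sample rows are invented, only a few columns are shown, and the 'shape' cell is assumed to be stored as the literal text '(width,height)'. The function names are hypothetical.

```python
import csv
import io


def read_overview(fileobj):
    """Read the overview CSV into a list of dicts keyed by column name."""
    return list(csv.DictReader(fileobj))


def select_rois(rows, task, split):
    """Return the ROI names belonging to one task and one dataset split."""
    return [r["name"] for r in rows if r["task"] == task and r["split"] == split]


def parse_shape(text):
    """Parse a '(width,height)' cell into two ints (the cell format is assumed)."""
    width, height = text.strip().strip("()").split(",")
    return int(width), int(height)


# Invented rows with a subset of the documented columns, for illustration only.
SAMPLE = '''name,task,split,shape
patient1_he_roi1,he_tissue_segmentation,train,"(512,256)"
patient1_nuclei_roi1,nuclei_detection,test,"(256,256)"
patient2_he_roi1,he_tissue_segmentation,test,"(1024,768)"
'''

rows = read_overview(io.StringIO(SAMPLE))
print(select_rois(rows, "he_tissue_segmentation", "test"))  # ['patient2_he_roi1']
print(parse_shape(rows[0]["shape"]))  # (512, 256)
```

For the real file, `read_overview(open("data_overview.csv", newline="", encoding="utf-8"))` would be the equivalent call, assuming the file sits in the working directory.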

 

Files

Files (8.2 GB)

Checksum Size
md5:723b37e06f6b3765fa04165d7c76134a 30.5 MB
md5:e534225a62c31a6ed3352fb6481ad9e6 255.3 kB
md5:b0c735446f1e378004c05bee7342f464 330.1 MB
md5:a3dad19173c194cd6be569515a97078c 413 Bytes
md5:1a8bf1f245d1c7c7405f8f869b04cacc 5.7 GB
md5:188f00b32fca4ddea72608646489c749 7.1 MB
md5:31a34306d6045d5ca80becb90b5d51b5 2.1 GB
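A downloaded file can be checked against its listed MD5 checksum before unpacking. The following is a minimal sketch using Python's standard `hashlib`. The function names are hypothetical, and the caller has to supply both the local path and the checksum string that belongs to that file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Chunked reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_checksum("images.zip", "md5:<hex digest from the list>")` returns `True` only if the download is intact; the file name here is just an example.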

Additional details

Funding

Dutch Research Council
Predicting Lung Cancer Immunotherapy Response. It’s personal. 18388

Software

Repository URL
https://github.com/DIAGNijmegen/ignite-data-toolkit
Programming language
Python
Development Status
WIP (work in progress)