IGNITE data toolkit: a tissue and cell-level annotated H&E and PD-L1 histopathology image dataset in non-small cell lung cancer
Creators
- Spronck, Joey Matheus Antonius (Project member)1
- van Eekelen, Leander (Project member)1
- van Midden, Dominique (Project manager)1
- Bogaerts, Joep (Project member)1
- Tessier, Leslie (Project member)1
- Dechering, Valerie (Project member)1
- Demirel-Andishmand, Muradije (Project member)1
- Silva de Souza, Gabriel (Project member)1
- Nemeth, Roland (Project member)1
- Munari, Enrico (Project member)2
- Bogina, Giuseppe (Project member)2
- Girolami, Ilaria (Project member)3
- Eccher, Albino (Project member)4
- Acs, Balazs (Project member)5
- Boyaci, Ceren (Project member)5
- Klubíčková, Natálie (Project member)6
- Looijen-Salamon, Monika (Project member)1
- Vos, Shoko (Project member)1
- Ciompi, Francesco (Project leader)1
Description
Please refer to the newest version of this Zenodo repository by following the link under 'Versions' on the right-hand side of this webpage.
We introduce the IGNITE data toolkit, a multi-stain, multi-centric, and multi-scanner dataset of annotated non-small cell lung cancer (NSCLC) whole-slide images (WSIs). We publicly release 887 fully annotated regions of interest (ROIs) from 155 unique patients across three complementary tasks:
- Multi-class semantic segmentation of tissue compartments in H&E-stained slides, with 16 classes spanning primary and metastatic NSCLC
- Nuclei detection in PD-L1 stained immunohistochemistry (IHC)
- Positive tumor cell detection in PD-L1 IHC
Technical info
Repository content
This repository contains five zip files, each of which has the following directory structure when unpacked:
.
└── {images,annotations,models,inference,figures}/
    ├── he/            # Files pertaining to the H&E tissue compartment segmentation dataset...
    └── pdl1/
        ├── nuclei/    # ... the PD-L1 IHC nuclei detection dataset...
        └── pdl1/      # ... and the PD-L1 positive tumor cell detection dataset
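The layout above can be turned into a small path helper. This is a minimal sketch, not part of the toolkit: the function name `dataset_dir`, the short dataset keys (`he`, `nuclei`, `pdl1`) and the root folder are assumptions chosen for illustration, and the sketch only resolves directories, since the files inside each folder are not listed here.

```python
from pathlib import Path

# Top-level folders, one per unpacked zip file (taken from the tree above).
ARCHIVE_DIRS = {"images", "annotations", "models", "inference", "figures"}

# Sub-directory of each dataset inside an unpacked archive (taken from the tree above).
# The short keys are hypothetical labels used only in this sketch.
DATASET_SUBDIRS = {
    "he": Path("he"),
    "nuclei": Path("pdl1") / "nuclei",
    "pdl1": Path("pdl1") / "pdl1",
}


def dataset_dir(root, archive, dataset):
    """Return the directory holding `dataset` files of the given `archive` kind."""
    if archive not in ARCHIVE_DIRS:
        raise ValueError(f"unknown archive folder: {archive!r}")
    if dataset not in DATASET_SUBDIRS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return Path(root) / archive / DATASET_SUBDIRS[dataset]


# Example: where the nuclei detection images would live under a root folder 'data'.
nuclei_images = dataset_dir("data", "images", "nuclei")
```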
The five zip files contain the following:
- 'annotations.zip' contains single-channel PNG masks for the H&E tissue compartment segmentation dataset (the label map is under 'he_label_map.json'); the zip file also contains MS COCO-formatted JSON files for the nuclei/PD-L1 positive tumor cell detection datasets
- 'figures.zip' contains neatly visualized inference and evaluation metric figures from our paper
- 'images.zip' contains PNG images of the ROIs released in the toolkit
- 'inference.zip' contains raw inference of the models for the respective datasets
- 'models.zip' contains the weights for our final models used for the technical validation of the toolkit
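For the detection datasets, the MS COCO-formatted JSON files can be inspected with the standard library alone. The sketch below assumes the usual COCO top-level keys (`images` and `annotations`, linked through `image_id`); whether the toolkit's files use exactly these keys should be checked against the actual annotations. The helper names and the in-memory example are illustrative only.

```python
import json
from collections import Counter


def load_coco(path):
    """Read a COCO-style JSON annotation file into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_annotations_per_image(coco):
    """Return {file_name: number of annotations} for a COCO-style dict."""
    id_to_name = {img["id"]: img["file_name"] for img in coco["images"]}
    # Start every image at zero so ROIs without annotated cells are still reported.
    counts = Counter({name: 0 for name in id_to_name.values()})
    for ann in coco["annotations"]:
        counts[id_to_name[ann["image_id"]]] += 1
    return dict(counts)


# Made-up, minimal COCO-style structure used only to demonstrate the helper.
example = {
    "images": [
        {"id": 1, "file_name": "patient1_nuclei_roi1.png"},
        {"id": 2, "file_name": "patient1_nuclei_roi2.png"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},
        {"id": 11, "image_id": 1, "category_id": 1},
    ],
}
per_image = count_annotations_per_image(example)
```

On a real file, `count_annotations_per_image(load_coco(path))` would give the same kind of summary, with `path` pointing at one of the JSON files in the unpacked 'annotations.zip'.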
File ID nomenclature
All patients were assigned a unique anonymous patient ID incrementing from 1. Images/masks are named according to the patient/dataset/ROI they originate from, following the naming scheme <anonymous_patient_id>_<dataset>_<roi_id>.<suffix>, e.g. 'patient1_he_roi1.png'. Note that some patients occur in multiple datasets, but they always keep the same anonymous patient ID. However, their ROIs always differ across datasets, e.g. 'patient1_he_roi1.png', 'patient1_nuclei_roi1.png' and 'patient1_pdl1_roi1.png' all refer to separate, non-overlapping regions.
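The naming scheme can be parsed with a regular expression. This is a sketch under assumptions: the dataset tokens `he`, `nuclei` and `pdl1` and the `patient<N>` / `roi<N>` forms are inferred from the example file names above, and `parse_roi_name` and `RoiName` are hypothetical helpers, not part of the toolkit.

```python
import re
from typing import NamedTuple


class RoiName(NamedTuple):
    patient_id: str
    dataset: str
    roi_id: str
    suffix: str


# <anonymous_patient_id>_<dataset>_<roi_id>.<suffix>, with tokens inferred from the examples.
_ROI_PATTERN = re.compile(
    r"^(?P<patient_id>patient\d+)_(?P<dataset>he|nuclei|pdl1)_(?P<roi_id>roi\d+)\.(?P<suffix>\w+)$"
)


def parse_roi_name(filename):
    """Split a file name such as 'patient1_he_roi1.png' into its components."""
    match = _ROI_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"file name does not follow the naming scheme: {filename!r}")
    return RoiName(**match.groupdict())


parsed = parse_roi_name("patient1_pdl1_roi1.png")
```

Because the patient ID is stable across datasets, grouping parsed names by `patient_id` is one way to keep all ROIs of a patient in the same split when building custom partitions.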
Dataset overview
Lastly, we include a 'data_overview.csv' file that documents metadata per ROI. We provide a table below that lists what metadata each column contains.
| Column | Contents |
|---|---|
| ‘patient_id’ | Unique anonymous patient ID. See ‘File ID nomenclature’. |
| ‘roi_id’ | ROI ID, see ‘File ID nomenclature’. |
| ‘name’ | Full name of ROI, e.g. ‘patient1_he_roi1’ |
| ‘task’ | Dataset label: ‘he_tissue_segmentation’, ‘nuclei_detection’ or ‘pdl1_detection’ |
| ‘source’ | (Hospital) data source: ‘rumc’, ‘scdc’ or ‘tcga’ |
| ‘specimen_type’ | WSI specimen type: ‘biopsy’, ‘resection’ or ‘tissue_micro_array’ |
| ‘organ’ | Organ the tissue originated from |
| ‘histological_subtype’ | NSCLC subtype of the parent WSI (not necessarily of the ROI, as it may not contain tumor cells). |
| ‘stain’ | ‘H&E’ or ‘PDL1_{monoclone}’ |
| ‘scanner’ | Scanner used to digitize the image |
| ‘image_path’ | Image path relative to ‘data/’ |
| ‘annotation_path’ | Annotation path relative to ‘data/’ |
| ‘shape’ | (width,height) shape of the ROI. Important caveat: for ROIs released with non-annotated context borders, this shape refers only to the annotated part of the image. |
| ‘area_mm2’ | Annotated ROI area in mm^2 |
| ‘split’ | Dataset split: train/validation/test |
| ‘validation_fold’ | For the H&E tissue compartment segmentation dataset, the validation fold within the 5-fold cross-validation |
| ‘original_tcga_id’ | For cases originating from the TCGA dataset, their original TCGA ID. |
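The overview file can be used to select ROIs by task and split. The sketch below uses only the standard library and assumes that 'data_overview.csv' is comma-delimited with a header row matching the column names in the table. The sample rows, including the form of the `patient_id` and `split` values, are made up for illustration.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for 'data_overview.csv'; the rows are invented for this sketch.
SAMPLE_CSV = """patient_id,roi_id,name,task,split
patient1,roi1,patient1_he_roi1,he_tissue_segmentation,train
patient1,roi1,patient1_nuclei_roi1,nuclei_detection,test
patient2,roi1,patient2_he_roi1,he_tissue_segmentation,train
"""


def read_overview(csv_file):
    """Read all rows of the overview table as dictionaries keyed by column name."""
    return list(csv.DictReader(csv_file))


def count_by_task_and_split(rows):
    """Count ROIs for every (task, split) combination."""
    return Counter((row["task"], row["split"]) for row in rows)


def select_names(rows, task, split):
    """Return the ROI names belonging to one task and one split."""
    return [row["name"] for row in rows if row["task"] == task and row["split"] == split]


rows = read_overview(io.StringIO(SAMPLE_CSV))
counts = count_by_task_and_split(rows)
train_he = select_names(rows, "he_tissue_segmentation", "train")
```

For the real file, the same functions apply after opening it with `open("data_overview.csv", newline="")` and passing the file object to `read_overview`.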
Files
(8.2 GB)
| MD5 checksum | Size |
|---|---|
| 723b37e06f6b3765fa04165d7c76134a | 30.5 MB |
| e534225a62c31a6ed3352fb6481ad9e6 | 255.3 kB |
| b0c735446f1e378004c05bee7342f464 | 330.1 MB |
| a3dad19173c194cd6be569515a97078c | 413 Bytes |
| 1a8bf1f245d1c7c7405f8f869b04cacc | 5.7 GB |
| 188f00b32fca4ddea72608646489c749 | 7.1 MB |
| 31a34306d6045d5ca80becb90b5d51b5 | 2.1 GB |
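The MD5 checksums listed for the files can be used to verify a download. A minimal sketch with Python's `hashlib` follows; the helper names are hypothetical, and the file is read in chunks so that multi-gigabyte archives do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest matches the listed checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A call such as `verify_download(path_to_archive, listed_checksum)` returns `False` for a corrupted or incomplete download, in which case the file should be fetched again.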
Additional details
Funding
- Dutch Research Council: Predicting Lung Cancer Immunotherapy Response. It’s personal. 18388
Software
- Repository URL: https://github.com/DIAGNijmegen/ignite-data-toolkit
- Programming language: Python
- Development status: WIP