BEETLE: A multicentric dataset for training and benchmarking breast cancer segmentation in H&E slides
Creators
- Lems, Carlijn M. (1)
- Tessier, Leslie (1)
- Bokhorst, John-Melle (1)
- van Rijthoven, Mart (1)
- Aswolinskiy, Witali (1)
- Pozzi, Matteo (2, 3)
- Klubíčková, Natálie (4, 5)
- Dintzis, Suzanne (6)
- Campora, Michela (7)
- Balkenhol, Maschenka (1)
- Bult, Peter (1)
- Spronck, Joey Matheus Antonius (1)
- Detone, Thomas (7)
- Barbareschi, Mattia (7, 3)
- Munari, Enrico (8)
- Bogina, Giuseppe (9)
- Wesseling, Jelle (10, 11)
- Lips, Esther H. (12)
- Ciompi, Francesco (1)
- Meeuwsen, Frederique (1)
- van der Laak, Jeroen (1, 13)
Affiliations
1. Radboud University Medical Center
2. Fondazione Bruno Kessler
3. University of Trento
4. Biopticka Laborator (Czechia)
5. Charles University
6. University of Washington
7. Ospedale Santa Chiara
8. Azienda Ospedaliera Universitaria Integrata Verona
9. Ospedale Sacro Cuore Don Calabria
10. Leiden University Medical Center
11. Netherlands Cancer Institute - Antoni van Leeuwenhoek
12. The Netherlands Cancer Institute
13. Linköping University
Description
The BrEast cancEr hisTopathoLogy sEgmentation (BEETLE) dataset provides a development set and an external evaluation set for multiclass semantic segmentation of H&E-stained breast cancer whole-slide images (WSIs), covering all molecular subtypes and histological grades.
- Development set: 587 biopsies and resections collected from three collaborating clinical centers and two public datasets, digitized using seven scanners. Pixel-level annotations are available for four tissue classes: invasive epithelium, non-invasive epithelium, necrosis, and other, with particular focus on morphologies underrepresented in existing datasets, such as ductal carcinoma in situ and dispersed lobular tumor cells.
- External evaluation set: 54 biopsies and resections collected from three clinical centers and digitized with three scanners. In addition to the WSIs, 170 densely annotated regions of interest (ROIs) are provided as image tiles. The corresponding pixel-level annotations are not publicly released but are sequestered on the Grand Challenge platform, where submissions are evaluated on a public leaderboard to enable standardized and comparable benchmarking of breast cancer segmentation models.
Technical info
Repository content
This repository contains three zip files:
- 'annotations.zip' - annotations for the development set, provided in three formats:
annotations.zip
├── jsons/ # JSON format with tissue compartments annotated as polygons
├── label_map.json # mapping of pixel values to class labels
├── masks/ # multiresolution TIFF images with pixel-wise class labels
└── xmls/ # XML format with tissue compartments annotated as polygons
- 'images.zip' - images for the development and evaluation sets:
images.zip
├── development/
│ └── wsis/ # whole-slide images for development
└── evaluation/
├── rois/ # PNG images of ROIs for evaluation
└── wsis/ # whole-slide images for evaluation
- 'model.zip' - weights of the final ensemble model used for the technical validation of the dataset.
All data is released at a spacing of ~0.5 µm/pixel. Annotations in TIFF and XML formats are compatible with ASAP 2.1 Nightly. Both XML and JSON files contain the same annotations, but JSON is formatted for compatibility with nnU-Net-for-pathology-v2. The ROI PNG images include surrounding spatial context, allowing models to incorporate neighboring tissue architecture in their predictions, similar to whole-slide inference using a sliding-window approach.
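As an illustration of how the released files fit together, the sketch below loads 'label_map.json' and reads matching low-resolution views of a development WSI and its annotation mask. This is a minimal sketch, assuming the multiresolution TIFFs can be opened with OpenSlide (any pyramidal-TIFF reader, such as tifffile or ASAP's Python bindings, should work similarly) and that 'label_map.json' maps mask pixel values to class names; the file paths are illustrative.

```python
import json

import numpy as np
from openslide import OpenSlide  # assumption: the pyramidal TIFFs are OpenSlide-readable

# Mapping of mask pixel values to class labels (exact JSON layout assumed).
with open("annotations/label_map.json") as f:
    label_map = json.load(f)

wsi = OpenSlide("images/development/wsis/patient1_wsi1.tif")   # illustrative path
mask = OpenSlide("annotations/masks/patient1_wsi1.tif")        # illustrative path

# Read a thumbnail of each file at its coarsest pyramid level.
wsi_level, mask_level = wsi.level_count - 1, mask.level_count - 1
wsi_thumb = np.array(wsi.read_region((0, 0), wsi_level, wsi.level_dimensions[wsi_level]).convert("RGB"))
mask_thumb = np.array(mask.read_region((0, 0), mask_level, mask.level_dimensions[mask_level]).convert("L"))

# Count pixels per annotated class in the mask thumbnail.
for value, count in zip(*np.unique(mask_thumb, return_counts=True)):
    print(label_map.get(str(value), f"value {value}"), count)
```

Because WSIs and masks are released at the same ~0.5 µm/pixel base spacing, corresponding full-resolution patches can be read at level 0 with identical coordinates.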
Public datasets
This dataset includes images from two public sources:
- TCGA-BRCA (The Cancer Genome Atlas Breast Invasive Carcinoma)
- TIGER training set
Note: WSIs from TIGER (including the TCGA-BRCA subset) must be downloaded separately from AWS Open Data. WSIs from TCGA-BRCA not in TIGER are included here. Four TIGER slides (IDs TCGA-AC-A2QH, TCGA-OL-A97C, TCGA-AR-A5QQ, TCGA-E9-A5FL) were excluded from this dataset.
File ID nomenclature
For images from the public TCGA-BRCA and TIGER datasets, we retained the original anonymized filenames provided by the respective sources. For all other images, we assigned each patient a unique anonymous patient ID, incrementing from 1. Because a single patient may have multiple WSIs, WSIs and annotations are named according to patient ID and WSI ID using the convention <anonymous_patient_id>_<wsi_id>.<suffix>, for example 'patient1_wsi1.tif'. For the evaluation set, ROIs are additionally indexed by ROI ID, following the convention <anonymous_patient_id>_<wsi_id>_<roi_id>.<suffix>, for example 'patient1_wsi1_roi1.png'.
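For programmatic use, file names following this convention can be split back into their IDs with a small regular expression. The helper below is a purely illustrative sketch: it assumes IDs follow the 'patientN_wsiN' and 'patientN_wsiN_roiN' patterns shown in the examples and returns None for files that keep their original TCGA-BRCA or TIGER names.

```python
import re
from pathlib import Path

# 'patient<id>_wsi<id>' with an optional '_roi<id>' suffix (evaluation ROIs only).
_NAME_RE = re.compile(r"^patient(?P<patient_id>\d+)_wsi(?P<wsi_id>\d+)(?:_roi(?P<roi_id>\d+))?$")

def parse_beetle_name(path):
    """Return (patient_id, wsi_id, roi_id) for a BEETLE file name, or None otherwise."""
    match = _NAME_RE.match(Path(path).stem)
    if match is None:
        return None  # e.g. original TCGA-BRCA/TIGER file names
    ids = match.groupdict()
    return (int(ids["patient_id"]), int(ids["wsi_id"]),
            int(ids["roi_id"]) if ids["roi_id"] is not None else None)

print(parse_beetle_name("patient1_wsi1.tif"))       # -> (1, 1, None)
print(parse_beetle_name("patient1_wsi1_roi1.png"))  # -> (1, 1, 1)
```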
Dataset overview
Lastly, we include a 'data_overview.csv' file that documents per-WSI metadata. The table below lists the contents of each column; a short loading example follows the table.
| Column | Contents |
| --- | --- |
| 'patient_id' | Unique anonymous patient ID, see 'File ID nomenclature' |
| 'wsi_id' | WSI ID, see 'File ID nomenclature' |
| 'name' | Full name of the WSI, e.g., 'patient1_wsi1' |
| 'source' | Data source (clinical center): 'biopticka', 'jb', 'nki', 'rumc', 'scdc', 'sch', 'tcga', or 'uwmedicine' |
| 'specimen_type' | WSI specimen type: 'biopsy' or 'resection' |
| 'scanner' | Scanner used to digitize the image |
| 'wsi_path' | WSI path, starting with 'images/' (for non-TIGER images) |
| 'annotation_mask_path' | Path to the TIFF mask file (development set only), starting with 'annotations/' |
| 'annotation_xml_path' | Path to the XML annotation file (development set only), starting with 'annotations/' |
| 'annotation_json_path' | Path to the JSON annotation file (development set only), starting with 'annotations/' |
| 'split' | Dataset split: 'development' or 'evaluation' |
| 'validation_fold' | Validation fold of the 5-fold cross-validation (development set only) |
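As a usage sketch, the snippet below loads 'data_overview.csv' with pandas and selects the development-set WSIs held out in one cross-validation fold, pairing each with its annotation mask. Column names follow the table above; the exact values stored in 'validation_fold' (assumed here to include 1) are an assumption.

```python
import pandas as pd

overview = pd.read_csv("data_overview.csv")

# Development-set slides, with fold 1 used as the held-out validation fold
# (fold numbering assumed; check the actual values in 'validation_fold').
development = overview[overview["split"] == "development"]
fold1_val = development[development["validation_fold"] == 1]

# Pair each WSI with its pixel-wise TIFF annotation mask.
pairs = fold1_val[["wsi_path", "annotation_mask_path"]].dropna()
print(f"{len(pairs)} WSI/mask pairs in validation fold 1")
print(pairs.head())
```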
Additional details
Related works
- Is described by: preprint arXiv:2510.02037 (arXiv)
Funding
- Dutch Cancer Society: Computational Pathology for Improved Treatment Decision Making for Breast Cancer Patients - the COMMITMENT project (15386)
Software
- Repository URL: https://github.com/DIAGNijmegen/beetle
- Programming language: Python
- Development status: Active