BEETLE: A multicentric dataset for training and benchmarking breast cancer segmentation in H&E slides
Creators
- Lems, Carlijn M. (1)
- Tessier, Leslie (1)
- Bokhorst, John-Melle (1)
- van Rijthoven, Mart (1)
- Aswolinskiy, Witali (1)
- Pozzi, Matteo (2, 3)
- Klubíčková, Natálie (4, 5)
- Dintzis, Suzanne (6)
- Campora, Michela (7)
- Balkenhol, Maschenka (1)
- Bult, Peter (1)
- Spronck, Joey Matheus Antonius (1)
- Detone, Thomas (7)
- Barbareschi, Mattia (7, 3)
- Munari, Enrico (8)
- Bogina, Giuseppe (9)
- Wesseling, Jelle (10, 11)
- Lips, Esther H. (12)
- Ciompi, Francesco (1)
- Meeuwsen, Frederique (1)
- van der Laak, Jeroen (1, 13)
Affiliations
1. Radboud University Medical Center
2. Fondazione Bruno Kessler
3. University of Trento
4. Biopticka Laborator (Czechia)
5. Charles University
6. University of Washington
7. Ospedale Santa Chiara
8. Azienda Ospedaliera Universitaria Integrata Verona
9. Ospedale Sacro Cuore Don Calabria
10. Leiden University Medical Center
11. Netherlands Cancer Institute - Antoni van Leeuwenhoek
12. The Netherlands Cancer Institute
13. Linköping University
Description
The BrEast cancEr hisTopathoLogy sEgmentation (BEETLE) dataset provides a development set and an external evaluation set for multiclass semantic segmentation of H&E-stained breast cancer whole-slide images (WSIs), covering all molecular subtypes and histological grades.
- Development set: 587 biopsies and resections collected from three collaborating clinical centers and two public datasets, digitized using seven scanners. Pixel-level annotations are available for four tissue classes: invasive epithelium, non-invasive epithelium, necrosis, and other, with particular focus on morphologies underrepresented in existing datasets, such as ductal carcinoma in situ and dispersed lobular tumor cells.
- External evaluation set: 54 biopsies and resections collected from three clinical centers and digitized with three scanners. In addition to the WSIs, 170 densely annotated regions of interest (ROIs) are provided as image tiles. The corresponding pixel-level annotations are not publicly released but are sequestered on the Grand Challenge platform, where submissions are evaluated on a public leaderboard to enable standardized and comparable benchmarking of breast cancer segmentation models.
Technical info
Repository content
This repository contains three zip files:
- 'annotations.zip' - annotations for the development set, provided in three formats:
annotations.zip
├── jsons/ # JSON format with tissue compartments annotated as polygons
├── label_map.json # mapping of pixel values to class labels
├── masks/ # multiresolution TIFF images with pixel-wise class labels
└── xmls/ # XML format with tissue compartments annotated as polygons
- 'images.zip' - images for the development and evaluation sets:
images.zip
├── development/
│ └── wsis/ # whole-slide images for development
└── evaluation/
├── rois/ # PNG images of ROIs for evaluation
└── wsis/ # whole-slide images for evaluation
- 'model.zip' - weights of the final ensemble model used for the technical validation of the dataset.
All data is released at a spacing of ~0.5 µm/pixel. Annotations in TIFF and XML formats are compatible with ASAP 2.1 Nightly. Both XML and JSON files contain the same annotations, but JSON is formatted for compatibility with nnU-Net-for-pathology-v2. The ROI PNG images include surrounding spatial context, allowing models to incorporate neighboring tissue architecture in their predictions, similar to whole-slide inference using a sliding-window approach.
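As an illustration of how the released files fit together, the sketch below loads 'label_map.json' and reads matching low-resolution views of a development WSI and its annotation mask. This is a minimal sketch, assuming the multiresolution TIFFs can be opened with OpenSlide (any pyramidal-TIFF reader, such as tifffile or ASAP's Python bindings, should work similarly) and that 'label_map.json' maps mask pixel values to class names; the file paths are illustrative.

```python
import json

import numpy as np
from openslide import OpenSlide  # assumption: the pyramidal TIFFs are OpenSlide-readable

# Mapping of mask pixel values to class labels (exact JSON layout assumed).
with open("annotations/label_map.json") as f:
    label_map = json.load(f)

wsi = OpenSlide("images/development/wsis/patient1_wsi1.tif")   # illustrative path
mask = OpenSlide("annotations/masks/patient1_wsi1.tif")        # illustrative path

# Read a thumbnail of each file at its coarsest pyramid level.
wsi_level, mask_level = wsi.level_count - 1, mask.level_count - 1
wsi_thumb = np.array(wsi.read_region((0, 0), wsi_level, wsi.level_dimensions[wsi_level]).convert("RGB"))
mask_thumb = np.array(mask.read_region((0, 0), mask_level, mask.level_dimensions[mask_level]).convert("L"))

# Count pixels per annotated class in the mask thumbnail.
for value, count in zip(*np.unique(mask_thumb, return_counts=True)):
    print(label_map.get(str(value), f"value {value}"), count)
```

Because WSIs and masks are released at the same ~0.5 µm/pixel base spacing, corresponding full-resolution patches can be read at level 0 with identical coordinates.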
Public datasets
This dataset includes images from two public sources:
- TCGA-BRCA (The Cancer Genome Atlas Breast Invasive Carcinoma)
- TIGER training set
Note: WSIs from TIGER (including the TCGA-BRCA subset) must be downloaded separately from AWS Open Data. WSIs from TCGA-BRCA not in TIGER are included here. Four TIGER slides (IDs TCGA-AC-A2QH, TCGA-OL-A97C, TCGA-AR-A5QQ, TCGA-E9-A5FL) were excluded from this dataset.
File ID nomenclature
For images from the public TCGA-BRCA and TIGER datasets, we retained the original anonymized filenames provided by the respective sources. For all other images, we assigned each patient a unique anonymous patient ID, incrementing from 1. Because a single patient may have multiple WSIs, WSIs and annotations are named according to patient ID and WSI ID using the convention <anonymous_patient_id>_<wsi_id>.<suffix>, for example 'patient1_wsi1.tif'. For the evaluation set, ROIs are additionally indexed by ROI ID, following the convention <anonymous_patient_id>_<wsi_id>_<roi_id>.<suffix>, for example 'patient1_wsi1_roi1.png'.
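For programmatic use, file names following this convention can be split back into their IDs with a small regular expression. The helper below is a purely illustrative sketch: it assumes IDs follow the 'patientN_wsiN' and 'patientN_wsiN_roiN' patterns shown in the examples and returns None for files that keep their original TCGA-BRCA or TIGER names.

```python
import re
from pathlib import Path

# 'patient<id>_wsi<id>' with an optional '_roi<id>' suffix (evaluation ROIs only).
_NAME_RE = re.compile(r"^patient(?P<patient_id>\d+)_wsi(?P<wsi_id>\d+)(?:_roi(?P<roi_id>\d+))?$")

def parse_beetle_name(path):
    """Return (patient_id, wsi_id, roi_id) for a BEETLE file name, or None otherwise."""
    match = _NAME_RE.match(Path(path).stem)
    if match is None:
        return None  # e.g. original TCGA-BRCA/TIGER file names
    ids = match.groupdict()
    return (int(ids["patient_id"]), int(ids["wsi_id"]),
            int(ids["roi_id"]) if ids["roi_id"] is not None else None)

print(parse_beetle_name("patient1_wsi1.tif"))       # -> (1, 1, None)
print(parse_beetle_name("patient1_wsi1_roi1.png"))  # -> (1, 1, 1)
```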
Dataset overview
Lastly, we include a 'data_overview.csv' file that documents per-WSI metadata. The table below lists the contents of each column; a short loading example follows the table.
| Column | Contents |
| --- | --- |
| 'patient_id' | Unique anonymous patient ID, see 'File ID nomenclature' |
| 'wsi_id' | WSI ID, see 'File ID nomenclature' |
| 'name' | Full name of the WSI, e.g., 'patient1_wsi1' |
| 'source' | Data source (clinical center): 'biopticka', 'jb', 'nki', 'rumc', 'scdc', 'sch', 'tcga', or 'uwmedicine' |
| 'specimen_type' | WSI specimen type: 'biopsy' or 'resection' |
| 'scanner' | Scanner used to digitize the image |
| 'wsi_path' | WSI path, starting with 'images/' (for non-TIGER images) |
| 'annotation_mask_path' | Path to the TIFF mask file (development set only), starting with 'annotations/' |
| 'annotation_xml_path' | Path to the XML annotation file (development set only), starting with 'annotations/' |
| 'annotation_json_path' | Path to the JSON annotation file (development set only), starting with 'annotations/' |
| 'split' | Dataset split: 'development' or 'evaluation' |
| 'validation_fold' | Validation fold of the 5-fold cross-validation (development set only) |
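As a usage sketch, the snippet below loads 'data_overview.csv' with pandas and selects the development-set WSIs held out in one cross-validation fold, pairing each with its annotation mask. Column names follow the table above; the exact values stored in 'validation_fold' (assumed here to include 1) are an assumption.

```python
import pandas as pd

overview = pd.read_csv("data_overview.csv")

# Development-set slides, with fold 1 used as the held-out validation fold
# (fold numbering assumed; check the actual values in 'validation_fold').
development = overview[overview["split"] == "development"]
fold1_val = development[development["validation_fold"] == 1]

# Pair each WSI with its pixel-wise TIFF annotation mask.
pairs = fold1_val[["wsi_path", "annotation_mask_path"]].dropna()
print(f"{len(pairs)} WSI/mask pairs in validation fold 1")
print(pairs.head())
```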
Additional details
Related works
- Is described by: preprint arXiv:2510.02037 (arXiv)
Funding
- Dutch Cancer Society: Computational Pathology for Improved Treatment Decision Making for Breast Cancer Patients - the COMMITMENT project (15386)
Software
- Repository URL: https://github.com/DIAGNijmegen/beetle
- Programming language: Python
- Development status: Active