Published October 4, 2025 | Version v1
Dataset Open

BEETLE: A multicentric dataset for training and benchmarking breast cancer segmentation in H&E slides

  • 1. Radboud University Medical Center
  • 2. Fondazione Bruno Kessler
  • 3. University of Trento
  • 4. Biopticka Laborator (Czechia)
  • 5. Charles University
  • 6. University of Washington
  • 7. Ospedale Santa Chiara
  • 8. Azienda Ospedaliera Universitaria Integrata Verona
  • 9. Ospedale Sacro Cuore Don Calabria
  • 10. Leiden University Medical Center
  • 11. Netherlands Cancer Institute - Antoni van Leeuwenhoek
  • 12. The Netherlands Cancer Institute
  • 13. Linköping University

Description

The BrEast cancEr hisTopathoLogy sEgmentation (BEETLE) dataset provides a development set and an external evaluation set for multiclass semantic segmentation of H&E-stained breast cancer whole-slide images (WSIs), covering all molecular subtypes and histological grades.

  • Development set: 587 biopsies and resections collected from three collaborating clinical centers and two public datasets, digitized using seven scanners. Pixel-level annotations are available for four tissue classes: invasive epithelium, non-invasive epithelium, necrosis, and other, with particular focus on morphologies underrepresented in existing datasets, such as ductal carcinoma in situ and dispersed lobular tumor cells.
  • External evaluation set: 54 biopsies and resections collected from three clinical centers and digitized with three scanners. In addition to the WSIs, 170 densely annotated regions of interest (ROIs) are provided as image tiles. The corresponding pixel-level annotations are not publicly released but are sequestered on the Grand Challenge platform, where submissions are evaluated on a public leaderboard to enable standardized and comparable benchmarking of breast cancer segmentation models.

Technical info

Repository content

This repository contains three zip files:

  • 'annotations.zip' - annotations for the development set, provided in three formats:

annotations.zip
├── jsons/ # JSON format with tissue compartments annotated as polygons 
├── label_map.json # mapping of pixel values to class labels
├── masks/ # multiresolution TIFF images with pixel-wise class labels
└── xmls/ # XML format with tissue compartments annotated as polygons

  • 'images.zip' - images for the development and evaluation sets:

images.zip
├── development/ 
│   └── wsis/ # whole-slide images for development
└── evaluation/
    ├── rois/ # PNG images of ROIs for evaluation
    └── wsis/ # whole-slide images for evaluation

  • 'model.zip' - weights of the final ensemble model used for the technical validation of the dataset.
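As a rough illustration of how 'label_map.json' could be combined with the masks, the sketch below counts pixels per class in a tiny stand-in mask. The JSON layout (class name mapped to an integer pixel value), the specific values, and the helper names are assumptions made for this example; check the actual file in 'annotations.zip' before relying on them.

```python
import json
from collections import Counter

# Hypothetical contents of label_map.json. The real file may use a
# different layout or different pixel values.
EXAMPLE_LABEL_MAP_JSON = """
{"invasive epithelium": 1, "non-invasive epithelium": 2,
 "necrosis": 3, "other": 4}
"""


def invert_label_map(label_map):
    """Turn {class_name: pixel_value} into {pixel_value: class_name}."""
    return {int(value): name for name, value in label_map.items()}


def count_classes(mask_rows, value_to_name):
    """Count pixels per class in a mask given as rows of integer values."""
    counts = Counter()
    for row in mask_rows:
        for value in row:
            # Values missing from the map are grouped as "unlabeled".
            counts[value_to_name.get(value, "unlabeled")] += 1
    return dict(counts)


value_to_name = invert_label_map(json.loads(EXAMPLE_LABEL_MAP_JSON))
toy_mask = [[0, 1, 1], [2, 4, 4]]  # stand-in for a small mask tile
class_counts = count_classes(toy_mask, value_to_name)
```

For real masks, the rows would come from a TIFF reader of your choice rather than a nested list.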

All data is released at a spacing of ~0.5 µm/pixel. Annotations in TIFF and XML formats are compatible with ASAP 2.1 Nightly. Both XML and JSON files contain the same annotations, but JSON is formatted for compatibility with nnU-Net-for-pathology-v2. The ROI PNG images include surrounding spatial context, allowing models to incorporate neighboring tissue architecture in their predictions, similar to whole-slide inference using a sliding-window approach.

Public datasets

This dataset includes images from two public sources: TCGA-BRCA and TIGER.

Note: WSIs from TIGER (including the TCGA-BRCA subset) must be downloaded separately from AWS Open Data. WSIs from TCGA-BRCA not in TIGER are included here. Four TIGER slides (IDs TCGA-AC-A2QH, TCGA-OL-A97C, TCGA-AR-A5QQ, TCGA-E9-A5FL) were excluded from this dataset.
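When combining separately downloaded TIGER slides with this dataset, the four excluded slides can be filtered out by ID. The following is a minimal sketch that assumes TIGER filenames begin with the slide ID; the example filenames and the function name are hypothetical.

```python
# Slide IDs listed in the note above as excluded from this dataset.
EXCLUDED_TIGER_IDS = (
    "TCGA-AC-A2QH",
    "TCGA-OL-A97C",
    "TCGA-AR-A5QQ",
    "TCGA-E9-A5FL",
)


def drop_excluded(filenames):
    """Keep only filenames that do not start with an excluded slide ID.

    Assumes the slide ID is a prefix of the filename.
    """
    return [
        name for name in filenames
        if not name.startswith(EXCLUDED_TIGER_IDS)
    ]


# Hypothetical filenames, for illustration only.
example_files = ["TCGA-AC-A2QH-example.tif", "TCGA-XX-0000-example.tif"]
kept_files = drop_excluded(example_files)
```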

File ID nomenclature

For images from the public TCGA-BRCA and TIGER datasets, we retained the original anonymized filenames provided by the respective sources. For all other images, we assigned each patient a unique anonymous patient ID, incrementing from 1. Because a single patient may have multiple WSIs, WSIs and annotations are named according to patient ID and WSI ID using the convention <anonymous_patient_id>_<wsi_id>.<suffix>, for example 'patient1_wsi1.tif'. For the evaluation set, ROIs are additionally indexed by ROI ID, following the convention <anonymous_patient_id>_<wsi_id>_<roi_id>.<suffix>, for example 'patient1_wsi1_roi1.png'.
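The naming convention can be parsed with a small regular expression. This sketch assumes that the ID tokens always look like those in the examples ('patient1', 'wsi1', 'roi1'); the `FileId` type and `parse_file_id` function are illustrative names, not part of the dataset tooling. Filenames retained from TCGA-BRCA and TIGER do not follow this pattern and return `None`.

```python
import re
from typing import NamedTuple, Optional


class FileId(NamedTuple):
    patient_id: str
    wsi_id: str
    roi_id: Optional[str]  # None for WSIs and annotations
    suffix: str


# <anonymous_patient_id>_<wsi_id>[_<roi_id>].<suffix>
_PATTERN = re.compile(r"^(patient\d+)_(wsi\d+)(?:_(roi\d+))?\.(\w+)$")


def parse_file_id(filename):
    """Split a filename into its ID parts, or return None if it
    does not follow the convention."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    return FileId(*match.groups())


wsi = parse_file_id("patient1_wsi1.tif")
roi = parse_file_id("patient1_wsi1_roi1.png")
```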

Dataset overview

Lastly, we include a 'data_overview.csv' file that documents metadata for each WSI. The table below describes the contents of each column.

Column                    Contents
'patient_id'              Unique anonymous patient ID, see 'File ID nomenclature'
'wsi_id'                  WSI ID, see 'File ID nomenclature'
'name'                    Full name of WSI, e.g., 'patient1_wsi1'
'source'                  (Clinical center) data source: 'biopticka', 'jb', 'nki', 'rumc', 'scdc', 'sch', 'tcga', or 'uwmedicine'
'specimen_type'           WSI specimen type: 'biopsy' or 'resection'
'scanner'                 Scanner used to digitize the image
'wsi_path'                WSI path starting with 'images/' (for non-TIGER images)
'annotation_mask_path'    Path to the TIFF mask file (development set only), starting with 'annotations/'
'annotation_xml_path'     Path to the XML annotation file (development set only), starting with 'annotations/'
'annotation_json_path'    Path to the JSON annotation file (development set only), starting with 'annotations/'
'split'                   Dataset split: development/evaluation
'validation_fold'         Validation fold of 5-fold cross-validation (development set only)
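As an example of using these columns, the sketch below selects training and validation WSIs for one cross-validation fold. It reads an inline sample instead of the real file, and it assumes that 'split' holds the literal strings 'development' and 'evaluation' and that 'validation_fold' holds integer fold indices; both assumptions should be checked against 'data_overview.csv'.

```python
import csv
import io

# Inline stand-in for data_overview.csv with only the columns used here.
# The values are made up for illustration.
SAMPLE_CSV = """name,split,validation_fold
patient1_wsi1,development,0
patient2_wsi1,development,1
patient3_wsi1,evaluation,
"""


def split_for_fold(rows, fold):
    """Return (train_names, val_names) for the given validation fold,
    using development-set rows only."""
    dev = [r for r in rows if r["split"] == "development"]
    val = [r["name"] for r in dev if r["validation_fold"] == str(fold)]
    train = [r["name"] for r in dev if r["validation_fold"] != str(fold)]
    return train, val


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
train_names, val_names = split_for_fold(rows, fold=0)
```

To run this on the real file, replace `io.StringIO(SAMPLE_CSV)` with an open file handle for 'data_overview.csv'.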

Files

Files (150.9 GB)

Checksum                                  Size
md5:20c3ed8ae74a392eb2b4ba2baf75494a      1.8 GB
md5:fa3093bec3a80e7f3464dcc7cbc36fad      176.3 kB
md5:7bfd8524615fc4914b9998b8bfc80f9e      147.2 GB
md5:c1c82ed123484cd8760b6343feeee14f      1.9 GB
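Given the size of the archives, it may be worth verifying each download against its listed MD5 checksum. A minimal sketch using Python's standard `hashlib` follows; the function names are illustrative, and the file is read in chunks so that large archives do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(path_to_download, "md5:20c3ed8ae74a392eb2b4ba2baf75494a")` would check a downloaded file against the first checksum listed above, where `path_to_download` is wherever you saved the corresponding file.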

Additional details

Related works

Is described by
Preprint: arXiv:2510.02037 (arXiv)

Funding

Dutch Cancer Society
Computational Pathology for Improved Treatment Decision Making for Breast Cancer Patients - the COMMITMENT project (award number 15386)

Software

Repository URL
https://github.com/DIAGNijmegen/beetle
Programming language
Python
Development Status
Active