BoneMarrowWSI-PediatricLeukemia: A Comprehensive Dataset of Bone Marrow Aspirate Smear Whole Slide Images with Expert Annotations and Clinical Data in Pediatric Leukemia

Höfener, Henning; Kock, Farina; Pontones, Martina Ayelén; Ghete, Tabita; Pfrang, David; Dickel, Nicholas; Kunz, Meik; Schacherer, Daniela; Clunie, David A; Fedorov, Andrey; Westphal, Max; Metzler, Markus

doi:10.5281/zenodo.16995570

Published September 2025 | Version v3

Dataset Open


1. Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany
2. Universitätsklinikum Erlangen
3. Friedrich-Alexander-Universität Erlangen-Nürnberg
4. Medical Informatics, Friedrich-Alexander University of Erlangen-Nürnberg, Erlangen, Germany
5. PixelMed Publishing
6. Brigham and Women's Hospital Department of Radiology
7. Fraunhofer Institute for Digital Medicine
8. Department of Pediatrics and Adolescent Medicine, University Hospital Erlangen, Erlangen, Germany

This dataset corresponds to a collection of images and/or image-derived data available from the National Cancer Institute Imaging Data Commons (IDC) [1]. The dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using the IDC Portal. To download the entire collection, use the IDC Portal or the manifests included in this Zenodo record, following the Download instructions below; to download images for specific cases, use the IDC Portal.

Collection Description

Image data: The dataset comprises bone marrow aspirate smear WSI for 246 pediatric cases (< 18 years) of leukemia, including acute lymphoid leukemia (ALL), acute myeloid leukemia (AML), and chronic myeloid leukemia (CML). The smears were prepared for the initial diagnosis (i.e., without prior treatment), stained in accordance with the Pappenheim method, and scanned at 40x magnification (without immersion), resulting in a resolution of 0.11x0.11 µm/pixel.
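As a rough illustration of what the stated 0.11 µm/pixel resolution means in practice, the following minimal sketch converts between pixel counts and physical lengths. The ROI width and cell diameter used here are hypothetical values chosen for illustration, not figures from the dataset; at this spacing a 2048-pixel-wide region spans roughly 225 µm, and a 10 µm cell covers about 91 pixels.

```python
# Pixel spacing stated in the collection description (micrometres per pixel).
MICRONS_PER_PIXEL = 0.11


def pixels_to_microns(n_pixels: float, spacing_um: float = MICRONS_PER_PIXEL) -> float:
    """Convert a length in pixels to micrometres."""
    return n_pixels * spacing_um


def microns_to_pixels(length_um: float, spacing_um: float = MICRONS_PER_PIXEL) -> float:
    """Convert a length in micrometres to (fractional) pixels."""
    return length_um / spacing_um


# Hypothetical example values, not taken from the dataset.
roi_width_um = pixels_to_microns(2048)      # width of a 2048-pixel ROI
cell_diameter_px = microns_to_pixels(10.0)  # extent of a 10 µm cell

print(f"ROI width: {roi_width_um:.2f} µm")
print(f"Cell diameter: {cell_diameter_px:.1f} px")
```

In real use, the pixel spacing should be read from the DICOM metadata of each image rather than hard-coded.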

Metadata: Additionally, clinical information (age group, sex, diagnosis) and laboratory data (blasts, white blood cell count, thrombocytes, LDH, uric acid, hemoglobin) are available for each case.

Annotations: The images have been annotated with rectangular regions of interest (ROI) within the evaluable monolayer area, and a total of 47176 cell bounding box annotations have been placed within the regions of interest. Cells have been annotated by multiple experts in a consensus labeling approach with 49 distinct cell type classes. This consensus approach entailed that each cell was sequentially annotated by multiple individuals until each cell had been labeled by at least two individuals, and the majority class was assigned in at least half of all annotations for that image. The labels from all annotation sessions, as well as the final consensus class for each cell, will be made available in the next IDC version.
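One possible reading of the consensus rule described above can be sketched as follows. This is not the authors' implementation: the function name, the parameters, and the class names in the usage example are hypothetical, and the handling of ties (treated here as "no consensus yet") is an assumption, since the description does not say how ties were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(
    labels: Sequence[str],
    min_annotators: int = 2,       # "labeled by at least two individuals"
    min_agreement: float = 0.5,    # "at least half of all annotations"
) -> Optional[str]:
    """Return the consensus class for one cell, or None if more annotation is needed."""
    if len(labels) < min_annotators:
        return None

    ranked = Counter(labels).most_common(2)
    top_class, top_count = ranked[0]

    # Assumption: a tie between the two most frequent classes is not a consensus.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None

    if top_count / len(labels) >= min_agreement:
        return top_class
    return None


# Hypothetical class names; the 49 real classes are not listed here.
print(consensus_label(["blast"]))                         # too few annotators
print(consensus_label(["blast", "blast"]))                # agreement
print(consensus_label(["blast", "lymphocyte"]))           # tie
print(consensus_label(["blast", "lymphocyte", "blast"]))  # majority
```

Under this sketch, a cell whose annotators disagree keeps receiving annotations until one class clearly leads and accounts for at least half of the labels.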

The accompanying preprint describes the study and the dataset in detail. Conversion into DICOM was done using the scripts in https://github.com/ImagingDataCommons/conversion_mirax_dicom, which rely on the `wsidicomizer` library.

Files included

Images were originally acquired as proprietary MIRAX files on 3DHISTECH scanners and were subsequently converted to standard DICOM format by the IDC team. Clinical data are contained in the DICOM metadata. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.

A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v22-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v22. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.

bonemarrowwsi_pediatricleukemia-idc_v22-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
bonemarrowwsi_pediatricleukemia-idc_v22-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets

Note that manifest files ending in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while those ending in -gcs.s5cmd reference files in Google Cloud Storage (GCS). The files themselves are identical and are mirrored between AWS and GCS.
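The naming convention described above (collection_id-idc_vNN-aws.s5cmd or -gcs.s5cmd) can be unpacked with a small helper. The following is a minimal sketch; the function name and the returned field names are hypothetical, and the pattern only covers the .s5cmd manifests discussed here.

```python
import re
from typing import Optional

# Pattern for "<collection_id>-idc_v<release>-<aws|gcs>.s5cmd"
MANIFEST_NAME = re.compile(
    r"^(?P<collection_id>.+)-idc_v(?P<release>\d+)-(?P<storage>aws|gcs)\.s5cmd$"
)


def parse_manifest_name(filename: str) -> Optional[dict]:
    """Split a manifest file name into collection id, IDC release, and storage provider."""
    match = MANIFEST_NAME.match(filename)
    if match is None:
        return None  # not an .s5cmd manifest following this convention
    return {
        "collection_id": match.group("collection_id"),
        "idc_release": int(match.group("release")),
        "storage": match.group("storage"),
    }


print(parse_manifest_name("bonemarrowwsi_pediatricleukemia-idc_v22-aws.s5cmd"))
```

The release number identifies the IDC data release in which that version of the collection was first introduced, so it can be used to tell apart manifests from successive versions of this record.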

In addition, lab values are available as a BigQuery table: tbd

Download instructions

Each of the manifests includes instructions in its header on how to download the included files.

To download the files using .s5cmd manifests:

1. Install the idc-index package: pip install --upgrade idc-index
2. Download the files referenced by a manifest included in this dataset by passing the .s5cmd manifest file to the idc command: idc download manifest.s5cmd
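Before starting a large download, it can help to inspect a manifest and to script the download step. The sketch below assumes that header lines in an .s5cmd manifest start with "#" and that every other non-empty line is one s5cmd command; this layout is an assumption and should be checked against the actual manifest header. The helper names, the sample bucket, and the folder names are hypothetical; only the idc download command itself comes from the instructions above.

```python
import subprocess
from typing import List


def summarize_manifest(text: str) -> dict:
    """Count header (comment) lines and command lines in manifest text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = [line for line in lines if line.startswith("#")]
    commands = [line for line in lines if not line.startswith("#")]
    return {"header_lines": len(header), "commands": len(commands)}


def build_download_command(manifest_path: str) -> List[str]:
    """Build the CLI call given in the download instructions."""
    return ["idc", "download", manifest_path]


def download(manifest_path: str) -> None:
    """Run the download; requires idc-index to be installed and network access."""
    subprocess.run(build_download_command(manifest_path), check=True)


# Hypothetical manifest content, for illustration only.
sample = """\
# header comment with instructions
# another header line
cp s3://example-bucket/series-folder-1/* .
cp s3://example-bucket/series-folder-2/* .
"""
print(summarize_manifest(sample))
print(build_download_command("manifest.s5cmd"))
```

Calling download("manifest.s5cmd") would then run the same command as step 2 above.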

To download the files using the .dcf manifest, see the manifest header.

References

1. Höfener, H., Kock, F., Pontones, M., Ghete, T., Pfrang, D., Dickel, N., Kunz, M., Schacherer, D. P., Clunie, D. A., Fedorov, A., Westphal, M. & Metzler, M. From data to diagnosis: A large, comprehensive bone marrow dataset and AI methods for childhood leukemia prediction. arXiv [cs.LG] (2025). at <http://arxiv.org/abs/2509.15895>

Acknowledgments

The authors thank Stefanie Barnickel, Nathalie Dollmann, Tatjana Flamann, Meinolf Suttorp, and Perdita Weller for the labelling of the cells.

The authors thank the following institutions for supplying BMA smears: University Hospital Augsburg (Univ.-Prof. Dr. Dr. med. Michael Frühwald), Charité Berlin - ALL-REZ BFM Study Group (PD Dr. med. Arend von Stackelberg), University Hospital at the TU Dresden (Prof. Dr. med. Meinolf Suttorp), University Hospital Essen - AML-BFM Study Group (Prof. Dr. Dirk Reinhardt), Technical University of Munich (Prof. Dr. med. Irene Teichert-von Lüttichau), University Hospital Würzburg (Prof. Dr. med. Matthias Eyrich).

This study was supported by a grant from the German Federal Ministry of Education and Research (FKZ: 031L0262A; BMDeep).

Preparation of the dataset for publication was partly supported by Federal funds from the National Cancer Institute, National Institutes of Health (Task Order No. HHSN26110071 under Contract HHSN261201500003I).

The entire dataset is made available in the National Cancer Institute Imaging Data Commons (https://imaging.datacommons.cancer.gov). If you have any questions about the dataset, please contact IDC support at support@canceridc.dev.

Files

Files (99.1 kB)

Name	MD5	Size
bonemarrowwsi_pediatricleukemia-idc_v22-aws.s5cmd	4db23568d848ee40c6ebf963d3bc1632	16.2 kB
bonemarrowwsi_pediatricleukemia-idc_v22-dcf.dcf	19b9f50bd4e17731fade585b718f83f0	66.8 kB
bonemarrowwsi_pediatricleukemia-idc_v22-gcs.s5cmd	a670e8528ba14dae0725166183696e50	16.2 kB

Additional details

Cites: Publication: 10.1148/rg.230180 (DOI)
Is described by: Preprint: 10.48550/arXiv.2509.15895 (DOI)
Is published in: Other: 10.25504/FAIRsharing.0b5a1d (DOI)

