Published September 1, 2021 | Version v1
Dataset Open

Histopathology images for end-to-end AI, based on TCGA-BRCA

  • 1. University Hospital RWTH Aachen

Description

These histopathological images are derived from the TCGA-BRCA breast cancer histology dataset at https://portal.gdc.cancer.gov/ (please check that website for the original data license). They can be used for end-to-end artificial intelligence (AI) workflows such as DeepMed (https://github.com/KatherLab/deepmed), which aim to predict high-level features directly from digital images with weakly supervised transfer learning. Here, we use two subsets of these digitized images:

1) TCGA-BRCA-A2: all images from Walter Reed National Military Medical Center (tissue source site code A2, N=100 images) in the TCGA-BRCA database (tcga-brca-a2-deepmed-tiles.zip)

2) TCGA-BRCA-E2: all images from Roswell Park Comprehensive Cancer Center (tissue source site code E2, N=90 images) in the TCGA-BRCA database (tcga-brca-e2-deepmed-tiles.zip)

See also https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tissue-source-site-codes
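
The tissue source site code is the second hyphen-separated field of a TCGA barcode, so the cohort of a slide can be read from its identifier. The following is a minimal sketch of that lookup. The function names and the example barcode are made up for illustration, and how identifiers actually appear in the archives or tables of this dataset is an assumption that should be checked against the files.

```python
import re

# A TCGA barcode starts with "TCGA-<site>-<participant>", where <site> is the
# two-character tissue source site code and <participant> has four characters.
_BARCODE = re.compile(r"^TCGA-([A-Z0-9]{2})-([A-Z0-9]{4})")

# The two cohorts described in this record, keyed by tissue source site code.
SITE_TO_COHORT = {"A2": "TCGA-BRCA-A2", "E2": "TCGA-BRCA-E2"}


def tissue_source_site(barcode: str) -> str:
    """Return the two-character tissue source site code of a TCGA barcode."""
    match = _BARCODE.match(barcode.strip().upper())
    if match is None:
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return match.group(1)


def patient_id(barcode: str) -> str:
    """Return the patient-level prefix, e.g. 'TCGA-A2-0000'."""
    match = _BARCODE.match(barcode.strip().upper())
    if match is None:
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return match.group(0)


def cohort_of(barcode: str) -> str:
    """Map a barcode to one of the two cohorts, or raise KeyError otherwise."""
    return SITE_TO_COHORT[tissue_source_site(barcode)]


# Hypothetical slide identifier, used only to show the call.
print(cohort_of("TCGA-A2-0000-01Z-00-DX1"))
```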

The images were preprocessed according to the Aachen Protocol for Deep Learning Histopathology, which is available at https://zenodo.org/record/3694994. Specifically, digital whole slide images (SVS format) of hematoxylin & eosin (H&E) stained slides were tessellated (without manual annotations) into tiles of 256x256 px at 1 µm/px. The tiles were then color-normalized using the Macenko method as described previously (https://www.nature.com/articles/s43018-020-0087-6) and saved as JPEG files. For the A2 cohort, an additional ZIP archive is provided which contains only 100 random image tiles per patient (tcga-brca-a2-deepmed-tiles_100.zip).

In addition, we provide a CLINI and a SLIDE table as defined in the "Aachen Protocol". The CLINI table contains clinico-pathological data for all included patients. It is derived from clinical information on www.cbioportal.org as well as from Thorsson et al. (https://pubmed.ncbi.nlm.nih.gov/29628290/).

We recommend using the A2 dataset for training and the E2 dataset for testing. Please cite the relevant papers if you re-use this dataset. More information is available on www.kather.ai
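
The tile geometry described above (256x256 px at 1 µm/px, i.e. 256 µm per tile edge) can be illustrated with a short calculation. The sketch below is not the Aachen Protocol implementation: it only computes where tiles would start on a slide scanned at a given native resolution, assuming a non-overlapping grid with incomplete edge tiles dropped. Background rejection, resampling to 256 px, and Macenko normalization are left out, and the function name is hypothetical.

```python
from typing import Iterator, Tuple

TILE_PX = 256      # tile edge length in output pixels
TARGET_MPP = 1.0   # output resolution in micrometres per pixel


def tile_origins(
    width_px: int, height_px: int, native_mpp: float
) -> Iterator[Tuple[int, int, int]]:
    """Yield (x, y, edge) for each tile, in native slide pixels.

    `edge` is the number of native pixels covered by one tile edge; each
    region would later be resampled to TILE_PX x TILE_PX pixels.
    Assumption: non-overlapping grid, incomplete edge tiles are dropped.
    """
    if native_mpp <= 0:
        raise ValueError("native_mpp must be positive")
    edge = round(TILE_PX * TARGET_MPP / native_mpp)
    for y in range(0, height_px - edge + 1, edge):
        for x in range(0, width_px - edge + 1, edge):
            yield x, y, edge


# A slide scanned at 0.25 µm/px: each 256 µm tile spans 1024 native pixels.
for origin in tile_origins(width_px=2048, height_px=1024, native_mpp=0.25):
    print(origin)
```

For a 2048x1024 px region at 0.25 µm/px this yields two tiles, starting at x=0 and x=1024.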

Files

Files (24.3 GB)

Checksum                               Size
md5:c696f1b7defb581db1557ae769833d25   153.2 kB
md5:85cfc54af021e2956cb1f472bc55c451   12.8 GB
md5:8e439785cb219c72d2b867e0411da577   464.0 MB
md5:3692048b393552f44646ce4ffae97351   13.5 kB
md5:4651c9f2f987e0fdbe9c13d41f3daf28   138.6 kB
md5:9b37112c14682e6f42a51e0ca336a1b5   11.1 GB
md5:62b6dd71e811b89f906fdc8d5907341b   13.0 kB
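
The md5 values in the file listing can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library. The helper names are made up for illustration, and since this listing does not show which file name belongs to which checksum, that pairing has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_listing("downloaded.zip", "md5:85cfc54af021e2956cb1f472bc55c451")` returns `True` only if the local file (here a placeholder name) has exactly that checksum.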