There is a newer version of the record available.

Published February 4, 2022 | Version v1
Dataset Open

Digital Pathology Dataset for Prostate Cancer Diagnosis

  • 1. Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  • 2. Department of Pathology, Tan Tock Seng Hospital, Singapore, Singapore
  • 3. School of Computing, National University of Singapore, Singapore, Singapore

Description

Links to code and bioRxiv pre-print:

1. Multi-lens Neural Machine (MLNM) Code

2. An AI-assisted Tool For Efficient Prostate Cancer Diagnosis (bioRxiv Pre-print)

Digitized hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) of 40 prostatectomy and 59 core needle biopsy specimens were collected from 99 prostate cancer patients at Tan Tock Seng Hospital, Singapore. Each specimen had one WSI, giving 99 WSIs in total. The H&E-stained slides were scanned at 40× magnification (specimen-level pixel size 0.25 μm × 0.25 μm) using an Aperio AT2 slide scanner (Leica Biosystems). Institutional review board approval was obtained from the hospital for this study, and all data were de-identified.

Prostate glandular structures in core needle biopsy slides were manually annotated and classified using the ASAP annotation tool. A senior pathologist reviewed 10% of the annotations in each slide, ensuring that the researcher received reference annotations from different regions of the core. Partial glands appearing at the edges of the biopsy cores were not annotated.

Patches of 512 × 512 pixels were cropped from the whole-slide images at 5×, 10×, 20×, and 40× magnification, with an annotated gland centered in each patch. This dataset contains these cropped images.
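As a rough illustration of what the four magnifications imply, the sketch below computes the region of the 40× scan that a 512 × 512 patch would cover, and its physical field of view. It assumes that lower magnifications correspond to a simple downsampling of the 40× scan (0.25 μm per pixel); the function names and the example gland-center coordinates are hypothetical, and this is not necessarily how the patches in this dataset were produced.

```python
# Sketch only: geometry of a gland-centered patch at different magnifications.
# Assumption: magnification M corresponds to downsampling the 40x scan by 40 / M.

BASE_MAGNIFICATION = 40     # scanning magnification stated in the description
BASE_PIXEL_SIZE_UM = 0.25   # pixel size at 40x, in micrometres
PATCH_SIZE = 512            # patch side length, in pixels


def crop_window(center_x, center_y, magnification):
    """Return (left, top, side) in 40x pixel coordinates.

    The window is the square region of the 40x scan that maps onto a
    PATCH_SIZE x PATCH_SIZE patch at the requested magnification, centered
    on (center_x, center_y). No clamping to slide bounds is performed.
    """
    downsample = BASE_MAGNIFICATION / magnification
    side = int(PATCH_SIZE * downsample)
    left = int(center_x - side / 2)
    top = int(center_y - side / 2)
    return left, top, side


def field_of_view_um(magnification):
    """Physical side length (micrometres) covered by one patch."""
    downsample = BASE_MAGNIFICATION / magnification
    return PATCH_SIZE * downsample * BASE_PIXEL_SIZE_UM


if __name__ == "__main__":
    # Hypothetical gland center in 40x pixel coordinates.
    for mag in (5, 10, 20, 40):
        print(mag, crop_window(10000, 20000, mag), field_of_view_um(mag))
```

Under these assumptions, each halving of the magnification doubles the side of the tissue region shown in a patch, so the 5× patch gives the widest context around the same gland and the 40× patch the finest detail.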

This dataset was used to train two AI models, one for gland segmentation (99 patients) and one for gland classification (46 patients). Tables 1 and 2 summarize the gland segmentation and gland classification datasets, respectively. The two corresponding sub-datasets are provided as two zip files:

  1. gland_segmentation_dataset.zip
  2. gland_classification_dataset.zip

Table 1: The number of slides and patches in the training, validation, and test sets for the gland segmentation task. There is one H&E-stained WSI for each prostatectomy or core needle biopsy specimen.

#Slides
                Train   Valid   Test    Total
Prostatectomy   17      8       15      40
Biopsy          26      13      20      59
Total           43      21      35      99

#Patches
                Train   Valid   Test    Total
Prostatectomy   7795    3753    7224    18772
Biopsy          5559    4028    5981    15568
Total           13354   7781    13205   34340
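The row and column totals in Table 1 can be cross-checked with a few lines of code. The sketch below copies the per-split counts from the table and recomputes the totals and the share of each split; the variable and function names are illustrative only.

```python
# Counts copied from Table 1, ordered as (train, valid, test).
slides = {"Prostatectomy": (17, 8, 15), "Biopsy": (26, 13, 20)}
patches = {"Prostatectomy": (7795, 3753, 7224), "Biopsy": (5559, 4028, 5981)}


def column_totals(table):
    """Sum each split (train, valid, test) across specimen types."""
    return tuple(sum(column) for column in zip(*table.values()))


def row_totals(table):
    """Sum across splits for each specimen type."""
    return {name: sum(counts) for name, counts in table.items()}


def split_fractions(table):
    """Share of the grand total that falls in each split, rounded to 3 places."""
    totals = column_totals(table)
    grand_total = sum(totals)
    return tuple(round(total / grand_total, 3) for total in totals)


if __name__ == "__main__":
    print(column_totals(slides), row_totals(slides), split_fractions(slides))
    print(column_totals(patches), row_totals(patches), split_fractions(patches))
```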

Table 2: The number of slides and patches in the training, validation, and test sets for the gland classification task. There is one H&E-stained WSI for each prostatectomy or core needle biopsy specimen. The gland classification datasets are subsets of the gland segmentation datasets. GS: Gleason Score. B: Benign. M: Malignant.

#Slides (GS 3+3:3+4:4+3)
         Train       Valid       Test        Total
Biopsy   10:9:1      3:7:0       6:10:0      19:26:1

#Patches (B:M)
         Train       Valid       Test        Total
Biopsy   1557:2277   1216:1341   1543:2718   4316:6336
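The B:M ratios in Table 2 indicate how balanced each split is. As a small worked example, the sketch below parses the ratio strings from the table and computes the malignant fraction per split; the helper names are illustrative and are not part of the dataset or its accompanying code.

```python
# Benign:malignant patch counts copied from Table 2.
patch_ratios = {
    "train": "1557:2277",
    "valid": "1216:1341",
    "test": "1543:2718",
}


def parse_ratio(text):
    """Turn a colon-separated ratio such as '1557:2277' into a tuple of ints."""
    return tuple(int(part) for part in text.split(":"))


def malignant_fraction(text):
    """Fraction of patches labelled malignant in a 'B:M' ratio string."""
    benign, malignant = parse_ratio(text)
    return malignant / (benign + malignant)


if __name__ == "__main__":
    for split, ratio in patch_ratios.items():
        print(split, parse_ratio(ratio), round(malignant_fraction(ratio), 3))

    # Totals across splits, for comparison with the "Total" column.
    benign_total = sum(parse_ratio(r)[0] for r in patch_ratios.values())
    malignant_total = sum(parse_ratio(r)[1] for r in patch_ratios.values())
    print(benign_total, malignant_total)
```

Malignant patches outnumber benign ones in every split, which may be worth keeping in mind when choosing evaluation metrics.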

NB: The gland classification archive (gland_classification_dataset.zip) may contain extra patches whose labels could not be identified from the H&E slides. These patches were not used in the machine learning study.

Notes

This study was funded by the Biomedical Research Council of the Agency for Science, Technology and Research, Singapore.

Files

Files (47.0 GB)

Size      Checksum
27.3 GB   md5:19f5031c6e814dc26e6d0f9be1f89247
19.7 GB   md5:6006209dd1df1e4e323ace69ec4fd9e5
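Because the archives are tens of gigabytes each, it may be worth verifying a download against the md5 value listed for it before unpacking. The sketch below uses Python's standard hashlib module and reads the file in chunks so that it never has to fit in memory; the function names and the local file path are hypothetical, and the expected checksum should be whichever value the record lists for the archive you downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python verify.py <archive.zip> md5:<hex>
    archive_path, expected_checksum = sys.argv[1], sys.argv[2]
    print("OK" if matches_checksum(archive_path, expected_checksum) else "MISMATCH")
```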