Digital Pathology Dataset for Prostate Cancer Diagnosis
Creators
- 1. Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- 2. Department of Pathology, Tan Tock Seng Hospital, Singapore, Singapore
- 3. School of Computing, National University of Singapore, Singapore, Singapore
Description
Links to code and bioRxiv pre-print:
1. Multi-lens Neural Machine (MLNM) Code
2. An AI-assisted Tool For Efficient Prostate Cancer Diagnosis (bioRxiv Pre-print)
Digitized hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) of 40 prostatectomy and 59 core needle biopsy specimens were collected from 99 prostate cancer patients at Tan Tock Seng Hospital, Singapore. Each specimen had one WSI, giving 99 WSIs in total. H&E-stained slides were scanned at 40× magnification (specimen-level pixel size 0.25 μm × 0.25 μm) using an Aperio AT2 slide scanner (Leica Biosystems). Institutional review board approval was obtained from the hospital for this study, and all data were de-identified.
Prostate glandular structures in core needle biopsy slides were manually annotated and classified using the ASAP annotation tool. A senior pathologist reviewed 10% of the annotations in each slide, ensuring that some reference annotations were provided to the researcher at different regions of the core. Note that partial glands appearing at the edges of the biopsy cores were not annotated.
Patches of size 512 × 512 pixels were cropped from the whole-slide images at magnifications of 5×, 10×, 20×, and 40×, with an annotated gland at the center of each patch. This dataset contains these cropped images.
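The relationship between magnification and the tissue area covered by a patch can be sketched as follows. This is a minimal illustration, not the authors' extraction code: it assumes the lower magnifications are obtained by downsampling the 40× scan (0.25 μm per pixel), and the names `PatchWindow` and `patch_window` are hypothetical. Clamping at slide borders is not handled.

```python
from dataclasses import dataclass

BASE_MAGNIFICATION = 40    # scan magnification stated in the description
BASE_PIXEL_SIZE_UM = 0.25  # pixel size at 40x stated in the description
PATCH_SIZE = 512           # patch edge length in pixels


@dataclass(frozen=True)
class PatchWindow:
    """Square region of the 40x image that maps to one 512 x 512 patch."""
    magnification: int
    downsample: float     # 40x pixels per patch pixel (assumed pyramid factor)
    pixel_size_um: float  # micrometres per patch pixel
    left: int             # top-left corner in 40x pixel coordinates
    top: int
    extent: int           # edge length of the region in 40x pixels


def patch_window(center_x: int, center_y: int, magnification: int) -> PatchWindow:
    """Return the 40x-coordinate window centred on a gland at (center_x, center_y)."""
    if magnification <= 0 or magnification > BASE_MAGNIFICATION:
        raise ValueError("magnification must be in (0, 40]")
    downsample = BASE_MAGNIFICATION / magnification
    extent = int(round(PATCH_SIZE * downsample))
    return PatchWindow(
        magnification=magnification,
        downsample=downsample,
        pixel_size_um=BASE_PIXEL_SIZE_UM * downsample,
        left=int(round(center_x - extent / 2)),
        top=int(round(center_y - extent / 2)),
        extent=extent,
    )


def field_of_view_um(magnification: int) -> float:
    """Physical edge length (micrometres) covered by one patch."""
    return PATCH_SIZE * BASE_PIXEL_SIZE_UM * BASE_MAGNIFICATION / magnification
```

Under these assumptions a patch covers 128 μm per side at 40×, and the side length doubles with each halving of magnification, so the 5× patch shows the same gland with roughly 1 mm of surrounding context.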
This dataset was used to train two AI models, one for gland segmentation (99 patients) and one for gland classification (46 patients). Tables 1 and 2 summarize the gland segmentation and gland classification datasets, respectively. The two corresponding sub-datasets are provided as two zip files:
- gland_segmentation_dataset.zip
- gland_classification_dataset.zip
Table 1: The number of slides and patches in the training, validation, and test sets for the gland segmentation task. There is one H&E-stained WSI for each prostatectomy or core needle biopsy specimen.
#Slides | Train | Valid | Test | Total |
---|---|---|---|---|
Prostatectomy | 17 | 8 | 15 | 40 |
Biopsy | 26 | 13 | 20 | 59 |
Total | 43 | 21 | 35 | 99 |

#Patches | Train | Valid | Test | Total |
---|---|---|---|---|
Prostatectomy | 7795 | 3753 | 7224 | 18772 |
Biopsy | 5559 | 4028 | 5981 | 15568 |
Total | 13354 | 7781 | 13205 | 34340 |
Table 2: The number of slides and patches in the training, validation, and test sets for the gland classification task. There is one H&E-stained WSI for each prostatectomy or core needle biopsy specimen. The gland classification datasets are subsets of the gland segmentation datasets. GS: Gleason Score. B: Benign. M: Malignant.
#Slides (GS 3+3:3+4:4+3) | Train | Valid | Test | Total |
---|---|---|---|---|
Biopsy | 10:9:1 | 3:7:0 | 6:10:0 | 19:26:1 |

#Patches (B:M) | Train | Valid | Test | Total |
---|---|---|---|---|
Biopsy | 1557:2277 | 1216:1341 | 1543:2718 | 4316:6336 |
NB: The gland classification archive (gland_classification_dataset.zip) may contain extra patches whose labels could not be determined from the H&E slides. These patches were not used in the machine learning study.
Files (47.0 GB)
MD5 checksum | Size |
---|---|
19f5031c6e814dc26e6d0f9be1f89247 | 27.3 GB |
6006209dd1df1e4e323ace69ec4fd9e5 | 19.7 GB |
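Because the archives are large, it may be worth checking a download against the MD5 checksums listed for the two files before unpacking. The sketch below is one way to do that with the Python standard library. The listing does not state which checksum belongs to which archive, so the check only tests membership in the published set; the function names are hypothetical.

```python
import hashlib

# MD5 checksums as listed for the two archives in this record.
# The listing does not say which archive each checksum belongs to.
PUBLISHED_MD5 = {
    "19f5031c6e814dc26e6d0f9be1f89247",
    "6006209dd1df1e4e323ace69ec4fd9e5",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """Return True if the file's MD5 equals one of the published checksums."""
    return md5_of_file(path) in PUBLISHED_MD5
```

For example, `matches_published_checksum("gland_segmentation_dataset.zip")` should return `True` for an intact download, assuming the file was saved under that name in the working directory. Reading in chunks keeps memory use flat regardless of archive size.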