Published December 25, 2024 | Version 0.1

HISTOPANTUM: Histological Pan-cancer Tumor image dataset

Description

 

HISTOPANTUM is a pan-cancer dataset of histology images, each labeled as Tumor or Non-Tumor, covering four cancer types (domains). It is designed to support domain generalization analysis for tumor detection and to serve as a benchmark for foundation models and domain generalization algorithms.

Dataset Overview

The dataset comprises histology images sourced from The Cancer Genome Atlas (TCGA), spanning the following four cancer types:

  • Colorectal Cancer
  • Ovarian Cancer
  • Stomach Cancer
  • Uterine Cancer

Image Specifications

  • Original Resolution: Patches of 512 × 512 pixels are extracted at a resolution of 0.5 microns per pixel.
  • Processed Size: Images are resized to 224 × 224 pixels and saved as JPEG files.
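The two specifications above imply an effective resolution for the stored images, which can matter when pairing the dataset with models trained at a particular magnification. The following is a minimal sketch of that arithmetic. It assumes the resize is a plain rescale of the whole 512 × 512 patch with no cropping, which the description does not state explicitly; the function names are illustrative only.

```python
def effective_mpp(source_mpp: float, source_px: int, target_px: int) -> float:
    """Microns per pixel after rescaling a square patch from source_px to target_px.

    Assumes the whole patch is rescaled (no cropping or padding).
    """
    return source_mpp * source_px / target_px


def field_of_view_um(source_mpp: float, source_px: int) -> float:
    """Physical side length, in microns, covered by one square patch."""
    return source_mpp * source_px


mpp = effective_mpp(source_mpp=0.5, source_px=512, target_px=224)
fov = field_of_view_um(source_mpp=0.5, source_px=512)

print(f"Effective resolution: {mpp:.3f} microns per pixel")  # about 1.143
print(f"Field of view: {fov:.0f} x {fov:.0f} microns")       # 256 x 256
```

Under that assumption, each 224 × 224 image covers the same 256 × 256 micron region as the original patch, at roughly 1.14 microns per pixel.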

The dataset is provided in four zipped files, each corresponding to one cancer type. Within each zip file, images are organized into two subfolders:

  • tumour
  • non-tumour
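The folder layout above maps naturally onto a list of (image path, domain, label) records. The sketch below assumes each zip file has been extracted into its own directory under a common root, so that images sit at `<root>/<cancer-type>/<tumour|non-tumour>/*.jpg`; the names of the per-cancer directories depend on how the archives are extracted and are not specified here. The leave-one-domain-out split is a common domain generalization protocol, shown for illustration and not necessarily the protocol of the linked publication.

```python
from pathlib import Path

# Subfolder names come from the dataset description; the integer labels are a choice.
LABELS = {"tumour": 1, "non-tumour": 0}


def index_dataset(root):
    """Return a sorted list of (path, domain, label) records.

    Assumes the layout <root>/<domain>/<tumour|non-tumour>/*.jpg,
    where <domain> is whatever directory name each archive was extracted to.
    """
    records = []
    for domain_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for class_name, label in LABELS.items():
            class_dir = domain_dir / class_name
            if not class_dir.is_dir():
                continue  # tolerate a missing class folder
            for image_path in sorted(class_dir.glob("*.jpg")):
                records.append((image_path, domain_dir.name, label))
    return records


def leave_one_domain_out(records, held_out):
    """Split records into (train, test), holding out one domain for testing."""
    train = [r for r in records if r[1] != held_out]
    test = [r for r in records if r[1] == held_out]
    return train, test
```

For example, `leave_one_domain_out(index_dataset("data"), "ovarian")` would train on three cancer types and test on the fourth, where `"data"` and `"ovarian"` are hypothetical directory names.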

Each image filename encodes the originating slide and the patch position within the slide, following this naming convention:

<TCGA-slide-name>_<x-pos>_<y-pos>.jpg
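A filename in this convention can be split back into its slide name and patch position, which is useful for grouping patches by slide so that patches from one slide do not end up in both training and test sets. The sketch below assumes the two positions are integers and splits from the right, so a slide name that itself contains underscores is still handled. The slide name in the usage example is a made-up placeholder, not a real file from the dataset.

```python
from pathlib import Path
from typing import NamedTuple


class PatchInfo(NamedTuple):
    slide: str  # originating slide name
    x: int      # patch x position within the slide
    y: int      # patch y position within the slide


def parse_patch_name(filename: str) -> PatchInfo:
    """Parse '<TCGA-slide-name>_<x-pos>_<y-pos>.jpg' into its three parts.

    Assumes x and y are integers; raises ValueError otherwise.
    """
    stem = Path(filename).stem             # drop directories and the '.jpg' suffix
    slide, x, y = stem.rsplit("_", 2)      # split only at the last two underscores
    return PatchInfo(slide, int(x), int(y))


# Hypothetical filename, for illustration only.
info = parse_patch_name("TCGA-XX-0000-01Z-00-DX1_1024_2048.jpg")
print(info.slide, info.x, info.y)
```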

Citation

If you use this dataset in your research, please cite the following publication:

@article{zamanitajeddin2024benchmarking,
  title={Benchmarking Domain Generalization Algorithms in Computational Pathology},
  author={Zamanitajeddin, Neda and Jahanifar, Mostafa and Xu, Kesi and Siraj, Fouzia and Rajpoot, Nasir},
  journal={arXiv preprint arXiv:2409.17063},
  year={2024}
}

For further details, please refer to the linked publication.

Files

Total size: 2.9 GB

Size       MD5 checksum
536.9 MB   a1e86f73595fe7533209a5bf19c1ef88
698.4 MB   37d975de8e107e1dcccb7afa6ae7a448
592.3 MB   8201304f0db67f8bd95ba7fbf0059776
1.1 GB     e35ba1f9beea459bfd28c33fa6bdf16b
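The MD5 checksums listed above can be used to confirm that each archive downloaded intact. The following is a minimal sketch using Python's standard `hashlib` module. The archive filename in the usage note is a placeholder, since the file names are not given in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file's MD5 matches `expected`.

    `expected` may be a bare hex digest or carry an 'md5:' prefix.
    """
    expected_hex = expected.strip().lower().split(":")[-1]
    return md5_of_file(path) == expected_hex
```

For example, `verify_md5("archive.zip", "md5:a1e86f73595fe7533209a5bf19c1ef88")` would return True only for an intact copy of the 536.9 MB archive, where `archive.zip` stands in for the actual downloaded file name.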