Published January 19, 2021 | Version 1.0
Dataset Open

Histology images from uniform tumor regions in TCGA Whole Slide Images (TCGA-UT)

  • 1. The University of Tokyo

Description

TCGA-UT Dataset Documentation 


Dataset Overview

The TCGA-UT dataset is a large-scale collection of histopathological image patches from human cancer tissues. It contains 1,608,060 image patches extracted from hematoxylin & eosin (H&E) stained histological samples across 32 different types of solid cancers.

Key Features

  • Size: Over 1.6 million image patches
  • Resolution: All patches are standardized to 256 x 256 pixels
  • Source: Derived from The Cancer Genome Atlas (TCGA) dataset
  • Quality: Curated by trained pathologists
  • Coverage: 32 different cancer types
  • Patient Base: 8,736 diagnostic slides from 7,175 patients

Data Collection Process

  1. Image Source: Whole Slide Images (WSI) were downloaded from the GDC legacy database between December 2016 and June 2017
  2. Expert Annotation: Two trained pathologists selected at least three representative tumor regions per slide
  3. Quality Control: 926 slides were removed due to various quality issues (poor staining, low resolution, focus problems, etc.)
  4. Patch Extraction: From each annotated region, 10 patches were randomly cropped at each of 6 magnification levels
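
As a rough consistency check on the numbers above, the following sketch works out how many patches the extraction step would yield. It assumes that 10 patches were cropped at each of the 6 magnification levels (60 per region), that every region produced its full set of patches, and that 8,736 is the number of slides kept after quality control. These are readings of the description, not statements confirmed by the dataset authors.

```python
# Figures taken from the dataset description.
PATCHES_PER_LEVEL = 10
MAGNIFICATION_LEVELS = 6
MIN_REGIONS_PER_SLIDE = 3
TOTAL_PATCHES = 1_608_060
TOTAL_SLIDES = 8_736

# Assumption: 10 patches at EACH of the 6 levels, so 60 patches per region.
patches_per_region = PATCHES_PER_LEVEL * MAGNIFICATION_LEVELS

# Lower bound per slide, given at least three annotated regions per slide.
min_patches_per_slide = patches_per_region * MIN_REGIONS_PER_SLIDE

# Number of regions implied by the total patch count under this assumption.
implied_regions, leftover = divmod(TOTAL_PATCHES, patches_per_region)
mean_regions_per_slide = implied_regions / TOTAL_SLIDES

print(f"patches per region:        {patches_per_region}")
print(f"minimum patches per slide: {min_patches_per_slide}")
print(f"implied annotated regions: {implied_regions} (leftover {leftover})")
print(f"mean regions per slide:    {mean_regions_per_slide:.2f}")
```

Under these assumptions the total patch count divides evenly by 60, and the implied average of just over three regions per slide agrees with the "at least three" annotation rule.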

File Structure

Files are organized using the following format:

 
[cancer_type]/[resolution]/[TCGA Barcode]/[region]-[number]-[pixel resolution].jpg
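
A minimal sketch of how a path in this layout might be split into its fields. The `PatchPath` class, the `parse_patch_path` function, and the example path are hypothetical; the sketch assumes that the `[region]`, `[number]`, and `[pixel resolution]` fields contain no hyphens, and it keeps every field as a string because the description does not specify their types.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class PatchPath:
    """Fields of one patch path (hypothetical helper, all values kept as text)."""
    cancer_type: str
    resolution: str
    barcode: str
    region: str
    number: str
    pixel_resolution: str


def parse_patch_path(path: str) -> PatchPath:
    """Split '[cancer_type]/[resolution]/[TCGA Barcode]/[region]-[number]-[pixel resolution].jpg'."""
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 path components: {path!r}")
    # Use the last four components so a leading extraction directory is tolerated.
    cancer_type, resolution, barcode, filename = parts[-4:]

    stem, _, extension = filename.rpartition(".")
    if not stem or extension.lower() != "jpg":
        raise ValueError(f"expected a .jpg file name: {filename!r}")

    # Assumption: the three file-name fields themselves contain no hyphens.
    fields = stem.split("-")
    if len(fields) != 3:
        raise ValueError(f"expected [region]-[number]-[pixel resolution]: {stem!r}")

    return PatchPath(cancer_type, resolution, barcode, *fields)


# Placeholder path for illustration only; the barcode is not a real identifier.
example = parse_patch_path(
    "Adrenocortical_carcinoma/0/TCGA-XX-XXXX-01Z-00-DX1/2-7-256.jpg"
)
print(example)
```

The TCGA barcode contains hyphens of its own, which is why the sketch splits on directory separators first and only then splits the file name.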

Resolution Key

  • 0: 0.5 μm/pixel
  • 1: 0.6 μm/pixel
  • 2: 0.7 μm/pixel
  • 3: 0.8 μm/pixel
  • 4: 0.9 μm/pixel
  • 5: 1.0 μm/pixel
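
The key above can be expressed as a small lookup table. The sketch below also estimates the physical width of tissue covered by one patch, assuming the listed μm/pixel value applies to the final 256 x 256 pixel patch; that assumption and the function names are illustrative, not part of the dataset's documentation.

```python
PATCH_SIZE_PX = 256

# Resolution index -> microns per pixel, copied from the resolution key.
MICRONS_PER_PIXEL = {0: 0.5, 1: 0.6, 2: 0.7, 3: 0.8, 4: 0.9, 5: 1.0}


def microns_per_pixel(resolution_index: int) -> float:
    """Look up the pixel size for a resolution index from the key."""
    try:
        return MICRONS_PER_PIXEL[resolution_index]
    except KeyError:
        raise ValueError(f"unknown resolution index: {resolution_index!r}") from None


def patch_side_microns(resolution_index: int, patch_size_px: int = PATCH_SIZE_PX) -> float:
    """Approximate side length of the tissue area covered by one patch, in microns."""
    return patch_size_px * microns_per_pixel(resolution_index)


for index in sorted(MICRONS_PER_PIXEL):
    print(f"resolution {index}: {microns_per_pixel(index)} um/pixel, "
          f"about {patch_side_microns(index):.1f} um per patch side")
```

Under that assumption, a patch spans roughly 128 μm per side at resolution 0 and 256 μm per side at resolution 5.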

License

Citation

If you use this dataset in your research, please cite:

 
Komura, D., et al. (2022). Universal encoding of pan-cancer histology by deep texture representations. Cell Reports 38, 110424. https://doi.org/10.1016/j.celrep.2022.110424

For Model Benchmarking

If you're interested in using this dataset for benchmarking foundation models or feature extractors, we recommend accessing the dataset through the Hugging Face Hub at dakomura/tcga-ut. The Hugging Face version provides:

  • Predefined train/validation/test splits (both internal and external facility-based splits)
  • Ready-to-use benchmarking framework for foundation models
  • WebDataset format support for efficient data loading
  • Example implementations for state-of-the-art model evaluation

 

Files

Adrenocortical_carcinoma.zip

Files (35.3 GB)

MD5 checksum and size of each file:

  • md5:180f5e9b1b5ee138367705b118acfefd (672.7 MB)
  • md5:c31a517074687dc1220c9eef163ce4b2 (1.4 GB)
  • md5:de9ebbb3e60a56655aa490f88ed949ed (3.2 GB)
  • md5:94d5694d01fef7444a79041c5b1e51b8 (3.1 GB)
  • md5:5918bfb858483b88eae4d50e1bb65d21 (792.6 MB)
  • md5:3ec989c8ecd446c7a82ec2f31bbca19f (121.2 MB)
  • md5:c9a0ef460c1caf65f8fd45d16a908d56 (1.0 GB)
  • md5:1a2fd729047c5e7ff78290b666abd500 (426.7 MB)
  • md5:63f3e13d6b80e5d3ff3e4958845b6844 (2.8 GB)
  • md5:f0efc1aa8195eae5d939a091a62622cf (1.4 GB)
  • md5:800643e1e700290585df6328d6c6b003 (319.1 MB)
  • md5:49bbf1a0c087e00a734241bebab9fb96 (1.7 GB)
  • md5:9fdba7d2764e5a640947a69eb70ab4e0 (890.5 MB)
  • md5:704cb8559a5b5ac27cd19011dc39662f (314 Bytes)
  • md5:8ee35b9d8c22f3cdb9ddc4f54aef680d (1.2 GB)
  • md5:66f33d49f267bbf4feaee93f63cb5b8e (2.1 GB)
  • md5:7949f641e1ac03c7554402ff088c48fc (2.1 GB)
  • md5:a6af17438563f33d7b2c6bad21e0618b (121.7 MB)
  • md5:a56c5731fb220064414f488d5fee52cc (254.0 MB)
  • md5:b0294f47f15a8f4359c632fe3f003f59 (333.9 MB)
  • md5:9055fd63f0a0cd4fb245ce1b683b77d0 (487.9 MB)
  • md5:54d8df72515b76c0b326b19aefb4c719 (189.6 MB)
  • md5:48bdeab653e8152b22a919e67e35afd2 (1.3 GB)
  • md5:2e685aa760e2e8d48ba66c993c30014b (223.8 MB)
  • md5:df3396c024862257f403d957bbc90f69 (1.8 GB)
  • md5:f7d1ea80ebd90ad0796e491389b25d51 (1.3 GB)
  • md5:7d0476e3e45ccccb933438d6bd65fcea (1.3 GB)
  • md5:87434d0edc53659be056c41c049629bd (808.6 MB)
  • md5:655a24e785b48ffc2cb79ab287df0192 (524.9 MB)
  • md5:fe078a0bbd2d4e3ac4cf1e45bc003e79 (1.4 GB)
  • md5:c1d4dffc1f2d50f3d5b3f2a0d2f55460 (288.8 MB)
  • md5:684c1d013e500b254f3ba35510c34452 (1.6 GB)
  • md5:d1dd6d3ebe5d4eccc6a0a293efe6482b (209.7 MB)
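
The MD5 values in the file listing can be used to check that a download completed intact. The sketch below is one way to do this with Python's standard `hashlib` module; the function names are hypothetical, and because the listing does not pair checksums with file names, the caller has to supply the checksum that belongs to the file being checked.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:180f5e9b...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Example call (the path and the checksum pairing are the caller's responsibility):
# ok = matches_listed_md5("downloads/some_archive.zip", "md5:<value from the listing>")
```

Reading in fixed-size chunks keeps memory use flat even for the multi-gigabyte archives in this record.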