Published January 19, 2021 | Version 1.0 | Dataset | Open
Histology images from uniform tumor regions in TCGA Whole Slide Images (TCGA-UT)
Description
TCGA-UT Dataset Documentation
Quick Links
- Dataset on Hugging Face: To benchmark foundation models or feature extractors, use TCGA-UT on the Hugging Face Hub (dakomura/tcga-ut)
- Original Paper: Universal encoding of pan-cancer histology by deep texture representations
Dataset Overview
The TCGA-UT dataset is a large-scale collection of histopathological image patches from human cancer tissues. It contains 1,608,060 image patches extracted from hematoxylin and eosin (H&E)-stained histological samples across 32 types of solid cancer.
Key Features
- Size: Over 1.6 million image patches
- Resolution: All patches are standardized to 256 x 256 pixels
- Source: Derived from The Cancer Genome Atlas (TCGA) dataset
- Quality: Curated by trained pathologists
- Coverage: 32 different cancer types
- Patient Base: 8,736 diagnostic slides from 7,175 patients
Data Collection Process
- Image Source: Whole Slide Images (WSIs) were downloaded from the GDC legacy database between December 2016 and June 2017
- Expert Annotation: Two trained pathologists selected at least three representative tumor regions per slide
- Quality Control: 926 slides were removed due to various quality issues (poor staining, low resolution, focus problems, etc.)
- Patch Extraction: 10 patches were randomly cropped at 6 different magnification levels from each annotated region
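The patch-extraction step can be read in two ways: 10 patches per region in total, or 10 patches at each of the 6 magnification levels. The sketch below checks both readings against the figures quoted in this description. It assumes that the 8,736 slides are the ones that contributed patches and that "at least three regions per slide" is a strict lower bound; neither assumption is confirmed here, so treat the result as a plausibility check rather than a statement of the authors' procedure.

```python
# Figures quoted in the dataset description.
TOTAL_PATCHES = 1_608_060
SLIDES = 8_736
MIN_REGIONS_PER_SLIDE = 3
PATCHES_PER_CROP = 10
MAGNIFICATION_LEVELS = 6


def patches_per_region(per_level: bool) -> int:
    """Patches per annotated region under the chosen reading (an assumption)."""
    return PATCHES_PER_CROP * (MAGNIFICATION_LEVELS if per_level else 1)


def min_total_patches(per_level: bool) -> int:
    """Lower bound on the dataset size if every slide has exactly three regions."""
    return SLIDES * MIN_REGIONS_PER_SLIDE * patches_per_region(per_level)


def implied_regions_per_slide(per_level: bool) -> float:
    """Average regions per slide needed to reach the quoted patch count."""
    return TOTAL_PATCHES / (SLIDES * patches_per_region(per_level))


for per_level in (True, False):
    label = "10 per level" if per_level else "10 in total"
    print(
        f"{label}: minimum {min_total_patches(per_level):,} patches, "
        f"about {implied_regions_per_slide(per_level):.2f} regions per slide"
    )
```

Under the "10 per level" reading, the quoted total corresponds to a whole number of regions (1,608,060 / 60 = 26,801) and to slightly more than three regions per slide, which fits "at least three". The "10 in total" reading would need roughly 18 regions per slide on average.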
File Structure
Files are organized using the following format:
[cancer_type]/[resolution]/[TCGA Barcode]/[region]-[number]-[pixel resolution].jpg
Resolution Key
- 0: 0.5 μm/pixel
- 1: 0.6 μm/pixel
- 2: 0.7 μm/pixel
- 3: 0.8 μm/pixel
- 4: 0.9 μm/pixel
- 5: 1.0 μm/pixel
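The path format and the resolution key can be combined to recover metadata from a patch path. The following is a minimal sketch, not an official loader: `PatchInfo`, `parse_patch_path`, and the example path are hypothetical names introduced for illustration. The region, number, and pixel-resolution fields are kept as strings because their exact encoding is not specified in this description.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Resolution key from the description: directory index -> micrometres per pixel.
MICRONS_PER_PIXEL = {0: 0.5, 1: 0.6, 2: 0.7, 3: 0.8, 4: 0.9, 5: 1.0}


@dataclass(frozen=True)
class PatchInfo:
    cancer_type: str
    resolution_index: int
    microns_per_pixel: float
    barcode: str
    region: str
    number: str
    pixel_resolution: str


def parse_patch_path(path: str) -> PatchInfo:
    """Parse [cancer_type]/[resolution]/[TCGA Barcode]/[region]-[number]-[pixel resolution].jpg."""
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 path components: {path!r}")
    # Use the last four components so that a leading extraction directory is tolerated.
    cancer_type, resolution, barcode, filename = parts[-4:]

    if not filename.lower().endswith(".jpg"):
        raise ValueError(f"expected a .jpg file: {filename!r}")
    # The barcode itself contains hyphens, so only the file name is split on "-".
    fields = filename[: -len(".jpg")].split("-")
    if len(fields) != 3:
        raise ValueError(f"expected region-number-pixel_resolution: {filename!r}")
    region, number, pixel_resolution = fields

    index = int(resolution)
    if index not in MICRONS_PER_PIXEL:
        raise ValueError(f"unknown resolution index: {index}")

    return PatchInfo(
        cancer_type=cancer_type,
        resolution_index=index,
        microns_per_pixel=MICRONS_PER_PIXEL[index],
        barcode=barcode,
        region=region,
        number=number,
        pixel_resolution=pixel_resolution,
    )


# Hypothetical path: the barcode and field values are placeholders, not real entries.
example = "Adrenocortical_carcinoma/2/TCGA-XX-XXXX-01Z-00-DX1/0-3-256.jpg"
info = parse_patch_path(example)
print(info.cancer_type, info.microns_per_pixel, info.barcode)
```

Grouping patches by `barcode` rather than by file is one way to keep patches from the same slide on the same side of a train/test split.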
License
- Non-Commercial Use: CC-BY-NC-SA 4.0
- Commercial Use: Please contact ishum-prm@m.u-tokyo.ac.jp for licensing
Citation
If you use this dataset in your research, please cite:
Komura, D., et al. (2022). Universal encoding of pan-cancer histology by deep texture representations.
Cell Reports 38, 110424. https://doi.org/10.1016/j.celrep.2022.110424
For Model Benchmarking
If you're interested in using this dataset for benchmarking foundation models or feature extractors, we recommend accessing the dataset through the Hugging Face Hub at dakomura/tcga-ut. The Hugging Face version provides:
- Predefined train/validation/test splits (both internal and external facility-based splits)
- Ready-to-use benchmarking framework for foundation models
- WebDataset format support for efficient data loading
- Example implementations for state-of-the-art model evaluation
Files (35.3 GB)
Preview: Adrenocortical_carcinoma.zip
Checksum | Size |
---|---|
md5:180f5e9b1b5ee138367705b118acfefd | 672.7 MB |
md5:c31a517074687dc1220c9eef163ce4b2 | 1.4 GB |
md5:de9ebbb3e60a56655aa490f88ed949ed | 3.2 GB |
md5:94d5694d01fef7444a79041c5b1e51b8 | 3.1 GB |
md5:5918bfb858483b88eae4d50e1bb65d21 | 792.6 MB |
md5:3ec989c8ecd446c7a82ec2f31bbca19f | 121.2 MB |
md5:c9a0ef460c1caf65f8fd45d16a908d56 | 1.0 GB |
md5:1a2fd729047c5e7ff78290b666abd500 | 426.7 MB |
md5:63f3e13d6b80e5d3ff3e4958845b6844 | 2.8 GB |
md5:f0efc1aa8195eae5d939a091a62622cf | 1.4 GB |
md5:800643e1e700290585df6328d6c6b003 | 319.1 MB |
md5:49bbf1a0c087e00a734241bebab9fb96 | 1.7 GB |
md5:9fdba7d2764e5a640947a69eb70ab4e0 | 890.5 MB |
md5:704cb8559a5b5ac27cd19011dc39662f | 314 Bytes |
md5:8ee35b9d8c22f3cdb9ddc4f54aef680d | 1.2 GB |
md5:66f33d49f267bbf4feaee93f63cb5b8e | 2.1 GB |
md5:7949f641e1ac03c7554402ff088c48fc | 2.1 GB |
md5:a6af17438563f33d7b2c6bad21e0618b | 121.7 MB |
md5:a56c5731fb220064414f488d5fee52cc | 254.0 MB |
md5:b0294f47f15a8f4359c632fe3f003f59 | 333.9 MB |
md5:9055fd63f0a0cd4fb245ce1b683b77d0 | 487.9 MB |
md5:54d8df72515b76c0b326b19aefb4c719 | 189.6 MB |
md5:48bdeab653e8152b22a919e67e35afd2 | 1.3 GB |
md5:2e685aa760e2e8d48ba66c993c30014b | 223.8 MB |
md5:df3396c024862257f403d957bbc90f69 | 1.8 GB |
md5:f7d1ea80ebd90ad0796e491389b25d51 | 1.3 GB |
md5:7d0476e3e45ccccb933438d6bd65fcea | 1.3 GB |
md5:87434d0edc53659be056c41c049629bd | 808.6 MB |
md5:655a24e785b48ffc2cb79ab287df0192 | 524.9 MB |
md5:fe078a0bbd2d4e3ac4cf1e45bc003e79 | 1.4 GB |
md5:c1d4dffc1f2d50f3d5b3f2a0d2f55460 | 288.8 MB |
md5:684c1d013e500b254f3ba35510c34452 | 1.6 GB |
md5:d1dd6d3ebe5d4eccc6a0a293efe6482b | 209.7 MB |
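Because the archives are large, it may be worth checking each download against its listed MD5 checksum before unpacking. The sketch below uses only the Python standard library. `md5_of_file` and `matches_listed_md5` are hypothetical helper names, and the file name in the usage comment is a placeholder: the listing does not show which archive each checksum belongs to, so the pairing has to be taken from the download page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file's digest with a listed value such as 'md5:180f...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (placeholder file name and checksum pairing):
# ok = matches_listed_md5("some_archive.zip", "md5:180f5e9b1b5ee138367705b118acfefd")
# print("checksum OK" if ok else "checksum mismatch, download again")
```

MD5 is adequate here for detecting truncated or corrupted downloads; it is not a safeguard against deliberate tampering.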