Published August 16, 2024 | Version v1
Dataset Open

Dataset of histopathological image crops from GTEx project

  • 1. CeMM Research Center for Molecular Medicine

Description

This is a dataset of image tiles cropped from histological slides of the GTEx project. Slides were selected so that the dataset is balanced for 3 major factors (organ, sex, and age bracket), which may make it useful for training models in supervised or self-supervised modes.

Four datasets are available:

  • gtex_histology_balanced_3_slides_200_tiles.tar.gz: Conditioned on the 3 factors, 3 slides were selected per group, and 200 tiles were randomly selected per slide from tissue-segmented areas.
  • gtex_histology_balanced_3_slides_2000_tiles.tar.gz: Conditioned on the 3 factors, 3 slides were selected per group, and 2000 tiles were randomly selected per slide from tissue-segmented areas.
  • gtex_histology_balanced_10_slides_100_tiles.tar.gz: Conditioned on the 3 factors, 10 slides were selected per group (when possible), and 100 tiles were randomly selected per slide from tissue-segmented areas. This dataset closely matches the "gtex_histology_balanced_3_slides_200_tiles.tar.gz" dataset in total number of tiles.
  • gtex_histology_balanced_10_slides_800_tiles.tar.gz: Conditioned on the 3 factors, 10 slides were selected per group (when possible), and 800 tiles were randomly selected per slide from tissue-segmented areas. This dataset closely matches the "gtex_histology_balanced_3_slides_2000_tiles.tar.gz" dataset in total number of tiles.

Each archive file contains the following:

  • slide_annotation.csv: a slide-level annotation of the slides (see below)
  • train: a directory with image tiles to be used to train a model
  • valid: a directory with image tiles to be used to validate a model
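
To check that a downloaded archive has the layout listed above without unpacking it, the members can be scanned in place. The following is a minimal sketch using only the Python standard library. The function name summarize_archive is hypothetical, and the code assumes nothing about whether the archive wraps its contents in a top-level folder: it only looks at the immediate parent directory of each JPEG file.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(archive_path):
    """Count JPEG tiles per parent directory and collect CSV member names.

    Hypothetical helper: it reads the archive index only and extracts nothing.
    """
    tiles_per_dir = Counter()
    csv_members = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and links
            path = PurePosixPath(member.name)
            suffix = path.suffix.lower()
            if suffix == ".jpg":
                # Key by the immediate parent, e.g. "train" or "valid"
                tiles_per_dir[path.parent.name] += 1
            elif suffix == ".csv":
                csv_members.append(member.name)
    return tiles_per_dir, csv_members
```

Calling summarize_archive("gtex_histology_balanced_3_slides_200_tiles.tar.gz") should then report tile counts keyed by "train" and "valid", along with the path of the annotation file inside the archive.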

The slide_annotation.csv file contains publicly available information on the slides, plus 3 additional columns:

  • "Tissue_simple": the organ of the slide
  • "split": whether the slide was assigned to the 'train' or 'valid' split. Slides in the validation split have 1/10th as many tiles as slides in the training split.
  • "n_tiles": the number of image tiles in the dataset for each slide

Example:

Tissue Sample ID | Tissue | Subject ID | Sex | Age Bracket | Hardy Scale | Pathology Categories | Pathology Notes | Tissue_simple | split | n_tiles
GTEX-1128S-1426 | Esophagus - Mucosa | GTEX-1128S | female | 60-69 | Fast death - natural causes | | 6 pieces, near- total autolysis/mucosa completely sloughed | Esophagus | train | 200
GTEX-113JC-1226 | Stomach | GTEX-113JC | female | 50-59 | Fast death - natural causes | | 6 pieces, well dissected mucosa; some areas are severely autolyzed | Stomach | valid | 20
GTEX-1192W-2526 | Muscle - Skeletal | GTEX-1192W | male | 60-69 | Fast death - natural causes | | 2 pieces, ~10-20% interstitial fat, rep foci delineated | Muscle | train | 200
GTEX-1192X-0426 | Muscle - Skeletal | GTEX-1192X | male | 50-59 | Slow death | | 2 pieces, 5-10% interstitial fat, rep. foci delineated | Muscle | valid | 20
GTEX-11DXX-1326 | Stomach | GTEX-11DXX | female | 60-69 | Ventilator case | gastritis | 6 pieces, mild chronic active gastritis | Stomach | train | 200
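
The "split" and "n_tiles" columns can be used to check how many tiles each split should contain. The sketch below assumes that slide_annotation.csv is comma-delimited with a header row using the column names shown in the example; the delimiter and quoting are not documented here and are assumptions, and tiles_per_split is a hypothetical helper name.

```python
import csv
from collections import defaultdict


def tiles_per_split(annotation_csv):
    """Sum the n_tiles column for each value of the split column.

    Assumes a comma-delimited file with 'split' and 'n_tiles' headers.
    """
    totals = defaultdict(int)
    with open(annotation_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            totals[row["split"]] += int(row["n_tiles"])
    return dict(totals)
```

The returned dictionary maps each split name to its total tile count, which can be compared with the number of JPEG files actually present in the train and valid directories.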

Inside train and valid are JPEG files named with the following convention: <Tissue Sample ID>.<Tissue_simple>.<Sex>.<Age Bracket>.<Y position>.<X position>.jpg, such that the origin of each crop can be traced and the file name can serve as a direct class label if desired.

Examples: "GTEX-ZYT6-1326.Pancreas.male.30-39.47492.16064.jpg", "GTEX-WWYW-2726.Ovary.female.50-59.5024.15008.jpg".
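
Because the labels are encoded in the file name, they can be recovered by splitting on the "." separator. The following is a minimal sketch; the TileInfo type and parse_tile_name function are hypothetical names, and the code assumes that none of the fields themselves contain a period, which holds for the two examples above.

```python
from pathlib import PurePath
from typing import NamedTuple


class TileInfo(NamedTuple):
    sample_id: str
    tissue: str
    sex: str
    age_bracket: str
    y: int
    x: int


def parse_tile_name(file_name):
    """Split a tile file name into its six fields (Y position precedes X)."""
    name = PurePath(file_name).name  # drop any leading directories
    parts = name.split(".")
    if len(parts) != 7 or parts[-1].lower() != "jpg":
        raise ValueError(f"unexpected tile file name: {file_name!r}")
    sample_id, tissue, sex, age_bracket, y, x, _ext = parts
    return TileInfo(sample_id, tissue, sex, age_bracket, int(y), int(x))
```

For instance, parse_tile_name("train/GTEX-ZYT6-1326.Pancreas.male.30-39.47492.16064.jpg").tissue gives "Pancreas", which can be used directly as an organ label.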

Files

Files (28.5 GB)

Checksum | Size
md5:be04549f4ce456e0cebd6876881e48cd | 1.8 GB
md5:1d898f3c47e74180507c431805aafbc3 | 14.5 GB
md5:2b4612cf834b712c1f2f9f7293345975 | 10.9 GB
md5:61b2dc6aa35fb72c075bafe36a3584de | 1.2 GB
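
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The sketch below computes the digest in chunks so that multi-gigabyte archives do not need to fit in memory; md5_of_file is a hypothetical helper name. The listing does not state which archive each checksum belongs to, so the expected value has to be taken from the entry for the file actually downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is a lowercase hexadecimal string that can be compared with the part of the listed checksum after the "md5:" prefix.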

Additional details

Related works

Compiles
Journal article: 10.1038/ng.2653 (DOI)
Journal article: 10.1126/science.aaz1776 (DOI)
Is part of
Journal article: 10.1038/ng.2653 (DOI)
Journal article: 10.1126/science.aaz1776 (DOI)

References

  • GTEx Consortium. Lonsdale J., Thomas J., Salvatore M., Phillips R., Lo E., Shad S., Hasz R., Walters G., Garcia F., Young N., Foster B., Moser M., Karasik E., Gillard B., Ramsey K., Sullivan S., Bridge J., Magazine H., Syron J., Fleming J., Siminoff L., Traino H., Mosavel M., Barker L., Jewell S., Rohrer D., Maxim D., Filkins D., Harbach P., Cortadillo E., Berghuis B., Turner L., Hudson E., Feenstra K., Sobin L., Robb J., Branton P., Korzeniewski G., Shive C., Tabor D., Qi L., Groch K., Nampally S., Buia S., Zimmerman A., Smith A., Burges R., Robinson K., Valentino K., Bradbury D., Cosentino M., Diaz-Mayoral N., Kennedy M., Engel T., Williams P., Erickson K., Ardlie K., Winckler W., Getz G., DeLuca D., MacArthur D., Kellis M., Thomson A., Young T., Gelfand E., Donovan M., Meng Y., Grant G., Mash D., Marcus Y., Basile M., Liu J., Zhu J., Tu Z., Cox N. J., Nicolae D. L., Gamazon E. R., Im H. K., Konkashbaev A., Pritchard J., Stevens M., Flutre T., Wen X., Dermitzakis E. T., Lappalainen T., Guigo R., Monlong J., Sammeth M., Koller D., Battle A., Mostafavi S., McCarthy M., Rivas M., Maller J., Rusyn I., Nobel A., Wright F., Shabalin A., Feolo M., Sharopova N., Sturcke A., Paschal J., Anderson J. M., Wilder E. L., Derr L. K., Green E. D., Struewing J. P., Temple G., Volpi S., Boyer J. T., Thomson E. J., Guyer M. S., Ng C., Abdallah A., Colantuoni D., Insel T. R., Koester S. E., Little A. R., Bender P. K., Lehner T., Yao Y., Compton C. C., Vaught J. B., Sawyer S., Lockhart N. C., Demchok J., Moore H. F., Nat. Genet. 45, 580–585 (2013). doi:10.1038/ng.2653
  • GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020). doi:10.1126/science.aaz1776