Dataset of histopathological image crops from GTEx project
Description
This is a dataset of histological slides from the GTEx project that has been balanced for 3 major factors (organ, sex, and age bracket) that may be useful to train models in supervised or self-supervised modes.
Four datasets are available:
gtex_histology_balanced_3_slides_200_tiles.tar.gz
: Conditioned on the 3 factors, 3 slides were selected per group, and 200 tiles in tissue segmented areas selected randomly per slide.
gtex_histology_balanced_3_slides_2000_tiles.tar.gz
: Conditioned on the 3 factors, 3 slides were selected per group, and 2000 tiles in tissue segmented areas selected randomly per slide.
gtex_histology_balanced_10_slides_100_tiles.tar.gz
: Conditioned on the 3 factors, 10 slides were selected per group (when possible), and 100 tiles in tissue segmented areas selected randomly per slide. This dataset matches closely the "gtex_histology_balanced_3_slides_200_tiles.tar.gz" dataset in total number of tiles.
gtex_histology_balanced_10_slides_800_tiles.tar.gz
: Conditioned on the 3 factors, 10 slides were selected per group (when possible), and 800 tiles in tissue segmented areas selected randomly per slide. This dataset matches closely the "gtex_histology_balanced_3_slides_2000_tiles.tar.gz" dataset in total number of tiles.
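To compare the archives at a glance, the two numbers in each archive name can be multiplied to give a nominal tile count per (organ, sex, age bracket) group. The sketch below is only an illustration: the function name `nominal_tiles_per_group` is hypothetical, and the result is an upper bound, since fewer than 10 slides may be available for some groups and validation slides carry fewer tiles.

```python
import re

# Archive names as listed in this description.
ARCHIVES = [
    "gtex_histology_balanced_3_slides_200_tiles.tar.gz",
    "gtex_histology_balanced_3_slides_2000_tiles.tar.gz",
    "gtex_histology_balanced_10_slides_100_tiles.tar.gz",
    "gtex_histology_balanced_10_slides_800_tiles.tar.gz",
]

# Captures the slides-per-group and tiles-per-slide numbers from a name.
NAME_PATTERN = re.compile(r"balanced_(\d+)_slides_(\d+)_tiles")


def nominal_tiles_per_group(archive_name):
    """Return slides per group times tiles per slide (a nominal upper bound)."""
    match = NAME_PATTERN.search(archive_name)
    if match is None:
        raise ValueError(f"unrecognised archive name: {archive_name!r}")
    slides, tiles = (int(value) for value in match.groups())
    return slides * tiles


if __name__ == "__main__":
    for name in ARCHIVES:
        print(name, nominal_tiles_per_group(name))
```

For the four names above this gives 600, 6000, 1000, and 8000 tiles per group respectively.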
Each archive file contains the following:
slide_annotation.csv
: a slide-level annotation of the slides (see below)
train
: a directory with image tiles to be used to train a model
valid
: a directory with image tiles to be used to validate a model
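A downloaded archive can be inspected without unpacking it. The following is a minimal sketch using Python's standard `tarfile` module; it assumes the tiles sit directly inside directories named train and valid, at any depth in the archive, and the helper name `count_tiles_by_split` is hypothetical.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_tiles_by_split(archive_path):
    """Count .jpg members whose parent directory is 'train' or 'valid'."""
    counts = Counter()
    # "r:*" lets tarfile detect the compression (gzip for .tar.gz).
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            if path.suffix.lower() != ".jpg":
                continue
            # Assumption: the immediate parent directory names the split.
            split = path.parent.name
            if split in ("train", "valid"):
                counts[split] += 1
    return counts
```

Calling `count_tiles_by_split` on one of the archives returns a `Counter` with one entry per split found.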
The slide_annotation.csv file contains publicly available information on the slides, plus 3 additional columns:
- "Tissue_simple": the organ of the slide
- "split": whether the slide was assigned to the 'train' or 'valid' split. Each validation slide has 1/10th as many tiles as a training slide.
- "n_tiles": the number of image tiles in the dataset for each slide
Example:
Tissue Sample ID | Tissue | Subject ID | Sex | Age Bracket | Hardy Scale | Pathology Categories | Pathology Notes | Tissue_simple | split | n_tiles |
---|---|---|---|---|---|---|---|---|---|---|
GTEX-1128S-1426 | Esophagus - Mucosa | GTEX-1128S | female | 60-69 | Fast death - natural causes |  | 6 pieces, near- total autolysis/mucosa completely sloughed | Esophagus | train | 200 |
GTEX-113JC-1226 | Stomach | GTEX-113JC | female | 50-59 | Fast death - natural causes |  | 6 pieces, well dissected mucosa; some areas are severely autolyzed | Stomach | valid | 20 |
GTEX-1192W-2526 | Muscle - Skeletal | GTEX-1192W | male | 60-69 | Fast death - natural causes |  | 2 pieces, ~10-20% interstitial fat, rep foci delineated | Muscle | train | 200 |
GTEX-1192X-0426 | Muscle - Skeletal | GTEX-1192X | male | 50-59 | Slow death |  | 2 pieces, 5-10% interstitial fat, rep. foci delineated | Muscle | valid | 20 |
GTEX-11DXX-1326 | Stomach | GTEX-11DXX | female | 60-69 | Ventilator case | gastritis | 6 pieces, mild chronic active gastritis | Stomach | train | 200 |
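The annotation file can be used to check how many tiles each split contributes. The sketch below relies only on the "split" and "n_tiles" columns described above; it assumes the file is comma-delimited with a header row, and the function name `tiles_per_split` is hypothetical.

```python
import csv
from collections import defaultdict


def tiles_per_split(csv_path):
    """Sum the n_tiles column for each value of the split column."""
    totals = defaultdict(int)
    # newline="" is the recommended way to open files for the csv module.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            totals[row["split"]] += int(row["n_tiles"])
    return dict(totals)
```

The same loop can be extended to group by "Tissue_simple", "Sex", or "Age Bracket" to confirm the balance of a given archive.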
Inside train and valid are JPEG files named with the following convention: <Tissue Sample ID>.<Tissue_simple>.<Sex>.<Age Bracket>.<Y position>.<X position>.jpg
so that the origin of each crop can be traced and the file name can serve as a direct class label if desired.
Examples: "GTEX-ZYT6-1326.Pancreas.male.30-39.47492.16064.jpg", "GTEX-WWYW-2726.Ovary.female.50-59.5024.15008.jpg".
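Because the labels are encoded in the file name, they can be recovered by splitting on dots. This is a minimal sketch that assumes none of the fields contains a dot itself, as in the examples above; `TileInfo` and `parse_tile_name` are hypothetical names chosen for illustration.

```python
from pathlib import Path
from typing import NamedTuple


class TileInfo(NamedTuple):
    sample_id: str
    tissue_simple: str
    sex: str
    age_bracket: str
    y: int
    x: int


def parse_tile_name(file_name):
    """Split a tile file name into its six dot-separated fields."""
    parts = Path(file_name).name.split(".")
    # Six fields plus the "jpg" extension are expected.
    if len(parts) != 7 or parts[-1].lower() != "jpg":
        raise ValueError(f"unexpected tile name: {file_name!r}")
    sample_id, tissue_simple, sex, age_bracket, y, x, _ext = parts
    return TileInfo(sample_id, tissue_simple, sex, age_bracket, int(y), int(x))


if __name__ == "__main__":
    print(parse_tile_name("GTEX-ZYT6-1326.Pancreas.male.30-39.47492.16064.jpg"))
```

Any of the `tissue_simple`, `sex`, or `age_bracket` fields can then be used as the class label for supervised training.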
Files
(28.5 GB)
Checksum | Size |
---|---|
md5:be04549f4ce456e0cebd6876881e48cd | 1.8 GB |
md5:1d898f3c47e74180507c431805aafbc3 | 14.5 GB |
md5:2b4612cf834b712c1f2f9f7293345975 | 10.9 GB |
md5:61b2dc6aa35fb72c075bafe36a3584de | 1.2 GB |
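The listed MD5 checksums can be used to verify a download. The sketch below uses Python's standard `hashlib` and reads the file in chunks so that multi-gigabyte archives do not need to fit in memory. The function names are hypothetical, and which checksum belongs to which archive has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against 'md5:<hex>' or a bare hex digest."""
    expected_hex = expected.strip().lower().split(":", 1)[-1]
    return md5_of_file(path) == expected_hex
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.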
Additional details
Related works
- Compiles
- Journal article: 10.1038/ng.2653 (DOI)
- Journal article: 10.1126/science.aaz1776 (DOI)
- Is part of
- Journal article: 10.1038/ng.2653 (DOI)
- Journal article: 10.1126/science.aaz1776 (DOI)
References
- GTEx Consortium. Lonsdale J., Thomas J., Salvatore M., Phillips R., Lo E., Shad S., Hasz R., Walters G., Garcia F., Young N., Foster B., Moser M., Karasik E., Gillard B., Ramsey K., Sullivan S., Bridge J., Magazine H., Syron J., Fleming J., Siminoff L., Traino H., Mosavel M., Barker L., Jewell S., Rohrer D., Maxim D., Filkins D., Harbach P., Cortadillo E., Berghuis B., Turner L., Hudson E., Feenstra K., Sobin L., Robb J., Branton P., Korzeniewski G., Shive C., Tabor D., Qi L., Groch K., Nampally S., Buia S., Zimmerman A., Smith A., Burges R., Robinson K., Valentino K., Bradbury D., Cosentino M., Diaz-Mayoral N., Kennedy M., Engel T., Williams P., Erickson K., Ardlie K., Winckler W., Getz G., DeLuca D., MacArthur D., Kellis M., Thomson A., Young T., Gelfand E., Donovan M., Meng Y., Grant G., Mash D., Marcus Y., Basile M., Liu J., Zhu J., Tu Z., Cox N. J., Nicolae D. L., Gamazon E. R., Im H. K., Konkashbaev A., Pritchard J., Stevens M., Flutre T., Wen X., Dermitzakis E. T., Lappalainen T., Guigo R., Monlong J., Sammeth M., Koller D., Battle A., Mostafavi S., McCarthy M., Rivas M., Maller J., Rusyn I., Nobel A., Wright F., Shabalin A., Feolo M., Sharopova N., Sturcke A., Paschal J., Anderson J. M., Wilder E. L., Derr L. K., Green E. D., Struewing J. P., Temple G., Volpi S., Boyer J. T., Thomson E. J., Guyer M. S., Ng C., Abdallah A., Colantuoni D., Insel T. R., Koester S. E., Little A. R., Bender P. K., Lehner T., Yao Y., Compton C. C., Vaught J. B., Sawyer S., Lockhart N. C., Demchok J., Moore H. F., Nat. Genet. 45, 580–585 (2013). doi:10.1038/ng.2653
- GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020). doi:10.1126/science.aaz1776