Published June 8, 2022 | Version 1.0.0
Dataset Open

Dataset for tumor infiltrating lymphocyte classification (304,097 image patches from TCGA)

Description

This is a dataset of images with or without tumor-infiltrating lymphocytes (TILs). The original images are from Abousamra et al. (2022) and Saltz et al. (2018), and the original whole slide images are from TCGA. This dataset is a subset of the data presented in Abousamra et al. (2022) (with new data partitions).

If you use this dataset, please cite the following papers, as well as this Zenodo page.

Abousamra, S., Gupta, M. D., Hou, L., Batiste, R., Zhao, T., Shankar, A., Rao, A., Chen, C., Samaras, D., Kurc, T., & Saltz, J. (2022). Deep Learning-Based Mapping of Tumor Infiltrating Lymphocytes in Whole Slide Images of 23 Types of Cancer. Frontiers in Oncology, 5971. https://doi.org/10.3389/fonc.2021.806603

Saltz, J., Gupta, R., Hou, L., Kurc, T., Singh, P., Nguyen, V., Samaras, D., Shroyer, K. R., Zhao, T., Batiste, R., & Danilova, L. (2018). Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Reports, 23(1), 181-193.

 

The acknowledgements from the Frontiers in Oncology and Cell Reports papers are included below:

This work was supported by the National Institutes of Health (NIH) and National Cancer Institute (NCI) grants UH3-CA22502103, U24-CA21510904, 1U24CA180924-01A1, 3U24CA215109-02, and 1UG3CA225021-01 as well as generous private support from Bob Beals and Betsy Barton. AR and AS were partially supported by NCI grant R37-CA214955 (to AR), the University of Michigan (U-M) institutional research funds and also supported by ACS grant RSG-16-005-01 (to AR). AS was supported by the Biomedical Informatics & Data Science Training Grant (T32GM141746). This work was enabled by computational resources supported by National Science Foundation grant number ACI-1548562, providing access to the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center, and also a DOE INCITE award joint with the MENNDL team at the Oak Ridge National Laboratory, providing access to Summit high performance computing system. The funders were not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

 

We are grateful to all the patients and families who contributed to this study. Funding from the Cancer Research Institute is gratefully acknowledged, as is support from National Cancer Institute (NCI) through U54 HG003273, U54 HG003067, U54 HG003079, U24 CA143799, U24 CA143835, U24 CA143840, U24 CA143843, U24 CA143845, U24 CA143848, U24 CA143858, U24 CA143866, U24 CA143867, U24 CA143882, U24 CA143883, U24 CA144025, P30 CA016672, U24CA180924, U24CA210950, U24CA215109, NCI Contract HHSN261201400007C, and Leidos Biomedical Contract 14X138. A.U.K.R. and P.S. were supported by CCSG Bioinformatics Shared Resource P30 CA01667, ITCR U24 Supplement 1U24CA199461-01, a gift from Agilent Technologies, CPRIT RP150578, and a Research Scholar Grant from the American Cancer Society (RSG-16-005-01). This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation XSEDE Science Gateways program under grant ACI-1548562 allocation TG-ASC130023. The authors would like to thank Stony Brook Research Computing and Cyberinfrastructure and the Institute for Advanced Computational Science at Stony Brook University for access to the high-performance LIred and SeaWulf computing systems, the latter of which was supported by National Science Foundation grant (#1531492).

------------------------------------

This dataset includes 304,097 image patches. All images are 100 x 100 pixels at 0.5 micrometers per pixel. An image is TIL-positive if there are at least two TILs present.

Refer to `images-tcga-tils-metadata.csv` for information about each image. That spreadsheet has the following columns:

`partition,study,barcode,label,path,md5`

`partition` specifies which partition the image belongs to (train, val, or test). `study` is the TCGA study the image comes from (e.g., acc for TCGA-ACC). `barcode` is the TCGA participant barcode; it is used during partitioning to ensure that images from the same participant do not appear in different data partitions. `label` is either til-negative or til-positive; an image is til-positive if it contains at least two TILs. `path` is the path to the image; all images are stored as PNG. `md5` is the md5 hash of the image, which can be used to detect duplicate images and to verify the integrity of images.
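The two checks that the metadata columns make possible (no participant shared across partitions, no duplicate images) can be sketched in a few lines of Python. This is a minimal illustration rather than code shipped with the dataset: the helper names `read_metadata`, `leaked_barcodes`, and `duplicate_md5s` are hypothetical, and the sample rows are made up to match the documented column order.

```python
import csv
import io
from collections import defaultdict


def read_metadata(handle):
    """Read metadata rows (dicts keyed by the CSV header) from an open text file."""
    return list(csv.DictReader(handle))


def leaked_barcodes(rows):
    """Return barcodes that appear in more than one partition (expected: none)."""
    partitions = defaultdict(set)
    for row in rows:
        partitions[row["barcode"]].add(row["partition"])
    return {b: sorted(p) for b, p in partitions.items() if len(p) > 1}


def duplicate_md5s(rows):
    """Return md5 hashes that are shared by more than one path."""
    paths = defaultdict(list)
    for row in rows:
        paths[row["md5"]].append(row["path"])
    return {h: p for h, p in paths.items() if len(p) > 1}


# Made-up sample in the documented column order (NOT real dataset rows).
SAMPLE = """partition,study,barcode,label,path,md5
train,acc,BARCODE-A,til-negative,a/1.png,aaa
train,acc,BARCODE-A,til-positive,a/2.png,bbb
test,acc,BARCODE-B,til-positive,b/1.png,ccc
"""

rows = read_metadata(io.StringIO(SAMPLE))
print(leaked_barcodes(rows))  # empty dict: no participant spans two partitions
print(duplicate_md5s(rows))   # empty dict: every hash is unique in the sample
```

To run the same checks on the real spreadsheet, pass an open file instead of the sample, for example `open("images-tcga-tils-metadata.csv", newline="")`; the location of the CSV relative to the unpacked archive is an assumption here.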

There are study-specific directories in the directory `images-tcga-tils`, and there is a directory named `pancancer` that includes images from all the included TCGA studies. That directory uses symlinks to avoid storing duplicate data.
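Because the `pancancer` directory holds symlinks rather than copies, a file reached through it should hash to the same md5 as the file in its study-specific directory. The sketch below illustrates this on a throwaway temporary directory. The function names `md5_of_file` and `describe` are hypothetical, the file contents are placeholder bytes rather than a real PNG, and the layout inside the temporary directory only imitates the structure described above.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file; open() follows symlinks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe(path):
    """Return (is_symlink, resolved_target, md5) for one image path."""
    p = Path(path)
    return p.is_symlink(), p.resolve(), md5_of_file(p)


# Imitate the described layout: a study directory plus a pancancer symlink.
with tempfile.TemporaryDirectory() as root:
    original = Path(root, "acc", "patch.png")
    original.parent.mkdir()
    original.write_bytes(b"placeholder bytes, not a real PNG")

    link = Path(root, "pancancer", "patch.png")
    link.parent.mkdir()
    os.symlink(original, link)  # may need extra privileges on Windows

    is_link, target, digest = describe(link)
    same_target = target == original.resolve()
    same_digest = digest == md5_of_file(original)
    print(is_link, same_target, same_digest)
```

On the real data, the digest returned for each image can be compared with the `md5` column of `images-tcga-tils-metadata.csv` to verify integrity.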

 

Files

Files (6.9 GB)

md5:42d387c2f8d300dea8bb8a544b43ad3c (6.9 GB)

Additional details

Related works

Is part of
Journal article: 10.3389/fonc.2021.806603 (DOI)
Journal article: 10.1016/j.celrep.2018.03.086 (DOI)