Published August 9, 2022 | Version 2022-08-09
Dataset Open

TCGA HDF file pipeline

  • 1. University of Wisconsin - Madison

Description

HDF files containing data from The Cancer Genome Atlas (TCGA).

https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga

Please see the Broad Institute's TCGA data usage policy: https://broadinstitute.atlassian.net/wiki/spaces/GDAC/pages/844333156/Data+Usage+Policy

 

The HDF files were generated by the code in this repository: https://github.com/dpmerrell/tcga-pipeline

* tcga_omic.tar.gz contains multi-omic data for 10,000+ patients. This includes copy number variation, somatic mutation, methylation, gene expression, and RPPA data.

* tcga_clinical.tar.gz contains clinical annotations for those same patients (e.g., age, sex, survival, smoking).

See https://github.com/dpmerrell/tcga-pipeline/blob/main/README.md for more information about the data and its layout in the HDF5 files.
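As a starting point for exploring the files, the sketch below unpacks one of the archives and lists every dataset in an HDF5 file. It is a minimal sketch, not part of the pipeline: the helper names `extract_archive` and `describe_hdf` are hypothetical, the third-party `h5py` package is assumed to be installed, and nothing is assumed about the names of the files or datasets inside the archives (the README is the authority on the layout).

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive and return the paths of the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
        tar.extractall(dest)
    return [dest / name for name in names]


def describe_hdf(hdf_path):
    """Return (name, shape, dtype) for every dataset in an HDF5 file."""
    import h5py  # third-party; imported here so extraction works without it

    found = []

    def visit(name, obj):
        # Groups are skipped; only datasets carry a shape and dtype.
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape, obj.dtype))

    with h5py.File(hdf_path, "r") as f:
        f.visititems(visit)
    return found


# Example usage (assumes the archive has been downloaded to the working directory):
#
#   for path in extract_archive("tcga_clinical.tar.gz", "tcga_data"):
#       for name, shape, dtype in describe_hdf(path):
#           print(path.name, name, shape, dtype)
```

Listing the datasets first is a cheap way to confirm the layout before loading the much larger omic matrix into memory.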


Version notes:

2022-08-09: Fixed some bugs in string formatting. (Pipeline updated on this date; data uploaded on 2022-09-26 due to Zenodo technical issues.)

2021-12-06: **Significant changes**. `tcga_omic.hdf` is organized very differently. It also includes more kinds of data: (a) somatic mutation data and (b) full TCGA barcodes for each patient and omic type (useful for extracting batch information).

2021-03-17: Improved the naming convention for RPPA data features: {GENE}_{ANTIBODY}_rppa

2021-02-28: Improved the HDF file format. We provide one big matrix of data, rather than one matrix per cancer type. Cancer type is indicated by a vector (key="cancer_types"). Updated the Omic and Clinical HDFs accordingly.

2021-02-01: Added mutation annotation scores. Removed GRSN from the RPPA pipeline.

2021-01-24: Removed redundant/combination datasets (COADREAD, STES, GBMLGG, KIPAN). Applied Global Rank-Invariant Set normalization (GRSN) to RPPA data.
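Two conventions from the version notes above lend themselves to small parsers: the {GENE}_{ANTIBODY}_rppa feature names and the full TCGA barcodes. The following is a sketch under stated assumptions, not the pipeline's own code. It assumes that gene symbols contain no underscore, and that barcodes follow the standard seven-field aliquot layout (project, tissue source site, participant, sample and vial, portion and analyte, plate, center). The names `parse_rppa_feature` and `parse_barcode` and the example feature name are hypothetical.

```python
from typing import NamedTuple


class RppaFeature(NamedTuple):
    gene: str
    antibody: str


class Barcode(NamedTuple):
    project: str
    tss: str  # tissue source site
    participant: str
    sample: str  # two-digit sample type code
    vial: str
    portion: str
    analyte: str
    plate: str
    center: str


def parse_rppa_feature(name):
    """Split a '{GENE}_{ANTIBODY}_rppa' feature name into gene and antibody.

    Assumption: the gene symbol has no underscore, so the first underscore
    separates the gene from the antibody.
    """
    suffix = "_rppa"
    if not name.endswith(suffix):
        raise ValueError(f"not an RPPA feature name: {name!r}")
    gene, sep, antibody = name[: -len(suffix)].partition("_")
    if not sep or not gene or not antibody:
        raise ValueError(f"cannot split gene and antibody in {name!r}")
    return RppaFeature(gene, antibody)


def parse_barcode(barcode):
    """Split a full aliquot-level TCGA barcode into its fields."""
    parts = barcode.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 dash-separated fields in {barcode!r}")
    project, tss, participant, sample_vial, portion_analyte, plate, center = parts
    return Barcode(
        project, tss, participant,
        sample_vial[:2], sample_vial[2:],
        portion_analyte[:2], portion_analyte[2:],
        plate, center,
    )
```

For batch analysis, the `plate`, `center`, and `tss` fields are the usual candidates for batch covariates. Shorter barcodes (for example, patient-level ones) raise a `ValueError` here rather than being silently padded.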

Notes

Fixed some bugs in string formatting.

Files

Files (1.8 GB)

Size      Checksum
36.3 MB   md5:97b846fcf6dfcd3de85dc6100a191f1c
1.8 GB    md5:60d0918f534cd6e8fd6de5a40fd974c8
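
The MD5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using only the Python standard library; the helper names `md5sum` and `verify` are hypothetical. Files are read in chunks so that the 1.8 GB archive does not need to fit in memory.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Compare a file's MD5 against a value such as 'md5:97b8...' or a bare hex string."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5sum(path) == expected_hex


# Example usage (assumes the archive has been downloaded to the working directory):
#
#   ok = verify("tcga_omic.tar.gz", "md5:<checksum from the listing>")
```

The listing gives the checksums without file names, so match each one to the downloaded archive by its size.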