TCGA HDF file pipeline
Description
This record provides HDF5 files containing data from The Cancer Genome Atlas (TCGA): https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga
Please see the Broad Institute's TCGA data usage policy: https://broadinstitute.atlassian.net/wiki/spaces/GDAC/pages/844333156/Data+Usage+Policy
The HDF files were generated by the code in this repository: https://github.com/dpmerrell/tcga-pipeline
* tcga_omic.tar.gz contains multi-omic data for 10,000+ patients. This includes copy number variation, somatic mutation, methylation, gene expression, and RPPA data.
* tcga_clinical.tar.gz contains clinical annotations for those same patients, e.g., age, sex, survival, and smoking.
See https://github.com/dpmerrell/tcga-pipeline/blob/main/README.md for more information about the data and its layout in the HDF5 files.
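As a starting point for exploring the archives described above, here is a minimal sketch that unpacks a `.tar.gz` file and lists the datasets inside an HDF5 file. It assumes the third-party `h5py` package is installed. The function names `extract_archive` and `list_datasets` are illustrative and do not come from the pipeline repository, and the names of the files inside each archive are not assumed here; the README remains the authoritative description of the layout.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: reject unsafe members such as absolute paths.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return [dest / name for name in names]


def list_datasets(hdf_path):
    """Return (name, shape, dtype) for every dataset in an HDF5 file."""
    import h5py  # third-party; imported lazily so extraction works without it

    found = []

    def visit(name, obj):
        # Groups are skipped; only datasets carry array data.
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape, str(obj.dtype)))

    with h5py.File(hdf_path, "r") as f:
        f.visititems(visit)
    return found
```

Calling `extract_archive("tcga_clinical.tar.gz", "data")` and passing each returned path to `list_datasets` prints an inventory of keys, shapes, and dtypes without loading the matrices into memory, which matters for the 1.8 GB omic archive.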
Version notes:
2022-08-09: Fixed some bugs in string formatting. (Pipeline updated on this date; data uploaded on 2022-09-26 due to Zenodo technical issues.)
2021-12-06: **Significant changes**. `tcga_omic.hdf` is organized very differently. It also includes more kinds of data: (a) somatic mutation data and (b) full TCGA barcodes for each patient and omic type (useful for extracting batch information).
2021-03-17: Improved the naming convention for RPPA data features: `{GENE}_{ANTIBODY}_rppa`.
2021-02-28: Improved the HDF file format. The files now provide one large data matrix rather than one matrix per cancer type. Cancer type is indicated by a vector (`key="cancer_types"`). Updated the omic and clinical HDF files accordingly.
2021-02-01: Added mutation annotation scores. Removed GRSN from the RPPA pipeline.
2021-01-24: Removed redundant/combination datasets (COADREAD, STES, GBMLGG, KIPAN). Applied Global Rank-Invariant Set Normalization (GRSN) to RPPA data.
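Two naming conventions mentioned in the version notes can be handled with small helpers. The sketch below splits a full TCGA aliquot barcode into its standard dash-separated fields and splits an RPPA feature name of the form `{GENE}_{ANTIBODY}_rppa`. It rests on several assumptions: that the stored barcodes are full seven-field aliquot barcodes, that gene symbols contain no underscore (so the first underscore separates gene from antibody), and that plate and center are the fields of interest for batch information. The helper names and the example feature names in the usage note are made up for illustration.

```python
from typing import NamedTuple


class TcgaBarcode(NamedTuple):
    project: str
    tss: str          # tissue source site
    participant: str
    sample: str       # two-digit sample type code
    vial: str
    portion: str
    analyte: str
    plate: str
    center: str


def parse_barcode(barcode):
    """Split a full TCGA aliquot barcode into its named fields."""
    parts = barcode.strip().split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 dash-separated fields, got {len(parts)}: {barcode!r}")
    project, tss, participant, sample_vial, portion_analyte, plate, center = parts
    return TcgaBarcode(
        project, tss, participant,
        sample_vial[:2], sample_vial[2:],          # e.g. "01C" -> sample "01", vial "C"
        portion_analyte[:2], portion_analyte[2:],  # e.g. "01D" -> portion "01", analyte "D"
        plate, center,
    )


def parse_rppa_feature(name):
    """Split '{GENE}_{ANTIBODY}_rppa' into (gene, antibody).

    Assumes the gene symbol has no underscore; the antibody part may.
    """
    suffix = "_rppa"
    if not name.endswith(suffix):
        raise ValueError(f"not an RPPA feature name: {name!r}")
    gene, sep, antibody = name[: -len(suffix)].partition("_")
    if not (sep and gene and antibody):
        raise ValueError(f"cannot split gene and antibody in {name!r}")
    return gene, antibody
```

For example, `parse_barcode("TCGA-02-0001-01C-01D-0182-01")` yields plate `"0182"` and center `"01"`, and `parse_rppa_feature("EGFR_EGFR_pY1068_rppa")` yields `("EGFR", "EGFR_pY1068")`.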
Files
Total size: 1.8 GB

Checksum | Size |
---|---|
md5:97b846fcf6dfcd3de85dc6100a191f1c | 36.3 MB |
md5:60d0918f534cd6e8fd6de5a40fd974c8 | 1.8 GB |
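The MD5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using only the Python standard library. Because this listing does not pair each checksum with a file name, the check simply tests whether a file's digest matches either listed value. The names `md5_of_file` and `matches_listing` are illustrative.

```python
import hashlib

# MD5 digests copied from the file listing above.
LISTED_MD5 = {
    "97b846fcf6dfcd3de85dc6100a191f1c",
    "60d0918f534cd6e8fd6de5a40fd974c8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path):
    """Return True if the file's MD5 digest equals one of the listed checksums."""
    return md5_of_file(path) in LISTED_MD5
```

Reading in chunks keeps memory use flat even for the 1.8 GB archive. A call such as `matches_listing("tcga_omic.tar.gz")` returning `False` suggests a truncated or corrupted download.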