Published March 17, 2023 | Version v2
Dataset Open

Intergenic RNAPII Atlas : output data

  • 1. Aix Marseille Univ, INSERM, TAGC, Marseille, France
  • 2. Aix Marseille Univ, INSERM, TAGC, Marseille, France, Aix Marseille Univ, CNRS, INSERM, CIML, Marseille, France

Description

This dataset is the RNAPII (RNAP2) Atlas of potentially transcribed intergenic regions of the human genome. It was built by integrating 906 high-quality human chromatin immunoprecipitation sequencing (ChIP-seq) biosamples targeting RNA Polymerase II, obtained from public data repositories.

The code is available on GitHub: https://github.com/benoitballester/Pol2Atlas

The dataset consists of 5 zipped folders described below: 

./pol2_consensuses/:
    consensuses.bed:
        Location of intergenic RNAP2 consensuses in bed format for the hg38 assembly. 
        First three columns are genomic locations, 4th column is consensus ID, 
        5th column is the number of datasets with RNAP2 observed at this RNAP2 consensus,
        6th column is strand (not used), 7th-8th columns are the consensus centroid.
    consensusesHg19.bed:
        Location of intergenic RNAP2 consensuses in the hg19 assembly. ~1000 are missing due to liftover.
        Consensus IDs match the hg38 ones.
    matrix.mtx:
        RNAP2 occupancy consensus-dataset binary matrix in sparse Matrix Market format.
        Row annotations are the RNAP2 consensuses.
        Column annotations are the datasets listed in datasets.txt.
    datasets.txt:
        Column annotation for matrix.mtx (see matrix.mtx).
    clusterConsensuses_Labels.txt:
        Assigned cluster for each RNAP2 consensus.
    intersectIntergPol2.tsv:
        RNAP2 consensuses with cluster ID and intersections with reference databases.
    cluster_bed/:
        consensuses.bed split per cluster.
    saf_files/:
        Files typically used for read counting with featureCounts. Suffixes:
        _500 : RNAP2 consensuses standardized to 1kbp.
        _all : All RNAP2 consensuses including genic.
        Hg19 : Intergenic RNAP2 consensuses lifted over to hg19.
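
The column layout described for consensuses.bed can be read with a few lines of Python. This is a minimal sketch, not code from the Pol2Atlas repository: it assumes the file is tab-separated, that the 5th, 7th and 8th columns are integers, and the names `Consensus`, `parse_consensus_line` and `load_consensuses` are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Consensus(NamedTuple):
    """One row of consensuses.bed, following the column description above."""
    chrom: str
    start: int
    end: int
    consensus_id: str    # kept as text; the ID type is an assumption
    n_datasets: int      # datasets with RNAP2 observed at this consensus
    strand: str          # present in the file but not used
    centroid_start: int
    centroid_end: int


def parse_consensus_line(line: str) -> Consensus:
    """Split one tab-separated BED line into typed fields."""
    f = line.rstrip("\n").split("\t")
    return Consensus(f[0], int(f[1]), int(f[2]), f[3],
                     int(f[4]), f[5], int(f[6]), int(f[7]))


def load_consensuses(lines: Iterable[str]) -> List[Consensus]:
    """Parse every non-empty line from an open file or a list of strings."""
    return [parse_consensus_line(line) for line in lines if line.strip()]


# Usage (path relative to the unzipped pol2_consensuses folder):
# with open("pol2_consensuses/consensuses.bed") as fh:
#     consensuses = load_consensuses(fh)
```

Keeping the list in file order matters, because the rows of matrix.mtx are annotated by these consensuses.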

 

./rnap2_all_peaks/:
    all_peaks.bed.gz:
        Concatenated bed file with all POLR2A peaks from all experiments, genome wide, for the hg38 assembly. 
        Peaks are filtered with a MACS2 q-value > 1e-5; datasets with fewer than 100 peaks in intergenic regions are removed.
        First three columns are genomic locations, 4th column contains the sample of origin of the peak, 5th column is the
        MACS2 q-value, 6th column is DNA strand (not used), 7th-8th columns are the peak "summit". 9th column contains an r,g,b value
        corresponding to the biotype of origin (Blood / Immune, Brain, Embryo...) for easy visualization in a genome browser.
        Legend is available in legend.png. Conversion table between rgb values and biotype in palette.csv.
        Note that singletons are removed when creating consensus peaks.
    all_peaks_interg.bed.gz:
        Same as above, but for intergenic regions only (excluding 1kb before TSS and 1kb after TES).
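
Because the peak files concatenate all experiments, a common first step is to group peaks by their sample of origin (4th column). The sketch below follows the nine-column description above; it assumes tab-separated fields and an r,g,b value written as three comma-separated integers, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def parse_peak_line(line: str) -> Dict[str, object]:
    """Turn one line of all_peaks.bed into a dict keyed by column meaning."""
    f = line.rstrip("\n").split("\t")
    r, g, b = (int(v) for v in f[8].split(","))
    return {
        "chrom": f[0],
        "start": int(f[1]),
        "end": int(f[2]),
        "sample": f[3],              # sample of origin of the peak
        "qvalue_field": float(f[4]), # MACS2 q-value column, stored as given
        "strand": f[5],              # not used
        "summit_start": int(f[6]),
        "summit_end": int(f[7]),
        "rgb": (r, g, b),            # biotype colour, see palette.csv
    }


def count_peaks_per_sample(lines: Iterable[str]) -> Counter:
    """Count how many peaks each sample contributes."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():
            counts[parse_peak_line(line)["sample"]] += 1
    return counts


# Usage: stream the gzipped file without unpacking it first.
# import gzip
# with gzip.open("rnap2_all_peaks/all_peaks_interg.bed.gz", "rt") as fh:
#     per_sample = count_peaks_per_sample(fh)
```

Streaming line by line keeps memory use low, which helps with the genome-wide file.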

 

./count_tables_rnaseq/:
    ENCODE/:
        counts.mtx.gz:
            Count table in sparse Matrix Market format. Rows correspond to samples, columns to Pol II probes (Pol2_500.saf).
        samples.csv.gz:
            Matching row annotation for count matrix.
        encode_total_rnaseq_annot.tsv.gz:
            Sample annotation (not ordered!).
    GTEx/:
        counts.mtx.gz:
            Count table in sparse Matrix Market format. Rows correspond to samples, columns to Pol II probes (Pol2_500.saf).
        samples.csv.gz:
            Matching row annotation for count matrix.
        sample_annot.tsv.gz:
            Sample annotation (not ordered!).
    TCGA/:
        counts.mtx.gz:
            Count table in sparse Matrix Market format. Rows correspond to samples, columns to Pol II probes (Pol2_500.saf).
        samples.csv.gz:
            Matching row annotation for count matrix.
        annotation_table.tsv.gz:
            Sample annotation (not ordered!).
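
The count tables use the Matrix Market coordinate layout: a `%%MatrixMarket` header, optional `%` comment lines, one size line (`rows cols nonzeros`), then one 1-based `row col value` entry per line. When SciPy is available, `scipy.io.mmread` is the usual way to load such files. The dependency-free sketch below only illustrates the layout and how row totals line up with the sample order; the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Entries = Dict[Tuple[int, int], float]


def read_mtx_coordinate(lines: Iterable[str]) -> Tuple[Tuple[int, int], Entries]:
    """Read a Matrix Market coordinate file into (shape, {(row, col): value})."""
    it = iter(lines)
    header = next(it)
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("not a Matrix Market coordinate file")
    shape = None
    for line in it:
        # Skip comments and blank lines until the size line is found.
        if line.startswith("%") or not line.strip():
            continue
        n_rows, n_cols, _n_nonzero = (int(x) for x in line.split())
        shape = (n_rows, n_cols)
        break
    if shape is None:
        raise ValueError("missing size line")
    entries: Entries = {}
    for line in it:
        parts = line.split()
        if not parts:
            continue
        # 'pattern' (binary) matrices list only indices; treat them as 1.
        value = float(parts[2]) if len(parts) > 2 else 1.0
        entries[(int(parts[0]) - 1, int(parts[1]) - 1)] = value
    return shape, entries


def row_totals(shape: Tuple[int, int], entries: Entries) -> List[float]:
    """Sum counts per row (one row per sample in the count tables)."""
    totals = [0.0] * shape[0]
    for (i, _j), value in entries.items():
        totals[i] += value
    return totals


# Usage:
# import gzip
# with gzip.open("count_tables_rnaseq/GTEx/counts.mtx.gz", "rt") as fh:
#     shape, entries = read_mtx_coordinate(fh)
```

The i-th total then belongs to the i-th entry of samples.csv.gz. The separate annotation tables are flagged as not ordered, so they would need to be joined on a sample identifier rather than by position.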
        


./cancer_markers/:
    bed/:
        DE_Tumor_vs_Normal/:
            TCGA-*/:
                allWithStats.bed:
                    FDR, mean difference in Pearson residuals and log2(FC) for each RNAP2 probe. Warning: probes are prefiltered to have > 1 read in 3 samples, so make sure to use the row index to match 
                    with RNAP2 consensuses.
                allDE.bed:
                    All DE (cancer vs normal) probes in bed format for this cancer. 
                    5th column has been replaced by enrichment p-value.
                DE_downreg.bed:
                    Downregulated (cancer vs normal) probes in bed format for this cancer. 
                    5th column has been replaced by enrichment p-value.
                DE_upreg.bed:
                    Upregulated (cancer vs normal) probes in bed format for this cancer. 
                    5th column has been replaced by enrichment p-value.
                classifier_TCGA-*:
                    Performance of a machine learning tumor-normal tissues classifier using Pol II probes as input.
            globally_DE.bed:
                Probes DE in 7+ cancers (FPR permutation threshold). Last column indicates the number of cancers this probe is DE in.
            globally_Down regulated.bed:
                Probes downregulated in 6+ cancers (FPR permutation threshold). Last column indicates the number of cancers in which this probe is downregulated.
            globally_Up regulated.bed:
                Probes upregulated in 5+ cancers (FPR permutation threshold). Last column indicates the number of cancers in which this probe is upregulated.
        subtypes/:
            BRCA/:
                allWithStats_BRCA.*.bed:
                    FDR, mean difference in Pearson residuals and log2(FC) for each RNAP2 probe, for the DE test of samples from this subtype against normal samples. 
                    Warning: probes are prefiltered to have > 1 read in 3 samples, so make sure to use the row index to match 
                    with RNAP2 consensuses.
                bed_BRCA.*.bed:
                    All DE (subtype vs normal) probes in bed format for this cancer. 
                bed_uniqueDE_BRCA.*.bed:
                    All DE (subtype vs normal) probes in bed format for this cancer and not DE in any other subtype. 

        TCGA_survival/:
            TCGA-*/:
                prognostic.bed:
                    All probes associated with survival for this cancer. 
                    5th column has been replaced by p-value.
                stats.csv:
                    Cox linear model statistics for each Pol II probe. 
                    Warning: probes are prefiltered to have > 1 read in 3 samples, so make sure to use the row index to match 
                    with RNAP2 consensuses.
            globally_prognostic.bed:
                Probes associated with survival in 5+ cancers (FPR permutation threshold). 5th column has been replaced with 
                the number of cancers this probe is associated with survival in.
    tabular/:
        Same as above but stored in a tabular binary format for DE and survival.
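
The prefiltering warnings above explain why the statistics files cannot be matched to consensuses.bed by position alone. The sketch below illustrates such a filter on a small samples-by-probes table; it assumes "> 1 read in 3 samples" means more than one read in at least three samples, and it is not the filter implementation from the Pol2Atlas code. The name `prefilter_probes` is hypothetical.

```python
from typing import List, Sequence


def prefilter_probes(counts: Sequence[Sequence[int]],
                     min_reads: int = 1,
                     min_samples: int = 3) -> List[int]:
    """Return the original column indices of the probes that pass the filter.

    counts: one row per sample, one column per probe.
    A probe passes when more than `min_reads` reads are seen in at
    least `min_samples` samples (assumed reading of the warning).
    """
    n_probes = len(counts[0]) if counts else 0
    kept = []
    for j in range(n_probes):
        n_passing = sum(1 for row in counts if row[j] > min_reads)
        if n_passing >= min_samples:
            kept.append(j)
    return kept


# Four samples, three probes: the middle probe is dropped.
example = [
    [5, 0, 2],
    [3, 1, 0],
    [2, 0, 9],
    [0, 0, 4],
]
kept = prefilter_probes(example)  # original indices of surviving probes
```

In this example the second surviving row corresponds to the third original probe, which is why the stored row index, not the line position, has to be used when joining back to the consensuses.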

 

./metacluster_markers/:
    bed/:
        allPol2_datasetCount:
            For each tissue, all Pol II consensuses, with 5th column indicating the number of
            datasets (RNAP2, GTEx, ENCODE, TCGA tumour and normal) in which the RNAP2 consensus is
            considered a marker.
        robust_2_datasets_per_tissue:
            For each tissue, Pol II consensuses considered marker in 2+ datasets out of 5 
            (RNAP2, GTEx, ENCODE, TCGA tumour and normal).
    tabular/:
        Each Pol II consensus with marker information stored in a binary format.
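
The two bed folders above are related by a threshold on the 5th column: the robust set keeps consensuses considered markers in 2+ of the 5 datasets. A minimal sketch of that selection, assuming tab-separated files with an integer 5th column; the function name is hypothetical, and the result is not guaranteed to reproduce the distributed robust files exactly.

```python
from typing import Iterable, List


def select_robust_markers(lines: Iterable[str],
                          min_datasets: int = 2) -> List[List[str]]:
    """Keep BED rows whose 5th column (dataset count) reaches the threshold."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        if int(fields[4]) >= min_datasets:
            kept.append(fields)
    return kept
```

Raising `min_datasets` gives a stricter marker set for a given tissue, at the cost of fewer consensuses.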
  

Files

cancer_markers.zip

Files (6.1 GB)

md5:dcaee780dcd233db2b5df7cdd42c064f  768.3 MB
md5:db630b6319b6dbdae2f704ade8802eb1  4.4 GB
md5:5fbed4ab60780777f45ae7d1a8f77816  36.9 MB
md5:bdcbd517e6993aedde19d67d14434dea  62.4 MB
md5:8d191f7f4a088b100e0db6c83ccb7b8b  7.0 kB
md5:84c7c6b8fa37c25c25db4f75059f864a  646.2 MB
md5:f565574a4671431b6b73f9454b664457  3.4 MB
md5:3ca0dec69af7d21d526011a6aa386e05  1.6 MB
md5:e7b4a0c46b94a1ac7266f702a8753fd9  134.2 MB
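
The md5 values listed for the files can be used to check a download. A minimal sketch using only the Python standard library; the function names are hypothetical, and since the listing here does not show which checksum belongs to which file name, that pairing has to be taken from the record page.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use constant, which matters for the multi-gigabyte archives.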

Additional details

Related works

Is supplemented by
Software: https://github.com/benoitballester/Pol2Atlas (URL)