Project background

  • Project owner/contact: Raught et al.

  • Project description


Targets & cancer associations

  • Each protein in the target set is annotated with:






Poorly or uncharacterized targets

  • The aim of this section is to highlight poorly characterized genes or genes with unknown function in the target set
  • A set of uncharacterized/poorly characterized human protein-coding genes (n = 2819) has been established based on:
    1. Genes specifically designated as uncharacterized or as open reading frames
    2. Missing gene function summary in NCBI Gene AND function summary in UniProt Knowledgebase
    3. Missing or limited (<= 2) gene ontology (GO) annotations with respect to molecular function (MF) and biological process (BP)
  • Target genes found within the set of poorly characterized genes are listed below, colored in varying shades of red according to the level of missing characterization (from "unknown function" to "poorly defined function")
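
The three criteria above can be evaluated per gene as in the following minimal sketch. The record layout, the field names and the idea of counting how many criteria are met are assumptions made for illustration; the report does not state how oncoEnrichR combines the criteria into a level of missing characterization.

```python
import re
from dataclasses import dataclass


@dataclass
class GeneAnnotation:
    # Hypothetical record layout; field names are illustrative only
    symbol: str
    name: str
    ncbi_summary: str
    uniprot_summary: str
    n_go_mf_bp: int  # number of GO terms for molecular function + biological process


def characterization_flags(gene: GeneAnnotation) -> dict:
    """Evaluate each of the three criteria independently."""
    return {
        # 1. designated as uncharacterized or as an open reading frame
        "designated_uncharacterized": bool(
            re.search(r"uncharacterized|open reading frame", gene.name, re.IGNORECASE)
        ),
        # 2. function summary missing in NCBI Gene AND in UniProtKB
        "missing_summaries": not gene.ncbi_summary.strip()
        and not gene.uniprot_summary.strip(),
        # 3. missing or limited (<= 2) GO annotations (MF and BP)
        "limited_go": gene.n_go_mf_bp <= 2,
    }


def n_criteria_met(gene: GeneAnnotation) -> int:
    """Assumed proxy for the level of missing characterization (0-3)."""
    return sum(characterization_flags(gene).values())


# Placeholder genes, not real annotations
unknown = GeneAnnotation("GENE_X", "uncharacterized protein X", "", "", 1)
known = GeneAnnotation("GENE_Y", "kinase Y", "Some curated summary.", "", 5)
```

A gene meeting all three criteria would sit at the "unknown function" end of the color scale under this assumed scoring.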






Drug-target associations

  • Each protein in the target set is annotated with:
    • Targeted cancer drugs (inhibitors/antagonists), as found through the Open Targets Platform
    • We distinguish between drugs in early clinical development/phase (ep), and drugs already in late clinical development/phase (lp)
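
As a rough illustration of the ep/lp split, the sketch below groups drugs per target by clinical phase. The phase boundaries (1-2 early, 3-4 late) follow the network legend in this report; the tuple layout and all identifiers are placeholders, not Open Targets Platform field names.

```python
from collections import defaultdict


def phase_category(max_phase: int) -> str:
    """Map a clinical phase (1-4) to 'ep' (early) or 'lp' (late)."""
    if max_phase in (1, 2):
        return "ep"
    if max_phase in (3, 4):
        return "lp"
    raise ValueError(f"unexpected clinical phase: {max_phase}")


def drugs_by_target(records):
    """Group (target, drug, max_phase) tuples into {target: {'ep': [...], 'lp': [...]}}."""
    grouped = defaultdict(lambda: {"ep": [], "lp": []})
    for target, drug, max_phase in records:
        grouped[target][phase_category(max_phase)].append(drug)
    return dict(grouped)


# Placeholder identifiers, not real drug-target pairs
records = [("GENE_A", "drug_1", 2), ("GENE_A", "drug_2", 4), ("GENE_B", "drug_3", 1)]
result = drugs_by_target(records)
```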






Protein complexes

  • Each protein in the target set is annotated with:
    • Known protein complexes as found in CORUM
    • The complexes are ranked according to the total number of participating members in the target set
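
The ranking described above amounts to counting, for each complex, how many of its members occur in the target set. A minimal sketch follows, assuming complexes are available as a mapping from complex name to member gene symbols (a hypothetical layout, not the CORUM file format):

```python
def rank_complexes(complexes, targets):
    """Rank complexes by the number of members found in the target set.

    Complexes without any target member are dropped; ties are broken by name.
    """
    hits = {name: len(members & targets) for name, members in complexes.items()}
    ranked = [(name, n) for name, n in hits.items() if n > 0]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Placeholder complexes and gene symbols
complexes = {
    "complex_1": {"GENE_A", "GENE_B", "GENE_X"},
    "complex_2": {"GENE_A"},
    "complex_3": {"GENE_Y"},
}
targets = {"GENE_A", "GENE_B", "GENE_C"}
ranking = rank_complexes(complexes, targets)
```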






Subcellular structures/compartments


  • The target set is annotated with data from ComPPI, a database of subcellular localization data for human proteins, and results are here presented in two different views:

    1. A subcellular anatogram - acting as a “heatmap” of subcellular structures associated with proteins in the target set
      • Compartments are here limited to the key compartments (n = 24) defined within the gganatogram package
      • An accompanying legend is also provided - depicting the locations of the various subcellular structures
    2. A subcellular data browser
      • All subcellular compartment annotations per protein in the target set (“By Gene”)
      • All unique subcellular compartment annotations (unfiltered) and their target members (“By Compartment”)
      • Subcellular compartment annotations per gene are provided with a confidence level - indicating the number of different sources that support the compartment annotation
        • Minimum confidence level set by user: 1



Subcellular anatogram


Heatmap - target set


  • In the image below, value refers to the fraction of target genes that are annotated with a particular compartment/subcellular structure
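
The heatmap value can be sketched as follows. The input layout (gene, compartment, number of supporting sources) is an assumption, and applying the minimum confidence level to the heatmap is also an assumption; the report states that filter only for the "By Compartment" view.

```python
from collections import defaultdict


def compartment_fractions(annotations, n_targets, min_confidence=1):
    """Fraction of target genes annotated with each compartment.

    annotations: iterable of (gene, compartment, n_sources) tuples.
    Only annotations supported by at least min_confidence sources are counted.
    """
    genes_by_compartment = defaultdict(set)
    for gene, compartment, n_sources in annotations:
        if n_sources >= min_confidence:
            genes_by_compartment[compartment].add(gene)
    return {c: len(genes) / n_targets for c, genes in genes_by_compartment.items()}


# Placeholder annotations for a target set of four genes
annotations = [
    ("GENE_A", "Nucleus", 3),
    ("GENE_B", "Nucleus", 1),
    ("GENE_A", "Cytosol", 1),
    ("GENE_C", "Mitochondrion", 2),
]
all_sources = compartment_fractions(annotations, n_targets=4, min_confidence=1)
two_sources = compartment_fractions(annotations, n_targets=4, min_confidence=2)
```

Raising the minimum confidence level from 1 to 2 drops annotations backed by a single source, so fractions can only stay equal or decrease.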



Legend - subcellular structures


Subcellular data browser


By Gene





By Compartment
  • Genes listed per compartment are calculated using only compartment annotations with a minimum confidence level of: 1 (number of sources)






Tissue and cell type enrichment



Target set - tissue specificity

  • Genes have been classified, based on mean expression (across samples) per tissue in GTEx, into distinct specificity categories (algorithm developed within HPA):
    • Not detected: Genes with a mean expression level less than 1 (TPM < 1) across all the tissues.
    • Tissue enriched: Genes with a mean expression level greater than or equal to 1 (TPM >= 1) that also have at least four-fold higher expression levels in a particular tissue compared to all other tissues.
    • Group enriched: Genes with a mean expression level greater than or equal to 1 (TPM >= 1) that also have at least four-fold higher expression levels in a group of 2-5 tissues compared to all other tissues, and that are not considered Tissue enriched.
    • Tissue enhanced: Genes with a mean expression level greater than or equal to 1 (TPM >= 1) that also have at least four-fold higher expression levels in a particular tissue compared to the average levels in all other tissues, and that are not considered Tissue enriched or Group enriched.
    • Low tissue specificity: Genes with an expression level greater than or equal to 1 (TPM >= 1) across all of the tissues that are not in any of the above 4 groups.
    • Mixed: Genes that are not assigned to any of the above 5 groups.
  • Enrichment of specific tissues in the target set (with respect to tissue-specific gene expression) is performed with TissueEnrich
    • Only tissues that are enriched with an adjusted (Benjamini-Hochberg) p-value < 0.05 are listed
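
The category definitions above can be read as a decision cascade. The sketch below is one possible interpretation, not the HPA implementation: it assumes that "four-fold higher than all other tissues" means at least four times the highest remaining tissue, that every member of an enriched group must itself be detected (TPM >= 1), and that the categories are tested in the order listed.

```python
def classify_specificity(expression, fold=4.0, max_group=5):
    """Classify a gene from its mean expression per tissue (dict: tissue -> TPM)."""
    values = sorted(expression.values(), reverse=True)
    top = values[0]
    if top < 1:
        return "Not detected"
    # Top tissue at least `fold` times the second-highest tissue
    if len(values) > 1 and top >= fold * values[1]:
        return "Tissue enriched"
    # Group of k = 2..max_group tissues: weakest member vs. strongest non-member
    for k in range(2, min(max_group, len(values) - 1) + 1):
        if values[k - 1] >= 1 and values[k - 1] >= fold * values[k]:
            return "Group enriched"
    # Top tissue at least `fold` times the average of all other tissues
    others = values[1:]
    if others and top >= fold * (sum(others) / len(others)):
        return "Tissue enhanced"
    if values[-1] >= 1:
        return "Low tissue specificity"
    return "Mixed"


# Placeholder expression profiles
enriched = classify_specificity({"liver": 40, "lung": 2, "brain": 1})
group = classify_specificity({"liver": 40, "kidney": 30, "lung": 2, "brain": 1})
enhanced = classify_specificity({"a": 20, "b": 6, "c": 3, "d": 3, "e": 3, "f": 3, "g": 3})
low = classify_specificity({"a": 5, "b": 4, "c": 6})
mixed = classify_specificity({"a": 5, "b": 4, "c": 3, "d": 0.9})
```

The same cascade would apply to cell types by passing NX values and `max_group=10`.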





Tissue specificities per target gene




Tissue enrichment - target set

  • Considering the tissue specificities of members of the target set, NO TISSUES are enriched (adjusted p-value < 0.05) compared to the background set.





Target set - cell type specificity

  • Genes have been classified, based on mean expression (across samples) per cell type, into distinct specificity categories (algorithm developed within HPA):
    • Not detected: Genes with a mean expression level less than 1 (NX < 1) across all the cell types.
    • Cell type enriched: Genes with a mean expression level greater than or equal to 1 (NX >= 1) that also have at least four-fold higher expression levels in a particular cell type compared to all other cell types.
    • Group enriched: Genes with a mean expression level greater than or equal to 1 (NX >= 1) that also have at least four-fold higher expression levels in a group of 2-10 cell types compared to all other cell types, and that are not considered Cell type enriched.
    • Cell type enhanced: Genes with a mean expression level greater than or equal to 1 (NX >= 1) that also have at least four-fold higher expression levels in a particular cell type compared to the average levels in all other cell types, and that are not considered Cell type enriched or Group enriched.
    • Low cell type specificity: Genes with an expression level greater than or equal to 1 (NX >= 1) across all of the cell types that are not in any of the above 4 groups.
    • Mixed: Genes that are not assigned to any of the above 5 groups.
  • Enrichment of specific cell types in the target set (with respect to cell type-specific gene expression) is performed with TissueEnrich
    • Only cell types that are enriched with an adjusted (Benjamini-Hochberg) p-value < 0.05 are listed
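
For reference, the Benjamini-Hochberg adjustment behind the p-value < 0.05 filter can be sketched in a few lines. This is the standard textbook procedure written out by hand; it is not taken from the TissueEnrich source code.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Placeholder raw p-values for four hypothetical cell types
raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
listed = [i for i, p in enumerate(adj) if p < 0.05]
```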





Cell type specificities per target gene




Cell type enrichment - target set



  • Considering the cell-type specificities of members of the target set, NO CELL TYPES are enriched (adjusted p-value < 0.05) compared to the background set.





Protein-protein interaction network

  • The target set is queried against known protein-protein interactions, as evident from the STRING API (v11)
    • Note that interactions in STRING are assembled from multiple sources, including co-expression, co-occurrence in the literature, experimental data, curated databases, etc.
    • In addition to potential interactions within the target set, the network is expanded with n = 50 proteins that interact with proteins in the target set
    • The network is here restricted to interactions with a STRING association score >= 900 (range 0-1000)
    • Drugs added to the network: TRUE
    • Three different views are shown
      • Complete protein-protein interaction network, also showing proteins with no known interactions
      • Network community structures, as detected by the fast greedy modularity optimization algorithm by Clauset et al.
      • Network centrality/hub scores per node, as measured by Kleinberg’s score
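
To make the score cutoff and the hub view more concrete, the sketch below filters interactions at a minimum association score and computes Kleinberg-style hub scores by power iteration, scaled so that the top node scores 1. It is a plain-Python illustration under the assumption of an undirected network, not the code used to build the report; community detection is not shown.

```python
from collections import defaultdict


def hub_scores(edges, min_score=900, n_iter=100):
    """Hub scores for an undirected network, keeping edges with score >= min_score.

    edges: iterable of (protein_a, protein_b, association_score) tuples.
    """
    neighbors = defaultdict(set)
    for a, b, score in edges:
        if score >= min_score:
            neighbors[a].add(b)
            neighbors[b].add(a)
    hubs = {node: 1.0 for node in neighbors}
    for _ in range(n_iter):
        # Authority update followed by hub update (Kleinberg's HITS iteration)
        auth = {n: sum(hubs[m] for m in neighbors[n]) for n in neighbors}
        hubs = {n: sum(auth[m] for m in neighbors[n]) for n in neighbors}
        top = max(hubs.values())
        hubs = {n: value / top for n, value in hubs.items()}
    return hubs


# Placeholder interactions: a triangle (P1, P2, P3), a pendant node (P4),
# and one low-confidence edge that is filtered out (P4-P5)
edges = [
    ("P1", "P2", 950), ("P1", "P3", 990), ("P2", "P3", 920),
    ("P1", "P4", 910), ("P4", "P5", 400),
]
hubs = hub_scores(edges)
```

In this toy network P1 has the most high-confidence partners and receives the top hub score, while P5 is absent because its only interaction falls below the cutoff.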


  • Network legend:
    • Target set proteins are shaped as circles, other interacting proteins are shaped as rectangles (note that sizes of nodes do not carry any value), drugs are shaped as diamonds
    • Tumor suppressor genes (annotated from CancerMine) are HIGHLIGHTED IN RED
    • Proto-oncogenes (annotated from CancerMine) are HIGHLIGHTED IN GREEN
    • Genes predicted to have a dual role as proto-oncogenes/tumor suppressors (annotated from CancerMine) are HIGHLIGHTED IN BLACK
    • Targeted cancer drugs (from Open Targets Platform):
      • Compounds in late (3-4) clinical phases are HIGHLIGHTED IN ORANGE
      • Compounds in early (1-2) clinical phases are HIGHLIGHTED IN PURPLE


  • Use the mouse to zoom in/out or alter the position of nodes, and mouse over nodes to view gene names/drug mechanisms of action (with indications)



Complete network



Network communities



Network hubs




Function and pathway enrichment


  • Enrichment/overrepresentation test settings
    • P-value cutoff: 0.05
    • Q-value cutoff: 0.2
    • Correction for multiple testing: BH
    • Minimal size of genes annotated by term for testing: 10
    • Maximal size of genes annotated by term for testing: 500
    • Background genes: All protein-coding genes
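
The settings above parameterize an over-representation test. The sketch below assumes a one-sided hypergeometric test, which is the usual choice for this kind of analysis, and shows how the term-size limits and the background size enter; function and parameter names are illustrative and do not mirror the clusterProfiler API.

```python
from math import comb


def term_is_testable(term_size, min_size=10, max_size=500):
    """Apply the minimal/maximal term size settings."""
    return min_size <= term_size <= max_size


def overrepresentation_pvalue(n_background, n_term, n_targets, n_overlap):
    """P(X >= n_overlap) when drawing n_targets genes from n_background genes,
    of which n_term are annotated with the term (hypergeometric upper tail)."""
    total = comb(n_background, n_targets)
    upper = min(n_term, n_targets)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_targets - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 annotated, 3 targets, 2 of them annotated
p_small = overrepresentation_pvalue(10, 4, 3, 2)
```

The resulting p-values would then be corrected for multiple testing (BH) and filtered with the p-value and q-value cutoffs listed above.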



Gene Ontology





Molecular Signatures Database (MSigDB)





KEGG





WikiPathways





TCGA aberration frequency



SNVs/InDels

  • Somatic SNVs/InDels in the target genes (top mutated) are illustrated with oncoplots
  • Gene mutation frequencies are sorted by type of diagnosis (i.e. cancer subtypes)



Breast

Colon/Rectum

Lung

Skin

Esophagus/Stomach

Cervix

Prostate

Ovary/Fallopian Tube

Uterus

Pancreas

Soft Tissue

Myeloid

CNS/Brain

Liver

Kidney

Lymphoid

Head and Neck

Thyroid

Biliary Tract

Bladder/Urinary Tract

Pleura

Copy number alterations

  • Genes targeted by somatic copy number alterations (sCNAs) in tumor samples have been retrieved from TCGA, where copy number state has been estimated with GISTIC
  • Gene aberration frequencies are plotted across two categories of mutation types
    1. sCNA - amplifications
    2. sCNA - homozygous deletions
  • Frequency is plotted per primary site as
    • percent_mutated (percent of all tumor samples with the gene amplified/lost)
    • genes in the plot are ranked according to alteration frequency across all sites (pancancer), limited to the top 75 genes in the target set
  • Frequencies across all subtypes per primary site are listed in an interactive table
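
As a small worked example of the quantities above, percent_mutated and the pancancer ranking could be computed as follows. The input layout (number of altered samples per gene) and the helper names are assumptions for illustration.

```python
def percent_mutated(n_altered, n_samples):
    """Percent of tumor samples in which the gene is amplified (or lost)."""
    return 100.0 * n_altered / n_samples


def rank_genes(altered_counts, n_samples, top_n=75):
    """Rank genes by alteration frequency and keep the top_n."""
    freqs = {gene: percent_mutated(n, n_samples) for gene, n in altered_counts.items()}
    ranked = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Placeholder counts of amplified samples across 1000 tumor samples
amplified = {"GENE_A": 120, "GENE_B": 30, "GENE_C": 75}
top = rank_genes(amplified, n_samples=1000, top_n=2)
```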










TCGA co-expression

  • Using RNA-seq data from ~10,000 tumor samples in TCGA, a co-expression correlation matrix (Pearson rank correlation coefficient) was calculated, indicating pairs of genes that have their expression patterns correlated in tumors
  • Here, we are showing, across the main primary tumor sites in TCGA:
    • Tumor suppressor genes, proto-oncogenes or cancer driver genes with a strong/very strong (r >= 0.6 or r <= -0.6 ) correlation to genes in the target set
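
The thresholding step can be sketched as below, assuming the correlation coefficients have already been computed and are available as (target gene, partner gene, r) tuples; this layout and the function name are hypothetical.

```python
def strong_correlations(pairs, threshold=0.6):
    """Split gene pairs into strong positive and strong negative correlations.

    pairs: iterable of (target_gene, partner_gene, r) tuples.
    """
    positive = sorted((p for p in pairs if p[2] >= threshold), key=lambda p: -p[2])
    negative = sorted((p for p in pairs if p[2] <= -threshold), key=lambda p: p[2])
    return positive, negative


# Placeholder correlations between target genes and known cancer genes
pairs = [
    ("GENE_A", "TSG_1", 0.72),
    ("GENE_A", "ONC_1", 0.31),
    ("GENE_B", "ONC_2", -0.65),
    ("GENE_C", "TSG_2", 0.60),
]
positive, negative = strong_correlations(pairs)
```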



Positive correlation





Negative correlation







TCGA prognostic associations

  • Based on data from the Human Protein Atlas - Pathology Atlas, we are here listing significant results from correlation analyses of mRNA expression levels of human genes in tumor tissue and the clinical outcome (survival) for ~8,000 cancer patients (TCGA)
  • All correlation analyses have been performed in a gene-centric manner, and associations are only shown for genes in the target set. We separate between
    • Favorable associations: high expression of a given gene is associated with better survival
    • Unfavorable associations: high expression of a given gene is associated with worse survival
  • The strength of each association is given by its p-value (only associations with a p-value <= 0.001 are shown). In addition, we provide a percentile rank for each association, considering
    1. all significant (p-value <= 0.001) associations across all tumor sites, and
    2. only significant (p-value <= 0.001) associations found in the same tumor site
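
The report does not spell out how the percentile rank is computed. One plausible reading, sketched below purely as an assumption, is the percentage of associations in the reference set whose p-value is at least as large as the one in question, so that stronger associations receive higher ranks. The reference set would be either all significant associations or only those from the same tumor site.

```python
def percentile_rank(p_value, reference_p_values):
    """Percent of reference associations that are no stronger than this one
    (i.e. have a p-value greater than or equal to p_value)."""
    n_no_stronger = sum(1 for q in reference_p_values if q >= p_value)
    return 100.0 * n_no_stronger / len(reference_p_values)


# Placeholder p-values (all <= 0.001) for associations across all tumor sites
all_sites = [1e-3, 1e-4, 1e-5, 1e-6]
rank_all = percentile_rank(1e-5, all_sites)
```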



Favorable associations





Unfavorable associations







CRISPR/Cas9 loss-of-fitness

  • In Project Score, systematic genome-scale CRISPR/Cas9 drop-out screens are performed in a large number of highly-annotated cancer models to identify genes required for cell fitness in defined molecular contexts
  • Here, we are showing, across the main human tissue types:
    • Genes in the target set that are annotated with a statistically significant effect on cell fitness in any of the screened cancer cell lines (fitness score is here considered a quantitative measure of the reduction of cell viability elicited by a gene inactivation, via CRISPR/Cas9 targeting). The fitness score is computed based on the BAGEL and CRISPRCleanR algorithms.



Loss-of-fitness distribution




Loss-of-fitness table





Documentation

Annotation resources

The analysis performed in the oncoEnrichR report is based on the following main tools and knowledge resources:

  • Software
    • oncoEnrichR - R package for functional interrogation of genesets in the context of cancer (v0.8.3)
    • clusterProfiler - R package for comparing biological themes among gene clusters (v3.18.0)
    • tissueEnrich - R package used to calculate enrichment of tissue-specific genes in a set of input genes (v1.10.0)
    • visNetwork - R package for network visualization using vis.js library (v2.0.9)

  • Databases/datasets
    • STRING - Protein-protein interaction database (v11.0)
    • GENCODE - High quality reference gene annotation and experimental validation (v36)
    • TCGA - The Cancer Genome Atlas - Tumor gene expression and somatic DNA aberrations (v27.0 (October 29th 2020))
    • UniProtKB - Comprehensive resource of protein sequence and functional information (v2020_06)
    • CORUM - The comprehensive resource of mammalian protein complexes (v3.0 (20180903))
    • EFO - Experimental Factor Ontology (v3.26)
    • DiseaseOntology - Human Disease Ontology (2020-12-21)
    • ComPPI - Compartmentalized protein-protein interaction database (v2.1.1 (Oct 2018))
    • WikiPathways - A database of biological pathways maintained by and for the scientific community (20210110)
    • MSigDB - Molecular Signatures Database - collection of annotated gene sets (v7.2 (September 2020))
    • REACTOME - Manually curated and peer-reviewed pathway database (v73 (MSigDB v7.2))
    • GeneOntology - Knowledgebase that contains the largest structural source of information on the functions of genes (September 2020 (MSigDB v7.2))
    • KEGG - Collection of manually drawn pathway maps representing our knowledge on the molecular interaction, reaction and relation networks (20210118)
    • CancerMine - Literature-mined database of tumor suppressor genes/proto-oncogenes (v32 - 20210107)
    • NCG - Network of cancer genes - a web resource to analyze duplicability, orthology and network properties of cancer genes (v6.0)
    • Human Protein Atlas - Knowledge resource on human proteins in relation to tissue/cell type specificity and cancer prognosis (v20 - 20201119)
    • Genotype-Tissue Expression (GTEx) project - Ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation (v7)
    • Project Score - Database with systematic genome-scale CRISPR/Cas9 drop-out screens in a large number of highly-annotated cancer models (Release 1 (5th April 2019))
    • Open Targets Platform - Comprehensive and robust data integration for access to potential drug targets associated with disease (2020_11)

References

Ashburner, M, C A Ball, J A Blake, D Botstein, H Butler, J M Cherry, A P Davis, et al. 2000. “Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium.” Nat. Genet. 25 (1): 25–29. http://dx.doi.org/10.1038/75556.
Clauset, Aaron, M E J Newman, and Cristopher Moore. 2004. “Finding Community Structure in Very Large Networks.” Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 70 (6 Pt 2): 066111. http://dx.doi.org/10.1103/PhysRevE.70.066111.
Giurgiu, Madalina, Julian Reinhard, Barbara Brauner, Irmtraud Dunger-Kaltenbach, Gisela Fobo, Goar Frishman, Corinna Montrone, and Andreas Ruepp. 2019. “CORUM: The Comprehensive Resource of Mammalian Protein Complexes—2019.” Nucleic Acids Res. 47 (D1): D559–63. https://academic.oup.com/nar/article-abstract/47/D1/D559/5144160.
Hart, Traver, and Jason Moffat. 2016. “BAGEL: A Computational Framework for Identifying Essential Genes from Pooled Library Screens.” BMC Bioinformatics 17 (April): 164. http://dx.doi.org/10.1186/s12859-016-1015-8.
Iorio, Francesco, Fiona M Behan, Emanuel Gonçalves, Shriram G Bhosle, Elisabeth Chen, Rebecca Shepherd, Charlotte Beaver, et al. 2018. “Unsupervised Correction of Gene-Independent Cell Responses to CRISPR-Cas9 Targeting.” BMC Genomics 19 (1): 604. http://dx.doi.org/10.1186/s12864-018-4989-y.
Jain, Ashish, and Geetu Tuteja. 2019. “TissueEnrich: Tissue-Specific Gene Enrichment Analysis.” Bioinformatics 35 (11): 1966–67. http://dx.doi.org/10.1093/bioinformatics/bty890.
Joshi-Tope, G, M Gillespie, I Vastrik, P D’Eustachio, E Schmidt, B de Bono, B Jassal, et al. 2005. “Reactome: A Knowledgebase of Biological Pathways.” Nucleic Acids Res. 33 (Database issue): D428–32. http://dx.doi.org/10.1093/nar/gki072.
Kanehisa, M, and S Goto. 2000. “KEGG: Kyoto Encyclopedia of Genes and Genomes.” Nucleic Acids Res. 28 (1): 27–30. http://dx.doi.org/10.1093/nar/28.1.27.
Kelder, Thomas, Martijn P van Iersel, Kristina Hanspers, Martina Kutmon, Bruce R Conklin, Chris T Evelo, and Alexander R Pico. 2012. “WikiPathways: Building Research Communities on Biological Pathways.” Nucleic Acids Res. 40 (Database issue): D1301–7. http://dx.doi.org/10.1093/nar/gkr1074.
Kleinberg, Jon M. 1999. “Authoritative Sources in a Hyperlinked Environment.” J. ACM 46 (5): 604–32. http://doi.acm.org/10.1145/324133.324140.
Koscielny, Gautier, Peter An, Denise Carvalho-Silva, Jennifer A Cham, Luca Fumis, Rippa Gasparyan, Samiul Hasan, et al. 2017. “Open Targets: A Platform for Therapeutic Target Identification and Validation.” Nucleic Acids Res. 45 (D1): D985–94. http://dx.doi.org/10.1093/nar/gkw1055.
Mermel, Craig H, Steven E Schumacher, Barbara Hill, Matthew L Meyerson, Rameen Beroukhim, and Gad Getz. 2011. “GISTIC2.0 Facilitates Sensitive and Confident Localization of the Targets of Focal Somatic Copy-Number Alteration in Human Cancers.” Genome Biol. 12 (4): R41. http://dx.doi.org/10.1186/gb-2011-12-4-r41.
Petryszak, Robert, Maria Keays, Y Amy Tang, Nuno A Fonseca, Elisabet Barrera, Tony Burdett, Anja Füllgrabe, et al. 2016. “Expression Atlas Update—an Integrated Database of Gene and Protein Expression in Humans, Animals and Plants.” Nucleic Acids Res. 44 (D1): D746–52. https://academic.oup.com/nar/article-abstract/44/D1/D746/2502589.
Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proc. Natl. Acad. Sci. U. S. A. 102 (43): 15545–50. http://dx.doi.org/10.1073/pnas.0506580102.
Uhlen, Mathias, Cheng Zhang, Sunjae Lee, Evelina Sjöstedt, Linn Fagerberg, Gholamreza Bidkhori, Rui Benfeitas, et al. 2017. “A Pathology Atlas of the Human Cancer Transcriptome.” Science 357 (6352). http://dx.doi.org/10.1126/science.aan2507.
Uhlén, Mathias, Linn Fagerberg, Björn M Hallström, Cecilia Lindskog, Per Oksvold, Adil Mardinoglu, Åsa Sivertsson, et al. 2015. “Proteomics. Tissue-Based Map of the Human Proteome.” Science 347 (6220): 1260419. http://dx.doi.org/10.1126/science.1260419.
Von Mering, Christian, Lars J Jensen, Berend Snel, Sean D Hooper, Markus Krupp, Mathilde Foglierini, Nelly Jouffre, Martijn A Huynen, and Peer Bork. 2005. “STRING: Known and Predicted Protein–Protein Associations, Integrated and Transferred Across Organisms.” Nucleic Acids Res. 33 (suppl_1): D433–37. https://academic.oup.com/nar/article-abstract/33/suppl_1/D433/2505197.
Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS 16 (5): 284–87. http://dx.doi.org/10.1089/omi.2011.0118.



DISCLAIMER: The information contained in this report is more of an exploratory procedure than a statistical analysis. The final interpretation, i.e. putting the results in the context of the study/screen, should be made by biologists/analysts rather than by any tool.