PheKnowLator Human Disease Knowledge Graphs - Build Data (Original)

Callahan, Tiffany J

doi:10.5281/zenodo.7026640

Published May 1, 2021 | Version v2.1.0_01MAY2021

Dataset Open

Callahan, Tiffany J¹

1. University of Colorado Anschutz Medical Campus

RELEASE V2.1.0 KNOWLEDGE GRAPH: ORIGINAL DATA SOURCES

Release: v2.1.0

The goal of this build was to create a knowledge graph that represented human disease mechanisms and included the central dogma. The data sources utilized in this release include many of the sources used in the initial release, as well as some new data made available by the Comparative Toxicogenomics Database and experimental data from the Human Protein Atlas.

Data sources are listed by type (Ontology and Data not represented in an ontology [Database Sources]). Additional details are provided for each data source below. Please see documentation on the primary release (https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) for additional details on each data source as well as citation information.

Data Access:

https://console.cloud.google.com/storage/browser/pheknowlator/archived_builds/release_v2.1.0/build_01MAY2021

ONTOLOGIES

Cell Ontology
Cell Line Ontology
Chemical Entities of Biological Interest (ChEBI) Ontology
Gene Ontology
Human Phenotype Ontology
Mondo Disease Ontology
Pathway Ontology
Protein Ontology
Relations Ontology
Sequence Ontology
Uber-Anatomy Ontology
Vaccine Ontology

Cell Ontology (CL)

Homepage: GitHub
Citation:

Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biology. 2005;6(2):R21

Usage: Utilized to connect transcripts and proteins to cells. Additionally, the edges between this ontology and its dependencies are utilized:

ChEBI
GO
PATO
PRO
RO
UBERON

Cell Line Ontology (CLO)

Homepage: http://www.clo-ontology.org/
Citation:

Sarntivijai S, Lin Y, Xiang Z, Meehan TF, Diehl AD, Vempati UD, Schürer SC, Pang C, Malone J, Parkinson H, Liu Y. CLO: the cell line ontology. Journal of Biomedical Semantics. 2014;5(1):37

Usage: Utilized this ontology to map cell lines to transcripts and proteins. Additionally, the edges between this ontology and its dependencies are utilized:

CL
DOID
NCBITaxon
UBERON

Chemical Entities of Biological Interest (ChEBI)

Homepage: https://www.ebi.ac.uk/chebi/
Citation:

Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research. 2015;44(D1):D1214-9

Usage: Utilized to connect chemicals to complexes, diseases, genes, GO biological processes, GO cellular components, GO molecular functions, pathways, phenotypes, reactions, and transcripts.

Gene Ontology (GO)

Homepage: http://geneontology.org/
Citations:

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25

The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research. 2018;47(D1):D330-8

Usage: Utilized to connect biological processes, cellular components, and molecular functions to chemicals, pathways, and proteins. Additionally, the edges between this ontology and its dependencies are utilized:

CL
NCBITaxon
RO
UBERON

Other Gene Ontology Data Used: goa_human.gaf.gz

Human Phenotype Ontology (HPO)

Homepage: https://hpo.jax.org/
Citation:

Köhler S, Carmody L, Vasilevsky N, Jacobsen JO, Danis D, Gourdine JP, Gargano M, Harris NL, Matentzoglu N, McMurry JA, Osumi-Sutherland D. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Research. 2018;47(D1):D1018-27

Usage: Utilized to connect phenotypes to chemicals, diseases, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized:

CL
ChEBI
GO
UBERON

Files

Other Human Phenotype Ontology Data Used: phenotype.hpoa

Mondo Disease Ontology (Mondo)

Homepage: https://mondo.monarchinitiative.org/
Citation:

Mungall CJ, McMurry JA, Köhler S, Balhoff JP, Borromeo C, Brush M, Carbon S, Conlin T, Dunn N, Engelstad M, Foster E. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Research. 2017;45(D1):D712-22

Usage: Utilized to connect diseases to chemicals, phenotypes, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized:

CL
NCBITaxon
GO
HPO
UBERON

Pathway Ontology (PW)

Homepage: rgd.mcw.edu
Citation:

Petri V, Jayaraman P, Tutaj M, Hayman GT, Smith JR, De Pons J, Laulederkind SJ, Lowry TF, Nigam R, Wang SJ, Shimoyama M. The pathway ontology–updates and applications. Journal of Biomedical Semantics. 2014;5(1):7.

Usage: Utilized to connect pathways to GO biological processes, GO cellular components, GO molecular functions, and Reactome pathways. Several steps are taken in order to connect Pathway Ontology identifiers to Reactome pathways and GO biological processes. To connect Pathway Ontology identifiers to Reactome pathways, we use ComPath Pathway Database Mappings developed by Daniel Domingo-Fernández (PMID:30564458).

Files

Downloaded Mapping Data
- curated_mappings.txt
- kegg_reactome.csv
Generated Mapping Data
- REACTOME_PW_GO_MAPPINGS.txt

Protein Ontology (PRO)

Homepage: https://proconsortium.org/
Citation:

Natale DA, Arighi CN, Barker WC, Blake JA, Bult CJ, Caudy M, Drabkin HJ, D’Eustachio P, Evsikov AV, Huang H, Nchoutmboube J. The Protein Ontology: a structured representation of protein forms and complexes. Nucleic Acids Research. 2010;39(suppl_1):D539-45

Usage: Utilized to connect proteins to chemicals, genes, anatomy, catalysts, cell lines, cofactors, complexes, GO biological processes, GO cellular components, GO molecular functions, pathways, proteins, reactions, and transcripts. Additionally, the edges between this ontology and its dependencies are utilized:

ChEBI
DOID
GO

Notes: A partial, human-only version of this ontology was used. Details on how this version of the ontology was generated can be found under the Protein Ontology section of the Data_Preparation.ipynb Jupyter Notebook.

Files

Generated Human Version Protein Ontology (PRO)
- human_pro.owl (closed with the HermiT reasoner)
Other PRO Data Used: promapping.txt
Generated Mapping Data
- Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
- Ensembl Transcript-PRO Identifier Mapping: ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt
- Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt
- UniProt Accession-PRO Identifier Mapping: UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt
- STRING-PRO Identifier Mapping: STRING_PRO_ONTOLOGY_MAP.txt

Relations Ontology (RO)

Homepage: GitHub
Citation:

Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL, Rosse C. Relations in biomedical ontologies. Genome Biology. 2005;6(5):R46.

Usage: Utilized to connect all data sources in the knowledge graph. Additionally, the ontology is queried prior to building the knowledge graph to identify all relations, their inverse properties, and their labels.

Files

Generated RO Data
- INVERSE_RELATIONS.txt
- RELATIONS_LABELS.txt

Sequence Ontology (SO)

Homepage: GitHub
Citation:

Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology. 2005;6(5):R44

Usage: Utilized to connect transcripts and other genomic material like genes and variants.

Files

Generated Mapping Data
- genomic_sequence_ontology_mappings.xlsx
- SO_GENE_TRANSCRIPT_VARIANT_TYPE_MAPPING.txt

Uber-Anatomy Ontology (Uberon)

Homepage: GitHub
Citation:

Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA. Uberon, an integrative multi-species anatomy ontology. Genome Biology. 2012;13(1):R5

Usage: Utilized to connect tissues, fluids, and cells to proteins and transcripts. Additionally, the edges between this ontology and its dependencies are utilized:

ChEBI
CL
GO
PRO

Vaccine Ontology (VO)

Homepage: http://www.violinet.org/vaccineontology/
Citations:

He Y, Racz R, Sayers S, Lin Y, Todd T, Hur J, Li X, Patel M, Zhao B, Chung M, Ostrow J. Updates on the web-based VIOLIN vaccine database and analysis system. Nucleic Acids Research. 2013;42(D1):D1124-32

Xiang Z, Todd T, Ku KP, Kovacic BL, Larson CB, Chen F, Hodges AP, Tian Y, Olenzek EA, Zhao B, Colby LA. VIOLIN: vaccine investigation and online information network. Nucleic Acids Research. 2007;36(suppl_1):D923-8

Usage: Utilized the edges between this ontology and its dependencies:

ChEBI
DOID
GO
PRO
UBERON

DATABASE SOURCES

BioPortal
ClinVar
Comparative Toxicogenomics Database
DisGeNET
Ensembl
GeneMANIA
Genotype-Tissue Expression Project
Human Genome Organisation Gene Nomenclature Committee
Human Protein Atlas
National Center for Biotechnology Information Gene
Reactome Pathway Database
Search Tool for Recurring Instances of Neighbouring Genes Database
Universal Protein Resource Knowledgebase

BioPortal

Homepage: BioPortal
Citation:

BioPortal. Lexical OWL Ontology Matcher (LOOM)

Ghazvinian A, Noy NF, Musen MA. Creating mappings for ontologies in biomedicine: simple methods work. In AMIA Annual Symposium Proceedings 2009 (Vol. 2009, p. 198). American Medical Informatics Association

Usage: BioPortal was utilized to obtain mappings between MeSH identifiers and ChEBI identifiers for chemicals-diseases, chemicals-genes, chemicals-GO biological processes, chemicals-GO cellular components, chemicals-GO molecular functions, chemicals-phenotypes, chemicals-proteins, and chemicals-transcripts. Additional information on how these data were processed can be obtained from the NCBO_rest_api.py GitHub Gist script.

⭐ ALTERNATIVE METHOD⭐ Since the above approach can take over two days to process, we have developed an alternative solution that downloads the mesh2021.nt data file directly from MeSH and the Flat_file_tab_delimited/names.tsv.gz file directly from ChEBI. Using these files, we have recapitulated the LOOM algorithm implemented by BioPortal when creating mappings between these resources. The procedure is relatively straightforward and utilizes the following information from each resource:

For all MeSH SCR Chemicals, obtain the following information:
- Identifiers: MeSH identifiers
- Labels: string labels using the RDFS:label object property
- Synonyms: track down all synonyms using the vocab:concept and vocab:preferredConcept object properties
For all ChEBI classes, obtain the following information:
- Labels: string labels using the RDFS:label object property
- Synonyms: track down all synonyms using all synonym object properties
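
The matching step described above can be sketched roughly as follows. This is a minimal illustration and not the project's implementation: the normalization rule (lowercasing and dropping non-alphanumeric characters), the three-character minimum, the function names, and the sample records are all assumptions made for the example.

```python
import re
from collections import defaultdict


def normalize(term):
    """Lowercase a label and drop everything except letters and digits (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


def build_index(records, min_length=3):
    """Map each normalized label or synonym to the set of identifiers that use it."""
    index = defaultdict(set)
    for identifier, terms in records.items():
        for term in terms:
            key = normalize(term)
            if len(key) >= min_length:  # skip very short strings, which match too easily
                index[key].add(identifier)
    return index


def lexical_matches(mesh_records, chebi_records):
    """Return sorted (MeSH id, ChEBI id) pairs that share a normalized label or synonym."""
    chebi_index = build_index(chebi_records)
    pairs = set()
    for key, mesh_ids in build_index(mesh_records).items():
        for chebi_id in chebi_index.get(key, ()):
            pairs.update((mesh_id, chebi_id) for mesh_id in mesh_ids)
    return sorted(pairs)


# Hypothetical records: identifier -> labels and synonyms gathered from each resource.
mesh = {"MESH:C000001": ["Example Compound-A", "compound a, example"],
        "MESH:C000002": ["Unmatched substance"]}
chebi = {"CHEBI:000001": ["example compound A"],
         "CHEBI:000002": ["another molecule"]}

matches = lexical_matches(mesh, chebi)
print(matches)
```

Indexing one side first keeps the comparison close to linear in the number of labels and synonyms, rather than comparing every MeSH string against every ChEBI string.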

Files

Generated Data: MESH_CHEBI_MAP.txt

ClinVar

Homepage: https://www.ncbi.nlm.nih.gov/clinvar/
Citation:

Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Jang W, Karapetyan K. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Research. 2017;46(D1):D1062-7

Usage: ClinVar was utilized to create variant-gene, variant-disease, and variant-phenotype edges. The original data is filtered such that only records meeting the following criteria were included:

Assembly = "GRCh38"
ClinSigSimple = 1
- 1 = at least one current record submitted with an interpretation of Likely pathogenic or Pathogenic (independent of whether that record includes assertion criteria and evidence)
ReviewStatus in ["criteria provided, multiple submitters, no conflicts", "reviewed by expert panel", "practice guideline"]
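
As a rough sketch, the three criteria above could be applied to a tab-delimited file like this. The field names Assembly, ClinSigSimple, and ReviewStatus come from the criteria; the `Name` column, the function name, and the sample rows are invented for illustration and do not reflect the real file's contents.

```python
import csv
import io

ACCEPTED_REVIEW_STATUS = {
    "criteria provided, multiple submitters, no conflicts",
    "reviewed by expert panel",
    "practice guideline",
}


def keep_clinvar_record(row):
    """Return True when a row meets all three criteria listed above."""
    return (
        row["Assembly"] == "GRCh38"
        and row["ClinSigSimple"] == "1"  # values are strings when read with csv
        and row["ReviewStatus"] in ACCEPTED_REVIEW_STATUS
    )


# Tiny made-up sample standing in for the tab-delimited ClinVar download.
sample = (
    "Name\tAssembly\tClinSigSimple\tReviewStatus\n"
    "variant_1\tGRCh38\t1\treviewed by expert panel\n"
    "variant_2\tGRCh37\t1\treviewed by expert panel\n"
    "variant_3\tGRCh38\t0\tpractice guideline\n"
    "variant_4\tGRCh38\t1\tno assertion criteria provided\n"
)

kept = [row["Name"]
        for row in csv.DictReader(io.StringIO(sample), delimiter="\t")
        if keep_clinvar_record(row)]
print(kept)
```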

Files

Downloaded Data
Generated Edge Data: CLINVAR_VARIANT_GENE_DISEASE_PHENOTYPE_EDGES.txt

Comparative Toxicogenomics Database (CTD)

Homepage: http://ctdbase.org/
Citations:

Curated [chemical–gene interactions|chemical-go interactions|chemical–disease interactions|gene–pathway interactions] data were retrieved from the Comparative Toxicogenomics Database (CTD), MDI Biological Laboratory, Salisbury Cove, Maine, and NC State University, Raleigh, North Carolina. World Wide Web (URL: http://ctdbase.org/)

Davis AP, Grondin CJ, Johnson RJ, Sciaky D, McMorran R, Wiegers J, Wiegers TC, Mattingly CJ. The comparative toxicogenomics database: update 2019. Nucleic Acids Research. 2018;47(D1):D948-54

Usage: Comparative Toxicogenomics Database (CTD) was utilized to create chemical-disease, chemical-gene, chemical-GO biological process, chemical-GO cellular components, chemical-GO molecular functions, chemical-phenotype, chemical-protein, chemical-rna, and gene-pathway edges. The original data is filtered such that only records meeting the following criteria were included:

chemical-disease: DirectEvidence != ""
chemical-gene: Organism == "Homo sapiens", GeneForms == "gene", and affects not in InteractionActions
chemical-GO biological process: PhenotypeName == "Biological Process" and Interaction <= "1.04e-47" (10th percentile)
chemical-GO cellular components: PhenotypeName == "Cellular Component" and Interaction <= "1.04e-47" (10th percentile)
chemical-GO molecular functions: PhenotypeName == "Molecular Function" and Interaction <= "1.04e-47" (10th percentile)
chemical-phenotype: DirectEvidence != ""
chemical-protein: Organism == "Homo sapiens", GeneForms == "protein", and affects not in InteractionActions
chemical-rna: Organism == "Homo sapiens", GeneForms == "mRNA", and affects and activity not in InteractionActions
gene-pathway edges: PathwayName == R-HSA-
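
The chemical-gene, chemical-protein, and chemical-rna criteria share one shape, which can be sketched as a single predicate. This is an assumption-laden illustration: it reads "not in InteractionActions" as a substring test on the InteractionActions value, and the function name, the `ChemicalID` key, and the sample records are hypothetical.

```python
def keep_chemical_gene(row, gene_form="gene", excluded_actions=("affects",)):
    """Apply the Organism / GeneForms / InteractionActions criteria to one record."""
    actions = row["InteractionActions"]
    return (
        row["Organism"] == "Homo sapiens"
        and row["GeneForms"] == gene_form
        # Assumed reading: none of the excluded words may appear in the actions.
        and not any(term in actions for term in excluded_actions)
    )


# Made-up records using the field names from the criteria above.
rows = [
    {"ChemicalID": "C1", "Organism": "Homo sapiens", "GeneForms": "gene",
     "InteractionActions": "increases^expression"},
    {"ChemicalID": "C2", "Organism": "Homo sapiens", "GeneForms": "gene",
     "InteractionActions": "affects^binding"},
    {"ChemicalID": "C3", "Organism": "Mus musculus", "GeneForms": "gene",
     "InteractionActions": "increases^expression"},
    {"ChemicalID": "C4", "Organism": "Homo sapiens", "GeneForms": "mRNA",
     "InteractionActions": "increases^activity"},
    {"ChemicalID": "C5", "Organism": "Homo sapiens", "GeneForms": "mRNA",
     "InteractionActions": "decreases^expression"},
]

gene_edges = [r["ChemicalID"] for r in rows if keep_chemical_gene(r)]
rna_edges = [r["ChemicalID"] for r in rows
             if keep_chemical_gene(r, "mRNA", ("affects", "activity"))]
print(gene_edges, rna_edges)
```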

Files

Downloaded Data
- Chemical-Gene Relations: CTD_chem_gene_ixns.tsv.gz
- Chemical-Disease/Phenotype Relations: CTD_chemicals_diseases.tsv.gz
- Chemical-GO Relations: CTD_chem_go_enriched.tsv.gz
- Gene-Pathway Relations: CTD_genes_pathways.tsv.gz

DisGeNET

Homepage: https://www.disgenet.org/
Citation:

Gene-disease association data retrieved from DisGeNET v6.0 (http://www.disgenet.org/), Integrative Biomedical Informatics Group GRIB/IMIM/UPF. [December, 2019].

Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, Furlong LI. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Research. 2019.

Usage: DisGeNET was utilized to create gene-disease and gene-phenotype edges. The original data is filtered such that only records meeting the following criteria were included: EI >= "1.0" (90th percentile). Additionally, data from this source were used to create mappings between different types of disease and phenotype identifiers, including:

OMIM, ORPHA, UMLS, ICD ➞ DOID
OMIM, ORPHA, UMLS, ICD ➞ HPO

Files

Downloaded Data
- Disease/Phenotype-Gene Relations: curated_gene_disease_associations.tsv.gz
- Disease Identifier Mapping: disease_mappings.tsv.gz
Generated Mapping Data
- Disease Identifier Mapping: DISEASE_DOID_MAP.txt
- Phenotype Identifier Mapping: PHENOTPYE_HPO_MAP.txt

Ensembl

Homepage: https://uswest.ensembl.org/index.html
Citation:

Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, Billis K, Cummins C, Gall A, Girón CG, Gil L. Ensembl 2018. Nucleic Acids Research. 2017;46(D1):D754-61

Usage: Ensembl data was utilized to create mappings between Ensembl genes, transcripts, and proteins with NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers in the knowledge graph (for additional details on the processing of these data, see Data_Preparation.ipynb):

Ensembl Transcript IDs ➞ PRO IDs
Gene Ensembl IDs ➞ Entrez Gene IDs
Gene Ensembl IDs ➞ PRO IDs
Gene Symbols ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ PRO IDs
Protein Ensembl IDs ➞ UniProt Protein Accession
STRING IDs ➞ PRO IDs
UniProt Protein Accession ➞ Entrez Gene IDs

Files

Downloaded Data
Generated Mapping Data
- Cleaned Ensembl Gene Set: ensembl_identifier_data_cleaned.txt
- Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
- Ensembl Transcript-PRO Identifier Mapping: ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt
- Gene Symbol-Ensembl Transcript Identifier Mapping: GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt
- Entrez Gene-Ensembl Transcript Identifier Mapping: ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt
- Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt
- Ensembl Gene-Entrez Gene Identifier Mapping: ENSEMBL_GENE_ENTREZ_GENE_MAP.txt

GeneMANIA

Homepage: https://genemania.org/
Citation:

Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C, Kazi F, Lopes CT, Maitland A. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research. 2010;38(suppl_2):W214-20

Usage: GeneMANIA was utilized to create gene-gene edges.

Files

Downloaded Data: COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt

Genotype-Tissue Expression Project (GTEx)

Homepage: https://gtexportal.org/home/
Citation:

Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, Foster B. The genotype-tissue expression (GTEx) project. Nature Genetics. 2013;45(6):580

Usage: The Genotype-Tissue Expression (GTEx) Project was utilized to create edges between protein-cell, protein-anatomy, rna-cell, and rna-anatomy entities. The original data were filtered such that only those edges where the median TPM was >=1.0 and genes were of any type other than protein-coding were included. We used the RNASeQC file rather than the RSEM file, as advised by the GTEx website:

The RSEM estimates are based on combining isoform-level estimates, which adds uncertainty to the resulting gene-level values (the isoform-level estimates are highly inaccurate in some cases).
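
The median TPM threshold can be illustrated with a short sketch that reads a GCT-style table (a version line, a dimensions line, then a header with Name, Description, and one column per tissue). It covers only the expression cutoff; the gene-type criterion needs gene annotations and is left out. The function name and the sample values are made up for the example.

```python
import csv
import io


def expressed_pairs(gct_text, min_tpm=1.0):
    """Yield (gene id, tissue, median TPM) for every value at or above min_tpm."""
    lines = io.StringIO(gct_text)
    next(lines)  # version line, e.g. "#1.2"
    next(lines)  # dimensions line: number of rows and columns
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        for tissue, value in row.items():
            if tissue in ("Name", "Description"):
                continue  # identifier columns, not expression values
            if float(value) >= min_tpm:
                yield row["Name"], tissue, float(value)


# Made-up two-gene, two-tissue table in GCT layout.
sample = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tLiver\tWhole Blood\n"
    "GENE_A\tA\t0.4\t12.5\n"
    "GENE_B\tB\t1.0\t0.0\n"
)

pairs = list(expressed_pairs(sample))
print(pairs)
```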

The file contains 54 unique tissue and/or cell types. GTEx provides mappings from tissue types to UBERON and EFO. These provided mappings were verified and extended, such that all samples which referenced a cell type were also mapped to the Cell and the Cell Line ontologies. This resulted in a total of 56 mappings (1.04 mappings/concepts).

Files

Downloaded Data: GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct
Mapping Results: zooma_tissue_cell_mapping_04JAN2020.xlsx
Generated Data
The final mapping set was combined with terms from the Human Protein Atlas.
- All HPA tissue and cell type strings: HPA_tissues.txt
- Final Term Mapping: HPA_GTEx_TISSUE_CELL_MAP.txt
- Final RNA, Gene, Protein-Tissues and Cell Types Relations: HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt

Human Genome Organisation Gene Nomenclature Committee (HUGO)

Homepage: https://www.genenames.org/
Citations:

HGNC Database, HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom www.genenames.org

Yates B, Braschi B, Gray K, Seal R, Tweedie S, Bruford E. Genenames.org: the HGNC and VGNC Resources in 2017. Nucleic Acids Research. 2017;45(D1):D619-625

Usage: The Human Genome Organisation (HUGO) data was utilized to obtain mappings between NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers. For additional details on the processing of these data, see Data_Preparation.ipynb:

Ensembl Transcript IDs ➞ PRO IDs
Gene Ensembl IDs ➞ Entrez Gene IDs
Gene Ensembl IDs ➞ PRO IDs
Gene Symbols ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ PRO IDs
Protein Ensembl IDs ➞ UniProt Protein Accession
STRING IDs ➞ PRO IDs
UniProt Protein Accession ➞ Entrez Gene IDs

Files

Downloaded Data: hgnc_complete_set.txt
Generated Data
- Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
- Gene Symbol-Ensembl Transcript Identifier Mapping: GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt

Human Protein Atlas (HPA)

Homepage: https://www.proteinatlas.org/
Citation:

Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson Å, Kampf C, Sjöstedt E, Asplund A, Olsson I. Tissue-based map of the human proteome. Science. 2015;347(6220):1260419

Usage: The Human Protein Atlas (HPA) was utilized to create rna-cell, rna-anatomy, protein-cell, and protein-anatomy edges. Evidence between gene and RNA expression in specific tissue types was derived by HPA, such that the consensus normalized expression was >=1.0. Zooma was utilized to automatically annotate the 153 unique tissues and cell types from Human Protein Atlas for all human protein-coding genes in the Human Proteome to the Cell Ontology, Cell Line Ontology, and the Uber-Anatomy Ontology. To best represent each concept, the automatic mappings from Zooma were extended through manual mapping efforts to ensure each cell type concept was matched to a Cell Ontology, Cell Line Ontology, and UBERON ontology term. This resulted in a total of 281 mappings (1.84 mappings/concepts).

Files

Downloaded Data: proteinatlas_search.tsv
Mapping Results: zooma_tissue_cell_mapping_04JAN2020.xlsx
Generated Data
- Final Term Mapping: HPA_GTEx_TISSUE_CELL_MAP.txt
- Final RNA, Gene, Protein-Tissues and Cell Types Relations: HPA_GTEX_RNA_GENE_PROTEIN_EDGES.txt

National Center for Biotechnology Information (NCBI) Entrez Gene

Homepage: https://www.ncbi.nlm.nih.gov/gene/
Citation:

Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research. 2005;33(suppl_1):D54-8.

Usage: The National Center for Biotechnology Information (NCBI) Gene data was utilized to obtain mappings between NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers. For additional details on the processing of these data, see Data_Preparation.ipynb:

Ensembl Transcript IDs ➞ PRO IDs
Gene Ensembl IDs ➞ Entrez Gene IDs
Gene Ensembl IDs ➞ PRO IDs
Gene Symbols ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ PRO IDs
Protein Ensembl IDs ➞ UniProt Protein Accession
STRING IDs ➞ PRO IDs
UniProt Protein Accession ➞ Entrez Gene IDs

Files

Downloaded Data: Homo_sapiens.gene_info.gz
Generated Data
- Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
- Entrez Gene-Ensembl Transcript Identifier Mapping: ENTREZ_GENE_ENSEMBL_TRANSCRIPT_MAP.txt
- Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt
- Ensembl Gene-Entrez Gene Identifier Mapping: ENSEMBL_GENE_ENTREZ_GENE_MAP.txt
- UniProt Accession-Entrez Gene Identifier Mapping: UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt

Reactome Pathway Database

Homepage: https://reactome.org/
Citation:

Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, Haw R, Jassal B, Korninger F, May B, Milacic M. The reactome pathway knowledgebase. Nucleic Acids Research. 2017;46(D1):D649-55

Usage: The Reactome Database was utilized to create chemical-pathway, GO Biological process-pathway, pathway-GO Cellular component, GO Molecular function-pathway, and protein-pathway edges. The original data is filtered such that only records meeting the following criteria were included:

chemical-pathway: column[5] == "Homo sapiens"
GO Biological process-pathway: column[5] startswith "REACTOME", column[8] == "P", and column[12] == "taxon:9606"
pathway-GO Cellular component: column[5] startswith "REACTOME", column[8] == "C", and column[12] == "taxon:9606"
GO Molecular function-pathway: column[5] startswith "REACTOME", column[8] == "F", and column[12] == "taxon:9606"
protein-pathway: column[5] == "Homo sapiens"
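
The column-based criteria above use zero-based positions in tab-delimited lines. A minimal sketch of both kinds of check follows; the function names, the edge-type dictionary, and the sample lines are assumptions made for illustration, not the project's code.

```python
ASPECT_TO_EDGE = {
    "P": "GO Biological process-pathway",
    "C": "pathway-GO Cellular component",
    "F": "GO Molecular function-pathway",
}


def classify_gaf_line(line):
    """Return the edge type for an annotation line, or None if it fails the criteria."""
    if line.startswith("!"):  # header and comment lines in the annotation file
        return None
    column = line.rstrip("\n").split("\t")
    if len(column) < 13:
        return None
    if column[5].startswith("REACTOME") and column[12] == "taxon:9606":
        return ASPECT_TO_EDGE.get(column[8])
    return None


def is_human_mapping(line):
    """Check column[5] == "Homo sapiens" for chemical-pathway or protein-pathway lines."""
    column = line.rstrip("\n").split("\t")
    return len(column) > 5 and column[5] == "Homo sapiens"


def make_line(reference, aspect, taxon):
    """Build a made-up 15-column annotation line for the demonstration."""
    return "\t".join(["DB", "ID0", "SYMBOL", "", "GO:0000000", reference, "TAS",
                      "", aspect, "", "", "protein", taxon, "20210101", "SOURCE"])


edge = classify_gaf_line(make_line("REACTOME:R-HSA-0000", "P", "taxon:9606"))
human = is_human_mapping("CHEBI:0\tR-HSA-0\thttps://example.org\tExample\tTAS\tHomo sapiens")
print(edge, human)
```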

Files

Downloaded Data
- Chemical-Pathway Relations: ChEBI2Reactome_All_Levels.txt
- Pathway-GO Relations: gene_association.reactome
- Protein-Pathway Relations: UniProt2Reactome_All_Levels.txt

Search Tool for Recurring Instances of Neighbouring Genes (STRING) Database

Homepage: string-db.org
Citation:

Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, Jensen LJ. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research. 2018;47(D1):D607-13

Usage: The Search Tool for Recurring Instances of Neighbouring Genes (STRING) Database was utilized to create protein-protein edges. The original data is filtered such that only records meeting the following criteria were included: combined_score >= "700" (>90th percentile).
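
A minimal sketch of the score cutoff, assuming a whitespace-delimited links file whose header row names a combined_score column. The function name and the sample protein identifiers are invented for the example.

```python
def high_confidence_links(lines, min_score=700):
    """Yield (protein1, protein2, score) for links at or above min_score."""
    header = next(lines).split()
    score_at = header.index("combined_score")
    for line in lines:
        fields = line.split()
        score = int(fields[score_at])
        if score >= min_score:
            yield fields[0], fields[1], score


# Made-up lines in the assumed layout: two identifiers followed by a score.
sample = [
    "protein1 protein2 combined_score",
    "9606.ENSP0000000001 9606.ENSP0000000002 950",
    "9606.ENSP0000000001 9606.ENSP0000000003 699",
    "9606.ENSP0000000002 9606.ENSP0000000003 700",
]

links = list(high_confidence_links(iter(sample)))
print(links)
```

Because the function consumes any iterator of lines, the same code could be pointed at an open file handle instead of the in-memory sample.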

Files

Downloaded Data: 9606.protein.links.v11.0.txt.gz
Generated Data: STRING-PRO Identifier Mapping: STRING_PRO_ONTOLOGY_MAP.txt

Universal Protein Resource (UniProt) Knowledgebase

Homepage: https://www.uniprot.org/
Citation:

UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research. 2018;47(D1):D506-15

Usage: The Universal Protein Resource (UniProt) Knowledgebase was utilized to obtain cofactor/catalyst-protein and protein-coding gene-protein edges as well as mappings between NCBI Gene identifiers, HUGO gene symbols, Universal Protein Resource (UniProt) Accession identifiers, and Protein Ontology identifiers. For additional details on the processing of these data, see Data_Preparation.ipynb:

Ensembl Transcript IDs ➞ PRO IDs
Gene Ensembl IDs ➞ Entrez Gene IDs
Gene Ensembl IDs ➞ PRO IDs
Gene Symbols ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ Transcript Ensembl IDs
Entrez Gene IDs ➞ PRO IDs
Protein Ensembl IDs ➞ UniProt Protein Accession
STRING IDs ➞ PRO IDs
UniProt Protein Accession ➞ Entrez Gene IDs

Files

Downloaded Data
- Cofactor and Catalyst relations: Cofactor/Catalyst Query Results
- UniProt Identifier Mapping: UniProt Identifier Query Results
Generated Data
- Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
- Protein-Cofactor Relations: UNIPROT_PROTEIN_COFACTOR.txt
- Protein-Catalyst Relations: UNIPROT_PROTEIN_CATALYST.txt
- UniProt Accession-PRO Identifier Mapping: UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt
- UniProt Accession-Entrez Gene Identifier Mapping: UNIPROT_ACCESSION_ENTREZ_GENE_MAP.txt

This project is licensed under Apache License 2.0 - see the LICENSE.md file for details. If you intend to use any of the information on this Wiki, please provide the appropriate attribution by citing this repository:

@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}

Files (4.4 GB)

Name	Checksum	Size
9606.protein.links.v11.0.txt	md5:acf948bc4f951a13e01cc8e7c360782f	540.9 MB
ChEBI2Reactome_All_Levels.txt	md5:b7145a51aca776d27f2d4a7c9af7b127	28.3 MB
chebi_with_imports.owl	md5:487cb7f3c6d7398f0754d172c4b9adec	606.9 MB
clo_with_imports.owl	md5:664a48d516f26fc520130ebd1184d43c	119.3 MB
COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt	md5:f527d7c1a6b3fe5bfb9994f4ce5c5328	246.2 MB
compath_canonical_pathway_mappings.txt	md5:7bf44f5872f3d05c0a22b33271487488	196.4 kB
CTD_chem_gene_ixns.tsv	md5:7290a3a682fb59217abb7fdf3e873b4a	422.5 MB
CTD_chem_go_enriched.tsv	md5:5661b32bb07e63965ddd4ffff3737223	798.4 MB
CTD_chemicals_diseases.tsv	md5:ac8487ab4182814a3455262683e78fdd	674.6 MB
CTD_genes_pathways.tsv	md5:1cb1ca9ebe618e234adf88952223b0b9	8.2 MB
curated_gene_disease_associations.tsv	md5:b1a897afe3040064fe32006551ebe519	11.5 MB
disease_mappings.tsv	md5:7cc7534396dde6b2c1f17f187f76646b	18.4 MB
downloaded_build_metadata.txt	md5:fea01fa416f2f0991d4fa45b2275a029	19.0 kB
ext_with_imports.owl	md5:99a2022fcfb093c6c417094da6a41f9f	77.5 MB
gene_association.reactome	md5:5626dca57d21050cf9475e4d85681cb4	11.9 MB
genomic_sequence_ontology_mappings.xlsx	md5:1905cb8c42abfa1172788a21c18f404a	20.6 kB
genomic_typing_dict.pkl	md5:4eed05ac754b7673c4e6da06d40cd9ff	2.3 kB
go_with_imports.owl	md5:31cc99a7757c84a98ea1de5494d00e8c	163.9 MB
goa_human.gaf	md5:38eb3879878cbcd90f9c994074b3ea6d	103.3 MB
GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct	md5:3d6ec49c6d524437362d70f2b6822367	17.8 MB
hgnc_complete_set.txt	md5:7513d121558c4dceb2d412b07225bc79	15.2 MB
Homo_sapiens.gene_info	md5:3592199c5f07b66dd48ed04a688dbce2	13.6 MB
Homo_sapiens.GRCh38.102.entrez.tsv	md5:ff01c086e5bb6dacd480fa6b7c1b37c6	17.1 MB
Homo_sapiens.GRCh38.102.gtf.zip	md5:14e85694c029e351128d0664dd8f234a	48.3 MB
Homo_sapiens.GRCh38.102.uniprot.tsv	md5:d197d79fa6209c7b7cd2a727daf1e857	13.5 MB
hp_with_imports.owl	md5:9916a17c2f6cac5ec5f8c2f990719502	80.2 MB
human_pro_classes.html	md5:0328ce2f7763790feaee6fad0a209335	8.8 MB
kegg_reactome.csv	md5:94473e04b47016229f7248f5632130c4	92.3 kB
mesh2021.nt.zip	md5:7daa4c8892e7c76cb77b32b41f7332f8	118.8 MB
mondo_with_imports.owl.zip	md5:c3778974749597c39fa223201827df93	11.3 MB
names.tsv	md5:7bffe31ed398589eedaafc4d2b561413	29.0 MB
phenotype.hpoa	md5:c22b3ebe159ca7578ed066d637fa8c17	27.2 MB
pr_with_imports.owl.zip	md5:843ec1d33df18773d4b461940728df53	35.5 MB
promapping.txt	md5:c48e06b9aecef62ee79a9f7e7746f6c6	15.3 MB
proteinatlas_search.tsv.zip	md5:caccf6c6b56d5601138fa471cabc3c55	4.5 MB
pw_with_imports.owl	md5:e9d718862eb3c7c64ffd0dba86344b73	5.0 MB
ReactomePathways.txt	md5:d1aaac559317e1944853b8a5c33b060f	1.4 MB
ro_with_imports.owl	md5:61b82112719c4ec5f0779c95be71ec21	855.5 kB
so_with_imports.owl	md5:97e498543b858290686247e62611e76a	5.2 MB
uniprot-cofactor-catalyst.tab.zip	md5:aaf8c1e6114969e38f7b64fea77769b6	2.1 MB
UniProt2Reactome_All_Levels.txt.zip	md5:f96a7220dfb3e05d4239f9ebe2c75451	10.9 MB
uniprot_identifier_mapping.tab	md5:061631d98b70a43011bfe4bb582dcb47	8.5 MB
variant_summary.txt.zip	md5:e22668aecdb5aca8506cd19b8ae7ef9f	76.4 MB
vo_with_imports.owl.zip	md5:6ea92b50af337402f13b5552164c512f	563.2 kB
zooma_tissue_cell_mapping_04JAN2020.xlsx.zip	md5:8cd3213dbe602f0214f1a8c11677c199	18.1 MB
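
The md5 values in the file listing can be used to confirm that a download is intact. The following is a small sketch using Python's standard hashlib module; the function names and the temporary demonstration file are hypothetical, and the checksum in the demonstration is computed on the spot rather than taken from the listing.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listing entry written as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demonstration on a throwaway file rather than a real download.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "wb") as handle:
        handle.write(b"example contents")
    listed = "md5:" + hashlib.md5(b"example contents").hexdigest()
    ok = matches_listing(path, listed)
    bad = matches_listing(path, "md5:" + "0" * 32)

print(ok, bad)
```

Reading in chunks matters here because several of the listed files are hundreds of megabytes.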
