GUILDify v2.0: A tool to identify molecular networks underlying human diseases, their comorbidities and their druggable targets

The genetic basis of complex diseases involves alterations on multiple genes. Unravelling the interplay between these genetic factors is key to the discovery of new biomarkers and treatments. In 2014, we introduced GUILDify, a web server that searches for genes associated to diseases, finds novel disease-genes applying various network-based prioritisation algorithms and proposes candidate drugs. Here, we present GUILDify v2.0, a major update and improvement of the original method, where we have included protein interaction data for seven species and 22 human tissues and incorporated the disease-gene associations from DisGeNET. To infer potential disease relationships associated with multi-morbidities, we introduced a novel feature for estimating the genetic and functional overlap of two diseases using the top-ranking genes and the associated enrichment of biological functions and pathways (as defined by GO and Reactome). The analysis of this overlap helps to identify the mechanistic role of genes and protein-protein interactions in comorbidities. Finally, we provided an R package, guildifyR, to facilitate programmatic access to GUILDify v2.0 (http://sbi.upf.edu/guildify2 )


Introduction
Complex diseases such as cancer, diabetes, neurodegenerative disorders or cardiovascular diseases are rarely caused by a single genetic perturbation and usually involve polygenic modifications on the underlying interconnected cellular network. Understanding the genetic basis of diseases and the interactions of disease-associated proteins in the protein interaction network (PIN) is essential for the development of new rational therapeutic strategies. Despite recent large-scale genotyping efforts, information on disease-gene associations is still limited, often explaining a small percentage of the phenotypic variance observed among individuals [1]. To address this limitation and infer novel disease-gene associations, various disease-gene prioritisation methods have been suggested, exploiting the "guilt-by-association" principle over certain features of disease-genes such as similarity in sequence and functional annotations, clustering in the linkage interval, or proximity in the PIN [2].
Indeed, albeit the PINs being incomplete [3], the proximity to disease-genes in the PIN has proven extremely useful in prioritising disease-associated genes [4]. Consequently, a number of tools and web servers has been developed to expand the number of disease-associated genes using the interactome [5][6][7][8][9].
Previously, we presented GUILDify, a web server that applies the prioritisation algorithms developed in GUILD software to find novel disease-gene associations based on the connectedness of genes in the PIN [10,11]. GUILDify searches for genes starting from user-provided keywords such as the names of diseases or gene symbols in the BIANA knowledge database. It uses the genes associated to the keywords as seeds and the PIN for the selected organism to apply graph theory algorithms to prioritise new disease genes. Recently, GUILDify has been applied to: (i) find comorbidities across genetic diseases [12]; (ii) construct PINs specific to breast cancer metastasis to lung and brain [13]; (iii) identify candidate genes for body size in sheep [14] and (iv) prioritise preeclampsia pathogenesis [15].
Here, we present a comprehensive upgrade, GUILDify v2.0, where we updated the underlying biological databases in BIANA knowledge database (protein and drug-target interactions, functional and disease annotations) and: (i) facilitated the use of seven species-specific PINs and 22 human tissue-specific PINs; (ii) increased the quality and number of disease-gene associations by incorporating DisGeNET to our datasets; (iii) incorporated the option to search by drug name, allowing the prioritisation of genes based on known drug targets to uncover the neighbourhood of the PIN affected by the drug; (iv) improved the visualisation of the results using cytoscape.js; (v) refined the definition of top-ranking genes based on whether they had similar functional annotations as the seeds, thus providing the biologically most coherent subnetwork relevant to a given disease; (vi) introduced a feature to measure the genetic and functional overlap of the top-ranking genes of two different diseases, supporting the investigation of disease comorbidities; (vii) implemented a new drug repurposing functionality to propose novel indications for a given drug based on the genetic and functional overlap; and (viii) developed an R package to facilitate the programmatic access to the methods implemented in the web server.

Identifying genetic and functional similarities across diseases
In recent works, we have shown that the genetic and functional similarities of diseases in the PIN can be used to characterise co-and multi-morbidities across diseases [12] and also to repurpose existing drugs targeting these diseases [16]. Motivated by these findings and to provide systematic insights on disease-disease relationships, GUILDify v2.0 now allows users to identify the overlap between two previously submitted results, i.e. sets of genes linked to two different diseases. Accordingly, given two job IDs corresponding to the prioritisation results of two different diseases, GUILDify v2.0 provides: (i) the overlap between the top-ranking genes of the two diseases; (ii) the overlap between the enriched functions among the top-ranking genes of the two diseases; (iii) the enriched functions among the common top-ranking genes; and (iv) a network visualisation of the interactions between common topranking genes. Moreover, GUILDify v2.0 also calculates the Fisher's exact test to quantify the significance of the overlap between genes and functions and report one-sided P-value (see details in Supplementary Material). GUILDify v2.0 is the first server that permits the use of gene prioritisation results to explore disease-disease relationships with such simplicity and flexibility.

Prioritisation of drug targets
GUILDify v2.0 now allows to search by a drug in addition to a phenotype and returns a list of drugtarget associations integrated from DrugBank [17], DGIdb [18], DrugCentral [19] and ChEMBL [20] (see details in Supplementary Material). This new functionality allows the characterisation of the neighbourhood of the drug in the PIN, i.e. neighbouring proteins to those targeted by the drug, and thus providing insights on the potential mechanism of action of the drug. Moreover, the novel feature of assessing the overlap between two network expansion runs (i.e. two job IDs) can also be applied in multiple scenarios to: (i) identify the similarity between the neighbourhood of two drugs in the PIN, which can be useful to identify drug interactions; (ii) compare the neighbourhood of a disease with the neighbourhood of a drug in the PIN, which can be applied to drug repurposing. Such novel features make GUILDify v2.0 one of the most easy-to-use and flexible web servers to inspect the effect of drugs in the PIN.

Screening diseases to identify potential new indications of known drugs
Building upon new technical developments mentioned above, GUILDify v2.0 now offers a novel drug repurposing functionality. Given a job ID associated with a drug (or a list of drug targets), this feature automatically calculates the overlap of genes (or functions) between the given drug and a set of precalculated diseases. Details on the method and validation of drug repurposing are described in detail at Supplementary Material.

Tissue and species-specific PINs
The analysis of the protein interactions in a tissue-specific context is becoming increasingly relevant to understand genetic diseases and find improved treatments [21]. We have included tissue-specific networks derived from 22 different human tissues (see Supplementary Table S1). To create these networks, we filtered the interactions in the global PIN using RNAseq data from GTEx [22], keeping only the interactions between proteins encoded by genes that are expressed in a given tissue (i.e. considering only transcripts with TPM (transcripts per kilobase million) expression values of 1 or higher (see details in Supplementary Material). We have also included 7 species-specific PINs derived from experimentally determined protein-protein interactions. Although the coverage of interactomic data for some species is low (e.g., 11,943 interactions in rat vs 320,337 interactions in human), these PINs provide a reliable backbone for interactome-based analyses (e.g., in preclinical research) as opposed to PINs generated by predicted interactions based on homology information.
To investigate the increase in the number of disease-gene associations between versions 1 and 2 of GUILDify, we checked the number of associations for the lowest-level non-obsolete diseases from Disease Ontology [32] that were available in our repositories (2,190 terms). GUILDify v1 contains gene associations for 1,505 diseases and 4,171 genes (2.8 genes per disease), while updated GUILDify v2.0 has gene associations for 2,064 diseases and 11,615 genes (5.6 genes per disease on average).

Functional-coherency based selection of top-ranking genes
One of the main issues when working with disease-gene prioritisation is to select the most relevant (top ranked) genes associated with a given disease. The user can select top 1% or 2% highest scoring genes among all the proteins in the PIN as top ranked genes. In GUILDify v2.0, we also introduced a cutoff based on the functional validation approach described in Ghiassian et al. [5] and provided a new panel visualising the significance of the functional enrichment (P-value) as a function of the number of top-ranking genes included in the validation (implemented in Plotly). In brief, the highest-scoring non-seed proteins are iteratively included in the top-ranking set, provided that they maintain the functional coherency of the existing top-ranking set (see details in Supplementary Material). Note that this approach might be too restrictive for some complex diseases in which the information on known disease-gene associations is limited, failing to represent the functional diversity involved in the disease.

Visualisation of the top-ranking subnetwork
GUILDify v2.0 uses the JavaScript-based network visualisation library, Cytoscape.js [33], to show the subnetwork of the top-ranking proteins and the drugs targeting these proteins. The user can decide the cutoff to define the top ranked proteins to be visualised (top 1%, top 2% or functionally-coherent as mentioned above). In addition to seeds (green hexagons), top-ranking proteins (yellow circles) and drugs (blue diamonds), the subnetwork includes the proteins that connect the seeds to the largest connected component induced by seeds (named "linkers" and shown as grey circles, see details in Supplementary Material).

R package
We have included an R package in order to provide programmatic access to GUILDify v2.0 through R statistical computing environment (https://www.r-project.org/). The package implements methods to query and retrieve results from the web server as an R data frame, allowing users to run multiple queries for more high-throughput and/or systematic analyses. The package and documentation are available online at: http://sbi.upf.edu/guildify2. If the user provides a keyword (or set of keywords) describing a phenotype or drug, the server searches genes containing the keyword in BIANA knowledge database (i.e. integrating information from many resources), otherwise it uses the list of provided gene names. The server shows the selected seeds, which can still be filtered and selected by the user. Then, for the prioritisation parameters the user can select to run the "disease module detection algorithm" (DIAMOnD, downloaded from https://github.com/dinaghiassian/DIAMOnD) [5] or to use one of the several prioritisation algorithms from the GUILD package (default value: NetScore with default parameters). Finally, to retrieve results, the required input is the job ID of a previous run, while for calculating genetic and functional overlap the inputs are two job IDs of previous runs.

Case studies 1. Exploring the mechanistic links between rheumatoid arthritis and asthma
In multiple studies, rheumatoid arthritis and asthma are linked as a potential comorbidity, although the mechanisms underlying this association remain unclear [34]. Using the new functionality of GUILDify v2.0, we can assess the overlap between diseases and thus propose a potential mechanism to explain the association between them. Querying for "rheumatoid arthritis" and "asthma" returns 156 and 96 seeds, respectively coming from DisGeNET, OMIM, and UniProt. There are already 12 seeds in common (Fisher's exact test, one-sided P-value = 1.4·10 -9 ) and 18 common functions out of the total enriched functions of the seeds (P-value = 9.3·10 -23 ). After running GUILDify v2.0, we select 290 and 181 top ranked genes using functional-coherency based cutoff for rheumatoid arthritis and asthma, respectively. We find that the number of common genes increases to 55 (yielding a P-value = 5.9·10 -48 ), while the number of common functions (biological processes) increases to 31 (P-value = 8.1·10 -46 ). The link between these diseases is significant even when the seeds are removed from the top-ranking genes (see Supplementary Material). Among the shared top-ranking genes, we find Tumor Necrosis Factor (TNF), which has been proposed as a potential drug target for asthma and rheumatoid arthritis, and highlighted as a potential precursor of the comorbidity [12]. We also find HLA-DRB1 and several interleukins (IL18, IL1B, IL3), taking part of the immune response potentially involved in both diseases. Furthermore, the most common enriched functions relate to inflammatory processes such as "inflammatory response", "positive regulation of interferon-gamma production" and "positive regulation of T-helper 1 cell cytokine production". These functions appear again if we check the functions enriched by the common genes, along with other functions such as "T-helper 1 type immune response" or "negative regulation of type 2 immune response", highlighting the involvement of type 1 immune response in both diseases. As negative controls, we repeated the analysis using other disease pairs that are not likely to be comorbid such as "rheumatoid arthritis" -"breast cancer" and "asthma" -"breast cancer", finding drastically reduced number of genes in the overlap between these disease pairs (see Supplementary Material). The results can be further explored in Figure 1 and in the pre-calculated examples section of the web. Additionally, we compared the functional relevance of the top-ranking genes identified by NetScore with DIAMOnD, based on the analysis in Sharma et al. [35] (see Supplementary Material). We checked the enrichment of top-ranking genes among the pathways containing the seed genes of asthma and rheumatoid arthritis, showing that both methods significantly recover the pathways in each disease. Furthermore, NetScore identified more genes that belonged to the pathways shared between asthma and rheumatoid arthritis compared to DIAMOnD.

Study of the mechanism of non-small cell lung carcinoma drugs
Non-small cell lung cancer (NSCLC) is the most common type of lung cancer. Typically induced by exposure to toxic substances, the NSCLC pathology has been specially associated with a mutation in the Epidermal Growth Factor Receptor (EGFR) [36]. In a recent study, 9 drugs were proposed to treat this disease [37], 6 of them having drug-target interactions reported: Afatinib, Ceritinib, Crizotinib, Erlotinib, Gefitinib and Palbociclib. Given that we can now identify potentially new relationships between drugs and diseases using drugs as queries, we investigate whether the neighbourhood of the targets of these drugs in the PIN significantly overlaps with the neighbourhood of the genes associated with NSCLC. We used GUILDify v2.0 to define this neighbourhood. We observe that the genetic overlap is always significant, except for one of the drugs (Palbociclib, see Table 1).
We confirm the significance by applying the same approach to breast cancer, showing that Ceritinib, Crizotinib and Palbociclib produce a significant genetic overlap, although the number of common genes in each case is substantially lower than it is in NSCLC (see Table 1). These results are consistent with the fact that Palbociclib is primarily indicated for breast cancer and it has been recently repurposed for NSCLC [38]. The small but significant overlap of Ceritinib and Crizotinib suggests that these two drugs might also be considered as potential repurposing candidates. We note that using the top-ranking nodes increases the significance of the genetic overlap (with lower Pvalues) compared to the overlap using only seeds (genes associated with a pathophenotype and direct targets of drugs). The significant overlap between the top ranked genes identified using these drugs and the top ranked genes for NSCLC (but not for the top ranked genes for breast cancer) suggests that GUILDify v2.0 can help understanding how drugs exert their action on certain diseases.
Indeed, the characterisation of the neighbourhood in the PIN that is affected by drugs opens a wide range of possibilities for drug repurposing research.

Overlap and functional enrichment analysis
We use one-sided Fisher's exact test to calculate the overlap between two sets of genes or functions and use Benjamini-Hochberg multiple hypothesis testing procedure (where applicable). The functions enriched among seeds and top-ranking nodes as well as common functions between two diseases are calculated as explained in a previous work [12] (see details in Supplementary Material).

Conflicts of Interest Statement
None declared.   with "non small cell lung carcinoma" and "breast cancer" (top ranking genes and seeds) and the subnetwork of genes associated with the targets of drugs Afatinib, Ceritinib, Crizotinib, Erlotinib, Gefitinib and Palbociclib (drug targets and top-ranking genes obtained with GUILDify v2.0). P-values shown have been corrected using the Benjamini-Hochberg correction for multiple tests. Results with non-significant P-value are highlighted in red.