A Meta-Path-Based Prediction Method for Disease Comorbidities

Abstract: The simultaneous presence of diseases worsens the prognosis of patients and makes their treatment difficult. Identifying the co-occurrence of diseases is key to improving the situation of patients and designing effective therapeutic strategies. On the one hand, the increasing availability of clinical information opens new ways to unveil hidden relationships between diseases. On the other hand, heterogeneous information networks have been used in recent years to discover novel knowledge from disease data, including symptoms, genes or drugs. The use of meta-paths allows the complex semantics of the relationships between the different types of nodes to be included in heterogeneous networks. In this study, we propose a system to predict disease comorbidities through the use of meta-paths in a heterogeneous network of diseases and symptoms, built from publicly accessible textual sources. The results obtained improve on those of similar studies based on biological data, and the predictions calculated for diabetes and Crohn's disease are supported by the medical literature. Both the data used and the resulting prediction model are publicly accessible.


I. INTRODUCTION
The occurrence of one or more additional conditions in the same patient, known as comorbidity, is widespread among patients admitted to multidisciplinary hospitals. For instance, obese patients often develop type-2 diabetes and hypertension. A number of clinical studies show that disease comorbidity not only causes additional suffering to patients, but also compromises the success of standard treatments compared to patients who have a single disease. In the US, 80% of Medicare spending is dedicated to treating patients with multiple coexisting conditions [1]. For this reason, the accurate prediction of potential disease comorbidities is essential to design more efficient treatment strategies and improve the prognosis of patients.
In recent years, the increasing availability of clinical data has boosted the investigation of unknown relationships between diseases. Given the variety of sources and data, heterogeneous information networks have become a crucial tool for extracting novel knowledge [2], [3]. The identification of new disease-disease relationships using link prediction methods has not only improved our understanding of their etiology and pathogenesis, but has also made it possible to reuse existing treatments in new diseases [4]. Meta-paths, sequences of semantic relationships between nodes of heterogeneous networks, provide a powerful mechanism for the training of link prediction models [5]. For example, two diseases can be connected via a disease-gene-disease path, a disease-gene-compound-drug-disease path, and so on. Intuitively, the semantics underlying different paths imply different similarities. Formally, a meta-path $P$ is a path defined on the graph of the network schema $T_G = (A, R)$ and is denoted in the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \dots \xrightarrow{R_l} A_{l+1}$, which defines a composite relation $R = R_1 \circ R_2 \circ \dots \circ R_l$ between types $A_1$ and $A_{l+1}$, where $\circ$ denotes the composition operator on relations [6].
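As a small illustration of the composite relation defined by a meta-path, the following sketch composes a toy has_symptom relation with its inverse to obtain the pairs of diseases linked by a disease-symptom-disease meta-path. The disease and symptom names, as well as the helper functions `inverse` and `compose`, are assumptions introduced only for this example.

```python
# Toy relation stored as a set of (source, target) pairs.
# All disease and symptom names are illustrative only.
has_symptom = {
    ("diabetes", "fatigue"),
    ("diabetes", "thirst"),
    ("anemia", "fatigue"),
    ("flu", "fever"),
}


def inverse(relation):
    """Return the inverse relation (symptom -> disease)."""
    return {(b, a) for a, b in relation}


def compose(first, second):
    """Return all pairs (a, c) with (a, b) in first and (b, c) in second."""
    targets_by_source = {}
    for b, c in second:
        targets_by_source.setdefault(b, set()).add(c)
    return {(a, c) for a, b in first for c in targets_by_source.get(b, ())}


# Meta-path disease -> symptom -> disease as a composite relation.
disease_symptom_disease = compose(has_symptom, inverse(has_symptom))

# Drop the trivial pairs in which a disease is related to itself.
related = {(a, c) for a, c in disease_symptom_disease if a != c}
```

In this toy example only the two diseases that share a symptom end up related through the meta-path; a disease with no shared symptom is connected to no other disease.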
The use of meta-paths often involves a two-step process to solve the link prediction problem in heterogeneous networks. In the first step, the meta-path-based feature vectors are extracted. In the second step, a regression or classification model is trained to compute the existence probability of a link. For example, Sun et al. proposed PathPredict to solve the problem of co-author relationship prediction following this approach [7]. In [8], Dong et al. present the Metapath2Vec model to maximize the likelihood of preserving both the structure and semantics of a given heterogeneous network and apply its latent embeddings to various network mining tasks, such as node classification, clustering, and link prediction. In contrast to conventional meta-path-based methods, the advantage of latent-space representation learning lies in its ability to model similarities between nodes that are not connected through meta-paths. Recent studies have used heterogeneous networks and meta-paths for the prediction of comorbidities from biological data. Jin et al. built a miRNA-gene-disease network to uncover microRNA-mediated disease comorbidities and potential pathobiological implications [9]. Their method achieved a performance, measured as the area under the Receiver Operating Characteristic curve (AUC-ROC), of 0.65 when inferring the clinically reported disease-disease pairs.
Despite the growing number of clinical texts and their potential as a source of new knowledge, their exploitation in the prediction of comorbidities through heterogeneous networks is limited, partly due to the restricted access to electronic health records imposed by privacy laws. In this paper, we present a method for predicting comorbidities from public clinical data, based on meta-paths. First, we built a heterogeneous network of diseases and symptoms, and defined the meta-paths. Next, we applied the Metapath2Vec model to tackle link prediction as a supervised learning problem on top of the network embeddings. The AUC-ROC obtained when evaluating the model was 0.74. Finally, we applied the prediction model to type-2 diabetes and Crohn's disease, and found that the results were supported by the medical literature. Figure 1 summarizes the methods schematically. Both the data used and the results obtained are published as supplementary materials for validation and reuse (https://github.com/pantapps/cbms21).

II. METHODS
A. Heterogeneous disease-symptom network
We extracted data on associations between diseases and symptoms from DISNET, a database that integrates phenotypic characteristics of diseases from Wikipedia, PubMed and MayoClinic, among others [10]. The DISNET snapshot of 2020-12-15 contains 7,193 diseases associated with 2,103 different symptoms. To extract the disease-disease relationships based on their co-occurrence in the same patient, we used the ShARe corpus published in the SemEval / CLEF 2013-2015 evaluations, which contains 300 clinical notes with 12,095 annotated disorders and their attributes [11].
Figure 1. Visual summary of the methods applied to generate the comorbidity prediction model described in this paper.
To connect the data from both sources, we used the Search API of the Unified Medical Language System (UMLS) to map the cross-referenced identifiers in DISNET to their Concept Unique Identifiers (CUIs) [12]. On the one hand, we only included DISNET diseases with a mapping in UMLS. On the other hand, we only selected diseases from the ShARe corpus that have associated symptoms in DISNET.
Finally, we used the StellarGraph Python library to build the heterogeneous network. Of the total of 5,147 nodes, 3,251 were of type disease and 1,896 of type symptom. The 49,741 links were annotated as disease-has_symptom-symptom (46,333) or disease-has_cooccurrence-disease (3,408), according to their nature.
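The structure of such a network can be sketched with plain Python containers. The example below is only an illustration that uses the standard library rather than the StellarGraph API; the node identifiers, the toy edge lists and the helper `add_edge` are assumptions, and the real network is built from the DISNET and ShARe data.

```python
from collections import Counter, defaultdict

# Hypothetical toy records standing in for the DISNET and ShARe data.
symptom_links = [("D1", "S1"), ("D1", "S2"), ("D2", "S1"), ("D3", "S2")]
cooccurrence_links = [("D1", "D2")]

node_type = {}                 # node id -> "disease" or "symptom"
adjacency = defaultdict(set)   # undirected adjacency lists
edge_type = {}                 # unordered node pair -> edge label


def add_edge(u, u_type, v, v_type, label):
    """Register both endpoints with their types and store a typed edge."""
    node_type[u], node_type[v] = u_type, v_type
    adjacency[u].add(v)
    adjacency[v].add(u)
    edge_type[frozenset((u, v))] = label


for disease, symptom in symptom_links:
    add_edge(disease, "disease", symptom, "symptom", "has_symptom")
for first, second in cooccurrence_links:
    add_edge(first, "disease", second, "disease", "has_cooccurrence")

# Summary counts per node type and per edge type.
node_counts = Counter(node_type.values())
edge_counts = Counter(edge_type.values())
```

The two counters play the same role as the node and link totals reported above, here for the toy data only.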

B. Link prediction model
We used Metapath2Vec to learn the embeddings, maximizing the likelihood of preserving both the structure and semantics of the heterogeneous network [8]. First, we split our network into a training graph and a test graph. From each graph, we set aside a sample (10%) of positive and negative edges into a training edge set and a test edge set, respectively. Negative edges are sampled at random by selecting two nodes in the graph and then checking whether these nodes are connected. If they are not, the pair of nodes is considered a negative sample. Otherwise, it is discarded and the process repeats.
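The rejection-based sampling of negative edges can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sample_negative_edges`, the fixed seed, the rejection of repeated pairs and the guard against requesting too many samples are choices made for the example and are not taken from the original implementation.

```python
import random


def sample_negative_edges(nodes, positive_edges, n_samples, seed=0):
    """Draw node pairs at random and keep those that are not connected."""
    rng = random.Random(seed)
    nodes = list(nodes)
    connected = {frozenset(edge) for edge in positive_edges}

    # Guard (assumption): avoid an endless loop on very dense graphs.
    max_negatives = len(nodes) * (len(nodes) - 1) // 2 - len(connected)
    if n_samples > max_negatives:
        raise ValueError("not enough unconnected node pairs")

    negatives = set()
    while len(negatives) < n_samples:
        u, v = rng.sample(nodes, 2)        # two distinct nodes
        pair = frozenset((u, v))
        if pair in connected or pair in negatives:
            continue                       # discard and repeat
        negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Toy usage with hypothetical node identifiers.
toy_nodes = ["D1", "D2", "D3", "D4"]
toy_positive = [("D1", "D2"), ("D2", "D3")]
toy_negative = sample_negative_edges(toy_nodes, toy_positive, n_samples=3)
```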
Second, we applied uniform random walks to traverse the training graph and generate a corpus of sentences. A sentence is a list of node IDs, and each node ID is considered a unique word in a dictionary whose size equals the number of nodes in the graph. The random walk is driven by meta-paths that define the order of node types in which the random walker explores the graph. For example, the meta-path disease-symptom-disease defines a rule for the random walk to traverse the graph starting from a disease node and passing through a symptom node to end on a disease node. All meta-paths begin and end on disease-type nodes. Figure 2 shows the node and edge types, and the meta-path schema applied for our random walk. Third, we fed the sentence corpus into a Word2Vec model to calculate an embedding vector for each node in the graph. Given a word (node ID), Word2Vec uses the skip-gram algorithm to predict the neighboring words within a specified window. This model gives more importance to words closer to the target word than to distant ones [13].
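A meta-path-driven uniform random walk of the kind described above can be sketched in a few lines. The function `metapath_walks`, its parameters and the toy graph are assumptions made for illustration; the sketch repeats the meta-path cyclically (which is possible because it starts and ends on the same node type) and stops a walk early when no neighbor of the required type exists.

```python
import random


def metapath_walks(adjacency, node_type, metapath, walks_per_node, walk_length, seed=0):
    """Generate walks whose node types follow the given meta-path cyclically."""
    rng = random.Random(seed)
    cycle = metapath[:-1]   # the last type equals the first, so drop it
    starts = sorted(n for n, t in node_type.items() if t == metapath[0])
    walks = []
    for start in starts:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                wanted = cycle[len(walk) % len(cycle)]
                candidates = sorted(
                    n for n in adjacency[walk[-1]] if node_type[n] == wanted
                )
                if not candidates:
                    break   # dead end: no neighbor of the required type
                walk.append(rng.choice(candidates))   # uniform choice
            walks.append(walk)
    return walks


# Toy graph with hypothetical identifiers: two diseases sharing one symptom.
toy_adjacency = {"D1": {"S1", "D2"}, "D2": {"S1", "D1"}, "S1": {"D1", "D2"}}
toy_types = {"D1": "disease", "D2": "disease", "S1": "symptom"}
toy_walks = metapath_walks(
    toy_adjacency, toy_types, ["disease", "symptom", "disease"],
    walks_per_node=2, walk_length=5,
)
```

Each resulting walk is a list of node IDs and can serve as one sentence of the corpus passed to a skip-gram model.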
Then we applied element-wise multiplication (Hadamard product) on the embeddings of the source and target nodes to calculate edge embeddings for positive and negative edge samples from the training edge set [14]. Finally, we trained a logistic regression classifier with the edge embeddings to predict a binary value indicating whether an edge between two nodes is expected to exist or not.
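The two operations in this step, the Hadamard product and the logistic regression classifier, can be illustrated with a small standard-library sketch. The two-dimensional embeddings, the edge samples, the learning rate and the plain stochastic gradient descent loop are all assumptions chosen for the example; they are not the embeddings, the solver or the settings used in this study.

```python
import math


def hadamard(u, v):
    """Element-wise product of two node embeddings (edge embedding)."""
    return [a * b for a, b in zip(u, v)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that the edge represented by x exists."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict_proba(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical node embeddings and edge samples.
embeddings = {"D1": [1.0, 0.5], "D2": [0.9, 0.6], "D3": [-1.0, 0.4], "D4": [-0.8, 0.7]}
positive = [("D1", "D2"), ("D3", "D4")]
negative = [("D1", "D3"), ("D2", "D4")]

features = [hadamard(embeddings[u], embeddings[v]) for u, v in positive + negative]
labels = [1] * len(positive) + [0] * len(negative)
weights, bias = train_logistic_regression(features, labels)
probabilities = [predict_proba(weights, bias, x) for x in features]
```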
The heterogeneous network edge list and the trained model are available in the supplementary materials.

C. Model evaluation
To evaluate our predictor, we used the test graph to compute test node embeddings, and then computed the AUC-ROC using the test edge set. To evaluate its performance qualitatively, we applied our model to predict the comorbidities of type-2 diabetes mellitus and Crohn's disease, and we contrasted the results with data available in the clinical literature [15]-[18].
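For reference, the AUC-ROC can be read as the probability that a randomly chosen positive edge receives a higher score than a randomly chosen negative edge, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auc_roc` and the toy labels and scores are assumptions for illustration, and library implementations are normally preferable for large test sets.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of positive-negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute the AUC-ROC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0     # correctly ranked pair
            elif p == n:
                wins += 0.5     # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive-negative pairs are ranked correctly.
toy_auc = auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```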

III. RESULTS
The computed comorbidity prediction model achieved an AUC-ROC of 0.74. Figure 3 represents the AUC-ROC visually.
One of the advantages of using node embeddings in our approach is the possibility of representing the heterogeneous network in a low-dimensional space, in which the structural information and properties of the graph are maximally preserved. We used t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the embeddings computed for the nodes and edges, by giving each data point a location in a two-dimensional map [19]. Figure 4 shows the t-SNE projection for node embeddings (A) and edge embeddings (B).
Table I and Table II contain the top 20 predicted disease-disease links (comorbidities) for type-2 diabetes mellitus and Crohn's disease, respectively. In the tables, diseases are sorted by the probability of no co-occurrence (P0 column) in descending order.
An extended version with the top 100 predicted links is available in the supplementary materials.

IV. DISCUSSION
The results show that the presented method can predict co-occurrences between diseases from public data on symptoms and diseases with reasonable accuracy (see Figure 3). The AUC-ROC of our model (0.74) clearly improves on that obtained by Jin et al. (0.65), who applied meta-paths to miRNA, gene and protein data instead of symptoms [9]. However, it is still lower than that of other more advanced models [20].
When applying the model to type-2 diabetes mellitus (see Table I), we obtained results that coincide with the most common comorbidities reported in the clinical literature, such as hypertension, chronic kidney diseases, cardiovascular diseases and visual problems [15], [16]. Other cases, such as degenerative polyarthritis, pneumoperitoneum, avascular necrosis of bone or corn of toe, are not among the most common comorbidities, but are reported in the medical literature [21]-[24]. The extended results show numerous co-occurrences of diabetes with fractures (e.g., fracture of cervical spine, fracture of second cervical vertebra, rib fractures). The relationship between diabetes and bone fragility has also been studied [25].
In the case of Crohn's disease, the most common comorbidities are intestinal diseases (colon cancer, rectal cancer), respiratory diseases, vascular diseases, and arthritis. The results shown in Table II include diseases of these types [17], [18]. As in the case of diabetes, we also find very specific cases, such as pathologic fistula, Mallory-Weiss syndrome and multiple sclerosis, that are described in the clinical literature [26], [27].
Notwithstanding the aforementioned results, our study has some limitations. On the one hand, the number of diseases with a high predicted probability of comorbidity (> 0.95) is large, representing 17.32% and 11.05% for type-2 diabetes mellitus and Crohn's disease, respectively. This suggests that the classification is not specific enough. On the other hand, the data set contains common and/or unspecified diseases such as carcinoma, cancer or vitamin deficiency, which could affect the results. Pre-filtering the data set to eliminate these types of entries could potentially improve the specificity of the system.

V. CONCLUSIONS
Improving our knowledge about disease comorbidities can improve the treatment of patients, saving not only suffering but also healthcare resources. In this paper, we propose the exploitation of data from open clinical texts through a meta-path-based network analysis to predict the probability of co-occurrence of two diseases. Both the data used and the results obtained are publicly available.
The main advantage of our approach is its good complexity-performance ratio. Methods based on meta-paths with random walks are intuitive and simple, describing the relationships between data in a semantic and interpretable way. However, they are less powerful than more complex methods, such as those based on graph neural networks (GNNs). GNNs are able to incorporate both latent and explicit features of the graph, demonstrating state-of-the-art performance on numerous problems, including link prediction [28].
As future work, we propose to apply methods based on GNNs to the prediction of comorbidities from textual data and compare the results with those obtained in the present study, considering the complexity-performance relationship.

ACKNOWLEDGMENT
This work is a result of the project "DISNET (Creation and analysis of disease networks for drug repurposing from heterogeneous data sources applied to rare diseases)", which is being developed under grant "RTI2018-094576-A-I00" from the Spanish Ministerio de Ciencia, Innovación y Universidades. Lucía Prieto Santamaría's work is supported by "Programa de fomento de la investigación y la innovación (Doctorados