Towards predicting compound activity by traversing biomedical knowledge graphs
University of Sheffield, Evotec UK
The discovery of new drugs remains central to the aspiration of improving health care. Nevertheless, the drug discovery process is complex and stubbornly resource intensive. Conceptualising biological systems as networks, and representing them with suitable graph data models, has opened the door to adapting a wide range of machine learning methods that exploit the relational nature of such data. For drug discovery, a particularly rich avenue for network-based knowledge discovery has been to cast compound property prediction as a knowledge graph completion, or link prediction, problem. A biomedical knowledge graph is in essence a heterogeneous network integrating the relationships between entities such as genes, proteins, compounds and diseases. In such a graph, a variety of interesting properties of a given entity are encoded as direct links to other entities, and these properties may correlate with more complex patterns within the graph.
Identifying and exploiting such associations lies at the heart of drug discovery when framed as a knowledge graph completion problem. This poster summarises the results of the initial efforts to explore and tackle it in this form.
As a first step, a knowledge graph was created by integrating public data sources from the areas of proteomics, chemistry and pharmacology. Data from Ensembl (Gene and Protein nodes), ChEMBL (Compound, Assay and Measurement nodes), the Experimental Factor Ontology (Disease nodes) and the Gene Ontology (Biological Process nodes), among others, have been merged into a single graph.
The resulting resource therefore displays a rich connectivity between key entities relevant to the drug discovery process. The central hypothesis of this work is that this connectivity encodes consistent (and as yet unknown) patterns that allow the inference of untested compounds’ activity in assays of interest.
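To make the notion of a heterogeneous graph concrete, the following is a minimal sketch, not the actual resource or its schema. The node types follow those named above; every identifier (`cmpd_X`, `assay_1`, ...) and every relation name (`encodes`, `targets`, ...) is a hypothetical placeholder chosen for illustration.

```python
from collections import Counter, defaultdict


class KnowledgeGraph:
    """Tiny in-memory heterogeneous graph: typed nodes, labelled edges."""

    def __init__(self):
        self.node_type = {}            # node id -> node type
        self.edges = defaultdict(set)  # head id -> {(relation, tail id)}

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, head, relation, tail):
        # Refuse edges that mention nodes which were never declared.
        for node in (head, tail):
            if node not in self.node_type:
                raise KeyError(f"unknown node: {node}")
        self.edges[head].add((relation, tail))

    def neighbours(self, node_id, relation=None):
        """Sorted tails reachable from node_id, optionally for one relation."""
        return sorted(
            tail
            for rel, tail in self.edges.get(node_id, ())
            if relation is None or rel == relation
        )


kg = KnowledgeGraph()

# Hypothetical nodes, one per node type mentioned in the text.
for node_id, node_type in [
    ("gene_G", "Gene"),
    ("protein_P", "Protein"),
    ("cmpd_X", "Compound"),
    ("assay_1", "Assay"),
    ("disease_D", "Disease"),
    ("process_B", "Biological Process"),
]:
    kg.add_node(node_id, node_type)

# Hypothetical relation names; the real graph may use different ones.
kg.add_edge("gene_G", "encodes", "protein_P")
kg.add_edge("gene_G", "associated_with", "disease_D")
kg.add_edge("protein_P", "involved_in", "process_B")
kg.add_edge("assay_1", "targets", "protein_P")
kg.add_edge("cmpd_X", "active_in", "assay_1")

type_counts = Counter(kg.node_type.values())
```

In this toy setting, a missing `active_in` edge between a compound and an assay is exactly the kind of link that a completion method would try to predict.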
Whereas much, if not most, of the recent literature on biomedical knowledge graph completion utilises deep graph embedding models, this poster highlights an approach to the task based on traversing the observable graph. Tackling the task in this manner serves both to build familiarity with the underlying data and to achieve inference model transparency and explicability, properties that are of great value in drug discovery.
Inspired by previous research with close links to logical inference (Lao et al 2011, Mitchell and Gardner 2015), the approach used in this work leverages observable, rather than latent, topological properties of the knowledge graph to infer compound activity in a set of kinase assays. The method uses the characteristics of the paths within the knowledge graph between a given candidate compound and a target assay to predict the likelihood of a direct connection between the two. Such a connection would signify that the compound is expected to show activity in the assay if tested in a laboratory experiment. Follow-on work may further investigate the suitability of knowledge graph Horn rule mining approaches as detailed by Galárraga et al (2013/2015).
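One simple way to characterise the paths between a compound and an assay is to count them by path type, that is, by the sequence of relation labels along each path. The sketch below illustrates this idea on a toy graph; it is written in the spirit of the path-based approaches cited above and is not the implementation used in this work. All entity names, relation names and the `max_len` and `held_out` parameters are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_adjacency(triples):
    """Index (head, relation, tail) triples in both directions."""
    adj = defaultdict(list)
    for head, relation, tail in triples:
        adj[head].append((relation, tail))
        adj[tail].append((relation + "^-1", head))  # inverse edge
    return adj


def path_type_counts(adj, source, target, max_len=3, held_out="active_in"):
    """Count simple paths from source to target, grouped by relation sequence.

    The direct source -> target edge labelled `held_out` is skipped, since
    that is the link whose existence is being predicted.
    """
    counts = Counter()

    def walk(node, visited, relations):
        if node == target and relations:
            counts[tuple(relations)] += 1
            return
        if len(relations) == max_len:
            return
        for relation, neighbour in adj.get(node, ()):
            if node == source and neighbour == target and relation == held_out:
                continue  # ignore the edge under prediction
            if neighbour in visited:
                continue  # keep paths simple (no repeated nodes)
            walk(neighbour, visited | {neighbour}, relations + [relation])

    walk(source, {source}, [])
    return counts


# Hypothetical toy graph: is cmpd_X likely to be active in assay_1?
triples = [
    ("cmpd_A", "similar_to", "cmpd_X"),
    ("cmpd_A", "active_in", "assay_1"),
    ("assay_1", "targets", "protein_P"),
    ("cmpd_X", "active_in", "assay_2"),
    ("assay_2", "targets", "protein_P"),
]
adj = build_adjacency(triples)
features = path_type_counts(adj, "cmpd_X", "assay_1", max_len=3)
for path_type, count in sorted(features.items()):
    print(" -> ".join(path_type), count)
```

Here each path type is directly readable, for example "a similar compound is active in this assay" or "the compound is active in another assay against the same target". Counts of this kind could serve as features for an interpretable classifier trained on known active and inactive compound-assay pairs, which is one route to the transparency discussed above.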
It is intended that the lessons from the exploration summarised in the poster will inform the development of methodologies that combine the transparency of graph traversal-based techniques with the learning potential of deep embedding techniques later on in the first author’s PhD project.