Link-prediction on Biomedical Knowledge Graphs

Cattaneo, Alberto; Justus, Daniel; Bonner, Stephen; Martynec, Thomas

doi:10.5281/zenodo.15740200

Published June 27, 2025 | Version 2.0

Dataset Open

1. Graphcore (United Kingdom)

Release of code and experimental data from the papers "Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion" (Machine Learning for Life and Material Sciences workshop @ ICML 2024) and "The Role of Graph Topology in the Performance of Biomedical Knowledge Graph Completion Models".

Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, such as drug repurposing and drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though the theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions, we invite the community to build upon our work and continue improving the understanding of these crucial applications.

Experiments were conducted on six datasets: five from the biomedical domain (Hetionet, PrimeKG, PharmKG, OpenBioLink2020 HQ, PharMeBINet) and one trivia KG (FB15k-237). All datasets were randomly split into training, validation and test sets (80% / 10% / 10%; for PharMeBINet, 99.3% / 0.35% / 0.35%, to mitigate the higher inference cost on the larger dataset).
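The random split described above can be sketched as follows. This is a minimal illustration, not the procedure used to produce the released splits: the function name `random_split`, the seed, and the rounding of split sizes are all assumptions.

```python
import numpy as np


def random_split(triples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the rows of an (n, 3) array of (h_ID, r_ID, t_ID) triples
    and cut it into train / validation / test parts.

    Hypothetical helper: the seed and rounding are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(triples))
    n_train = int(round(fractions[0] * len(triples)))
    n_valid = int(round(fractions[1] * len(triples)))
    train = triples[perm[:n_train]]
    valid = triples[perm[n_train:n_train + n_valid]]
    test = triples[perm[n_train + n_valid:]]  # remainder goes to test
    return train, valid, test


# Toy graph with 100 triples (not real data).
triples = np.arange(300).reshape(100, 3)
train, valid, test = random_split(triples)
```

For PharMeBINet, the same sketch would be called with `fractions=(0.993, 0.0035, 0.0035)`.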

On each dataset, five different KGE models were compared: TransE, DistMult, RotatE, TripleRE, ConvE. Hyperparameters were tuned on the validation split (see the final training configurations in train/scripts). We release results for tail predictions on the test split. In particular, each test query (h,r,?) is scored against all entities in the KG and we compute the rank of the score of the correct completion (h,r,t), after masking out the scores of other (h,r,t') triples contained in the graph.

Note: the ranks provided are computed as the average between the optimistic and pessimistic ranks of triple scores.
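To make the ranking protocol concrete, here is a small sketch of a filtered rank computed as the average of the optimistic and pessimistic ranks. It assumes that higher scores are better and that ties are the only difference between the two ranks; the function name `filtered_tail_rank` and the toy scores are hypothetical and not taken from the released code.

```python
import numpy as np


def filtered_tail_rank(scores, true_tail, known_tails):
    """Rank of the true tail for one query (h, r, ?).

    scores:      (n_entities,) score of (h, r, e) for every entity e,
                 assumed here to be higher-is-better.
    true_tail:   ID of the ground-truth tail t.
    known_tails: IDs of other tails t' such that (h, r, t') is in the graph.
    """
    scores = np.asarray(scores, dtype=float)
    keep = np.ones(len(scores), dtype=bool)
    keep[np.asarray(list(known_tails), dtype=int)] = False  # mask other true tails
    keep[true_tail] = True  # the query target is never masked
    candidates = scores[keep]
    target = scores[true_tail]
    optimistic = 1 + np.sum(candidates > target)   # ties resolved in our favour
    pessimistic = np.sum(candidates >= target)     # ties resolved against us
    return 0.5 * (optimistic + pessimistic)


# Toy example: entity 0 is another known tail, entities 1 and 2 are tied.
toy_scores = [0.9, 0.5, 0.5, 0.7, 0.1]
rank_filtered = filtered_tail_rank(toy_scores, true_tail=1, known_tails=[0])
rank_raw = filtered_tail_rank(toy_scores, true_tail=1, known_tails=[])
```

In this toy case the filtered rank is 2.5 (optimistic 2, pessimistic 3), while the unfiltered rank is 3.5. Averaging is also why ranks can take non-integer values.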

Inside experimental_data.zip, the following files are provided.

datasets/{dataset}: a folder for each dataset, containing
- {dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID,r_ID,t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy;
- test_ranks.csv: csv table with columns ["h", "r", "t"] specifying the head, relation, tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE", "ConvE"] with the rank of the ground-truth tail in the ordered list of predictions made by the five KGE models;
- entity_dict.csv: list of entity labels, ordered by entity ID (as generated in the preprocessing notebook);
- relation_dict.csv: list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).
train: code to reproduce training (and validation) of the five KGE models, using the BESS-KGE distribution framework.
- train/scripts: executable scripts, with specifications of the final hyperparameters for all models and datasets.
notebooks: Jupyter notebooks for data analysis and generation of all the figures in the paper.
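As an illustration of how a test_ranks.csv table could be summarized, the sketch below computes the mean reciprocal rank (MRR) and Hits@K per model from the rank columns listed above. It assumes a comma-separated file with a header row containing exactly those column names; the helper `summarize_ranks` and the toy rows are hypothetical, not part of the release.

```python
import csv
import io

MODELS = ["DistMult", "TransE", "RotatE", "TripleRE", "ConvE"]


def summarize_ranks(csv_file, ks=(1, 10)):
    """Compute MRR and Hits@K per model from an open test_ranks.csv file.

    Assumes one header row with the model names as column names.
    """
    ranks = {m: [] for m in MODELS}
    for row in csv.DictReader(csv_file):
        for m in MODELS:
            ranks[m].append(float(row[m]))  # ranks may be fractional (ties)
    summary = {}
    for m, r in ranks.items():
        stats = {"MRR": sum(1.0 / x for x in r) / len(r)}
        for k in ks:
            stats[f"Hits@{k}"] = sum(x <= k for x in r) / len(r)
        summary[m] = stats
    return summary


# Toy table standing in for a real file (made-up ranks).
toy_csv = io.StringIO(
    "h,r,t,DistMult,TransE,RotatE,TripleRE,ConvE\n"
    "0,0,5,1,2,1,4,1\n"
    "3,1,7,2.5,20,1,1,10\n"
)
summary = summarize_ranks(toy_csv)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`, where `path` points to the test_ranks.csv of the chosen dataset.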

The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the five KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).
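A possible way to work with these arrays is sketched below: given one (n_test_triples, 100) array of predicted tail IDs and the "t" column of the matching test_ranks.csv, it checks how often the true tail appears among the first K predictions. The array names inside the .npz archive are an assumption here (the toy archive uses model names as keys), and the description does not state whether known tails are filtered from the lists, so values may differ from Hits@K derived from the released ranks.

```python
import io

import numpy as np


def hits_at_k_from_topk(top_preds, true_tails, k=10):
    """Fraction of queries whose true tail is among the first k predictions.

    top_preds:  (n_queries, n_preds) entity IDs, ordered by decreasing likelihood.
    true_tails: (n_queries,) ground-truth tail IDs, in the same row order.
    """
    top_preds = np.asarray(top_preds)
    true_tails = np.asarray(true_tails)
    found = (top_preds[:, :k] == true_tails[:, None]).any(axis=1)
    return float(found.mean())


# Toy in-memory archive standing in for the real .npz file (made-up IDs,
# 3 predictions per query instead of 100; key names are assumed).
buffer = io.BytesIO()
np.savez(
    buffer,
    DistMult=np.array([[5, 2, 9], [1, 7, 3]]),
    TransE=np.array([[2, 9, 5], [4, 3, 1]]),
)
buffer.seek(0)
archive = np.load(buffer)

true_tails = np.array([5, 7])
hits = {
    name: hits_at_k_from_topk(archive[name], true_tails, k=2)
    for name in archive.files
}
```

Listing `archive.files` on the real archive is a quick way to find the actual array names before indexing.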

Files (1.5 GB)

Name	MD5	Size
experimental_data.zip	7fa9170146a6e3a0c94589a7f7b2ac29	65.5 MB
top_100_tail_predictions.zip	0e226efa1ab4b0778f26822c35204830	1.4 GB

Additional details

Available: 2025-06-27

Repository URL: https://github.com/graphcore-research/kg-topology-toolbox
Programming language: Python
Development Status: Active
