The following Python script downloads the BioKG knowledge graph from the Open Graph Benchmark (OGB).
import numpy as np
import copy
import json
from ogb.linkproppred import LinkPropPredDataset
dataset = LinkPropPredDataset(name = "ogbl-biokg")
split_edge = dataset.get_edge_split()
train_edge, valid_edge, test_edge = split_edge["train"], split_edge["valid"], split_edge["test"]
graph = dataset[0] # graph: library-agnostic graph object
The edge index in the graph object is converted to JSON so that it can be read into R.
print(graph.keys())
edge_index = graph["edge_index_dict"].copy()
JSON object keys must be strings, so the dictionary keys, which are tuples, must be renamed before the conversion. Each tuple is joined into a single string with "--" as the separator.
old_keys = list(edge_index.keys())
for old_name in old_keys:
    new_name = "--".join(old_name)
    edge_index[new_name] = edge_index[old_name]
    del edge_index[old_name]
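The renaming step can be tried on a toy dictionary. The sketch below assumes keys of the form (head type, relation, tail type); the toy entries and the helper name `stringify_keys` are hypothetical and not part of the OGB API.

```python
import json

def stringify_keys(d, sep="--"):
    """Return a copy of d whose tuple keys are joined into strings (hypothetical helper)."""
    return {sep.join(key): value for key, value in d.items()}

# toy stand-in for graph["edge_index_dict"]; each value is [heads, tails]
toy_edge_index = {
    ("drug", "drug-protein", "protein"): [[0, 1], [5, 7]],
    ("disease", "disease-protein", "protein"): [[2], [7]],
}

# json.dumps rejects tuple keys with a TypeError
try:
    json.dumps(toy_edge_index)
    tuple_keys_ok = True
except TypeError:
    tuple_keys_ok = False

renamed = stringify_keys(toy_edge_index)
encoded = json.dumps(renamed)

# splitting on the separator recovers the original triples
recovered = [tuple(name.split("--")) for name in renamed]
```

The round trip only works if no type or relation name contains the separator itself. The toy relation names use single hyphens, so "--" is unambiguous here.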
A custom JSON encoder for NumPy types is defined, borrowed from this StackOverflow post.
class NumpyEncoder(json.JSONEncoder):
    """ Special json encoder for numpy types """
    def default(self, obj):
        if isinstance(obj, (np.int_, np.intc, np.intp, np.int8,
                            np.int16, np.int32, np.int64, np.uint8,
                            np.uint16, np.uint32, np.uint64)):
            return int(obj)
        elif isinstance(obj, (np.float16, np.float32, np.float64)):
            # np.float_ is omitted: it was an alias of np.float64, removed in NumPy 2.0
            return float(obj)
        elif isinstance(obj, (np.ndarray,)):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)
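As a small illustration of how such an encoder behaves, the sketch below serializes a toy dictionary holding NumPy values. It is a hypothetical variant, not the encoder from the post: it checks against NumPy's abstract scalar types `np.integer` and `np.floating` instead of enumerating concrete types, and the class name `CompactNumpyEncoder` and the toy data are assumptions made for the example.

```python
import json
import numpy as np

class CompactNumpyEncoder(json.JSONEncoder):
    """Hypothetical variant of the encoder using NumPy's abstract scalar types."""
    def default(self, obj):
        # default() is only called for objects json cannot serialize natively
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)

# toy data: an edge array plus two NumPy scalars
toy = {
    "drug--drug-protein--protein": np.array([[0, 1], [5, 7]]),
    "n_edges": np.int64(2),
    "weight": np.float32(0.5),
}

encoded = json.dumps(toy, cls=CompactNumpyEncoder)
decoded = json.loads(encoded)
```

After the round trip, `decoded` contains only plain Python lists, integers, and floats, which is the form a JSON reader in R expects.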
Finally, we convert the edge index to JSON and write it to a file.
edge_index_json = json.dumps(edge_index, cls = NumpyEncoder)
# open in write mode so that rerunning the script does not append a second JSON document
with open('inst/extdata/edge_index.json', 'w') as f:
    f.write(edge_index_json + '\n')
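Before switching to R, it can be worth reading the file back as a sanity check. The sketch below uses a toy dictionary and a hypothetical scratch path in a temporary directory. It also shows that a file opened in append mode on a second run no longer holds a single JSON document, whereas write mode keeps it valid.

```python
import json
import os
import tempfile

toy = {"drug--drug-protein--protein": [[0, 1], [5, 7]]}

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "edge_index.json")  # hypothetical scratch path

# first run: the file holds exactly one JSON document
with open(path, "a") as f:
    f.write(json.dumps(toy) + "\n")
with open(path) as f:
    first_read = json.load(f)

# second run in append mode: two documents, which json.load rejects
with open(path, "a") as f:
    f.write(json.dumps(toy) + "\n")
try:
    with open(path) as f:
        json.load(f)
    second_read_ok = True
except json.JSONDecodeError:
    second_read_ok = False

# write mode truncates the file first, so reruns stay valid
with open(path, "w") as f:
    f.write(json.dumps(toy) + "\n")
with open(path) as f:
    final_read = json.load(f)
```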
In R, read the JSON file saved previously.
# load libraries
library(data.table)
library(purrr)
library(magrittr)
library(rjson)
# load metapaths library
library(metapaths)
# read data
biokg = fromJSON(file = "inst/extdata/edge_index.json")
Create a function to convert the edge list to a data.table. Note that the node IDs are specific to each type, so we must add a type-specific prefix.
convert_biokg = function(sub_kg, sub_label) {
  # split label into origin type, edge type, and destination type
  split_label = strsplit(sub_label, "--")[[1]]
  # create data table, prefixing each node ID with its type
  kg_dt = data.table(Origin = paste(split_label[1], sub_kg[[1]], sep = "_"),
                     Destination = paste(split_label[3], sub_kg[[2]], sep = "_"),
                     OriginType = split_label[1], DestinationType = split_label[3],
                     EdgeType = split_label[2])
  # return the data table explicitly
  return(kg_dt)
}
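For readers following along on the Python side, the same flattening can be sketched in plain Python. The helper name `to_edge_rows` and the toy data are hypothetical. The point is that node 0 of type drug and node 0 of type protein are different nodes, which only the type prefix makes visible.

```python
def to_edge_rows(label, pair, sep="--"):
    """Flatten one relation into rows with type-prefixed node IDs (hypothetical helper)."""
    origin_type, edge_type, dest_type = label.split(sep)
    heads, tails = pair
    return [
        {
            "Origin": f"{origin_type}_{h}",
            "Destination": f"{dest_type}_{t}",
            "OriginType": origin_type,
            "DestinationType": dest_type,
            "EdgeType": edge_type,
        }
        for h, t in zip(heads, tails)
    ]

# toy edge index with string keys; each value is [heads, tails]
toy = {
    "drug--drug-protein--protein": [[0, 1], [0, 7]],
    "disease--disease-protein--protein": [[0], [7]],
}

rows = [row for label, pair in toy.items() for row in to_edge_rows(label, pair)]

# raw integer IDs collide across types; prefixed IDs do not
raw_ids = {i for pair in toy.values() for ids in pair for i in ids}
node_ids = {row["Origin"] for row in rows} | {row["Destination"] for row in rows}
```

In this toy graph the raw IDs suggest three nodes, while the prefixed IDs correctly distinguish five.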
Now, map the conversion function over the biokg list.
biokg_edge_list = imap_dfr(biokg, convert_biokg)
biokg_node_list = get_node_list(biokg_edge_list)
head(biokg_edge_list)
Check that the counts of each node type conform with graph["num_nodes_dict"] from the Python script.
| disease | drug | function | protein | sideeffect |
|---|---|---|---|---|
| 10687 | 10533 | 45085 | 17499 | 9969 |
table(biokg_node_list$NodeType)
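The same comparison can be sketched on the Python side. The snippet below counts node types from type-prefixed IDs and compares them with an expected dictionary. Both the toy IDs and the `expected` values are hypothetical stand-ins, with `expected` playing the role of graph["num_nodes_dict"].

```python
from collections import Counter

# toy type-prefixed node IDs, as produced by the conversion step
node_ids = ["drug_0", "drug_1", "protein_0", "protein_7", "disease_0"]

# hypothetical stand-in for graph["num_nodes_dict"]
expected = {"disease": 1, "drug": 2, "protein": 2}

# the node type is everything before the last underscore
observed = Counter(node_id.rsplit("_", 1)[0] for node_id in node_ids)

# collect (observed, expected) pairs for any type whose counts differ
mismatches = {
    node_type: (observed.get(node_type, 0), count)
    for node_type, count in expected.items()
    if observed.get(node_type, 0) != count
}
```

Counts derived from an edge list only include nodes that appear in at least one edge, so a node type with isolated nodes could show a lower observed count than the expected one.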
Randomly sample the knowledge graph to generate a small test set.
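# build a directed igraph object from the edge and node lists
# (graph_from_data_frame is the current name of graph.data.frame)
biokg_graph = igraph::graph_from_data_frame(biokg_edge_list,
                                            vertices = biokg_node_list,
                                            directed = TRUE)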
Save node list and edge list to file.