There is a newer version of the record available.

Published April 27, 2020 | Version 1.0.1
Dataset Open

OpenBioLink2020

  • 1. Section for Artificial Intelligence and Decision Support, Medical University of Vienna, Vienna, Austria

Description

The OpenBioLink2020 dataset is a highly challenging biomedical benchmark dataset containing over 5 million positive and negative edges. To provide a more realistic edge prediction scenario, the test set excludes trivially predictable inverses of training edges and covers all edge types. For further information, see the GitHub repository.
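To illustrate what "trivially predictable inverse edges" means, the following minimal sketch checks a test set for edges whose reversed node pair already occurs in the training set. It is not the OpenBioLink implementation: the `(head, relation, tail)` tuple layout, the function name `find_inverse_leaks`, and all node and relation identifiers are hypothetical and chosen for illustration only.

```python
from typing import Iterable, Set, Tuple

# Assumed edge layout for this sketch: (head, relation, tail).
Edge = Tuple[str, str, str]


def find_inverse_leaks(train: Iterable[Edge], test: Iterable[Edge]) -> Set[Edge]:
    """Return test edges whose reversed node pair also appears in train."""
    # Collect every (head, tail) pair seen in training, ignoring the relation.
    train_pairs = {(h, t) for h, _, t in train}
    # A test edge (h, r, t) counts as leaked if (t, h) was seen in training.
    return {(h, r, t) for h, r, t in test if (t, h) in train_pairs}


# Hypothetical toy data, not taken from the dataset.
train_edges = [
    ("GENE:1", "GENE_GENE", "GENE:2"),
    ("DRUG:7", "DRUG_BINDS_GENE", "GENE:1"),
]
test_edges = [
    ("GENE:2", "GENE_GENE", "GENE:1"),        # inverse of a training edge
    ("DRUG:7", "DRUG_BINDS_GENE", "GENE:2"),  # unseen in either direction
]

leaks = find_inverse_leaks(train_edges, test_edges)
print(leaks)
```

In this toy example only the first test edge is reported. A stricter check might also require the relation types to match; the sketch ignores the relation on purpose so that it errs on the side of flagging more edges.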

OpenBioLink2020: directed, high quality is the default dataset and should be used for benchmarking purposes. To allow analyzing the effect of data quality and of the directionality of the evaluation graph, four variants of OpenBioLink2020 are provided: directed and undirected, each with and without quality cutoff.

Additionally, each graph is available in RDF N3 format (without train-validation-test splits) and BEL.

OpenBioLink is a resource and evaluation framework for evaluating link prediction models on heterogeneous biomedical graph data. It contains benchmark datasets as well as tools for creating custom benchmarks and training and evaluating models.

The OpenBioLink benchmark aims to meet the following criteria:

  • Openly available
  • Large-scale
  • Wide coverage of current biomedical knowledge and entity types
  • Standardized, balanced train-test split
  • Open-source code for benchmark dataset generation
  • Open-source code for evaluation (independent of model)
  • Integrating and differentiating multiple types of biological entities and relations (i.e., formalized as a heterogeneous graph)
  • Minimized information leakage between train and test sets (e.g., avoid inclusion of trivially inferable relations in the test set)
  • Coverage of true negative relations, where available
  • Differentiating high-quality data from noisy, low-quality data
  • Differentiating benchmarks for directed and undirected graphs in order to be applicable to a wide variety of link prediction methods
  • Clearly defined release cycle with versions of the benchmark and public leaderboard

Please note that the OpenBioLink benchmark files contain data derived from external resources. Licensing terms of these external resources are detailed here.

Files

ALL_DIR.zip

Files (2.1 GB)

Checksum (MD5)                        Size
md5:9c150bdd971c786e7d25c8db12dcc21f  700.1 MB
md5:63c38a79f4f651208f7748722c76e6a6  573.7 MB
md5:a34a4ad3b93b6a5a1aadaa88b1a0671c  109.4 MB
md5:932a79f74dd5f868296dcd9b9de23947  97.5 MB
md5:be864b81c737832807faeb96827f1299  36.9 MB
md5:8ac3a14b15efd89b2d3dd531f67291ce  224.7 MB
md5:670491703b28b60e4fff80172c4c54f8  161.3 MB
md5:3838a1fb33cf6a38d8bcdf7e1928b709  129.1 MB
md5:2c641f3b62b3135e285bc0ab7e8a73c7  27.3 MB
md5:7329604d1702e7bd0a16a62280309940  24.3 MB
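
The MD5 checksums listed above can be used to verify a download. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and because this listing does not show which file name belongs to which checksum, the pairing has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so that large archives need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("downloaded_file.zip", "md5:<checksum from the listing>")` returns `True` only if the local file is byte-identical to the published one; the file name here is a placeholder.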
