OpenBioLink2020
- 1. Section for Artificial Intelligence and Decision Support, Medical University of Vienna, Vienna, Austria
Description
The OpenBioLink2020 dataset is a highly challenging biomedical benchmark dataset containing over 5 million positive and negative edges. To provide a more realistic edge prediction scenario, the test set contains all edge types but no trivially predictable inverse edges from the training set. For further information, please see the GitHub repository.
OpenBioLink2020: directed, high quality is the default dataset that should be used for benchmarking purposes. To allow analyzing the effect of data quality as well as the directionality of the evaluation graph, four variants of OpenBioLink2020 are provided: directed and undirected, each with and without a quality cutoff.
Additionally, each graph is available in RDF N3 format (without train-validation-test splits) and BEL.
OpenBioLink is a resource and evaluation framework for evaluating link prediction models on heterogeneous biomedical graph data. It contains benchmark datasets as well as tools for creating custom benchmarks and training and evaluating models.
The OpenBioLink benchmark aims to meet the following criteria:
- Openly available
- Large-scale
- Wide coverage of current biomedical knowledge and entity types
- Standardized, balanced train-test split
- Open-source code for benchmark dataset generation
- Open-source code for evaluation (independent of model)
- Integrating and differentiating multiple types of biological entities and relations (i.e., formalized as a heterogeneous graph)
- Minimized information leakage between train and test sets (e.g., avoid inclusion of trivially inferable relations in the test set)
- Coverage of true negative relations, where available
- Differentiating high-quality data from noisy, low-quality data
- Differentiating benchmarks for directed and undirected graphs in order to be applicable to a wide variety of link prediction methods
- Clearly defined release cycle with versions of the benchmark and public leaderboard
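The criterion on minimized information leakage can be illustrated with a small check for test edges whose node pair appears in reverse order in the training set. This is a minimal sketch, not the procedure OpenBioLink uses to build its splits: the `(head, relation, tail)` tuple layout, the function name `find_inverse_leaks`, and the identifiers in the example are all hypothetical.

```python
def find_inverse_leaks(train_edges, test_edges):
    """Return test edges whose node pair occurs reversed in the training set.

    Edges are (head, relation, tail) tuples. This layout is an assumption
    made for illustration, not the benchmark's file format.
    """
    # Collect every (head, tail) pair seen in training, ignoring the relation.
    train_pairs = {(head, tail) for head, _relation, tail in train_edges}
    # A test edge is flagged when its reversed pair was seen in training.
    return [
        (head, relation, tail)
        for head, relation, tail in test_edges
        if (tail, head) in train_pairs
    ]


# Hypothetical identifiers, used only to show the idea.
train = [
    ("GENE:1", "GENE_DRUG", "DRUG:7"),
    ("GENE:2", "GENE_GENE", "GENE:3"),
]
test = [
    ("DRUG:7", "DRUG_GENE", "GENE:1"),  # reverse of a training edge
    ("GENE:2", "GENE_DIS", "DIS:9"),    # no counterpart in training
]
leaks = find_inverse_leaks(train, test)
```

An empty result for a given train-test split would suggest that no test edge can be predicted simply by reversing a training edge.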
Please note that the OpenBioLink benchmark files contain data derived from external resources. Licensing terms of these external resources are detailed here.
Files (2.1 GB)
MD5 checksum | Size |
---|---|
9c150bdd971c786e7d25c8db12dcc21f | 700.1 MB |
63c38a79f4f651208f7748722c76e6a6 | 573.7 MB |
a34a4ad3b93b6a5a1aadaa88b1a0671c | 109.4 MB |
932a79f74dd5f868296dcd9b9de23947 | 97.5 MB |
be864b81c737832807faeb96827f1299 | 36.9 MB |
8ac3a14b15efd89b2d3dd531f67291ce | 224.7 MB |
670491703b28b60e4fff80172c4c54f8 | 161.3 MB |
3838a1fb33cf6a38d8bcdf7e1928b709 | 129.1 MB |
2c641f3b62b3135e285bc0ab7e8a73c7 | 27.3 MB |
7329604d1702e7bd0a16a62280309940 | 24.3 MB |
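The MD5 checksums listed for each file can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `verify_download` are hypothetical, and which checksum belongs to which file must be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until handle.read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify_download("downloaded_file.zip", "9c150bdd971c786e7d25c8db12dcc21f")` would return `True` only if the local file (the path here is a placeholder) is byte-identical to the file published with that checksum. Reading in chunks keeps memory use low for the larger archives.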
Additional details
References
- Anna Breit, Simon Ott, Asan Agibetov, Matthias Samwald, OpenBioLink: A benchmarking framework for large-scale biomedical link prediction, Bioinformatics, btaa274, https://doi.org/10.1093/bioinformatics/btaa274