Bio-ML: Machine Learning-Friendly Biomedical Datasets for Equivalence and Subsumption Ontology Matching

Yuan He; Jiaoyan Chen; Hang Dong; Ernesto Jiménez-Ruiz; Ali Hadian; Ian Horrocks

doi:10.5281/zenodo.13119437

Published July 28, 2024 | Version OAEI Bio-ML 2024

Dataset Open

1. University of Oxford
2. City, University of London
3. Samsung Research UK

This version is used in the Bio-ML track of OAEI 2024; the only change from the OAEI 2023 version is the deletion of certain training subsumption mappings.

Overview

The purpose of these datasets is to support equivalence and subsumption ontology matching.

There are five ontology pairs extracted from MONDO and UMLS:

Source	Task	Category	#SrcCls	#TgtCls	#Ref (equiv)	#Ref (subs)
Mondo	OMIM-ORDO	Disease	9,648	9,275	3,721	103
Mondo	NCIT-DOID	Disease	15,762	8,465	4,686	3,338 (-1)
UMLS	SNOMED-FMA	Body	34,418	88,955	7,256	5,453 (-53)
UMLS	SNOMED-NCIT	Pharm	29,500	22,136	5,803	4,224 (-1)
UMLS	SNOMED-NCIT	Neoplas	22,971	20,247	3,804	213

The negative numbers in parentheses reflect the changes due to the deletion of certain training subsumption mappings.

The main track is available at "bio-ml", where each pair is associated with a task folder containing the source and target ontologies, the reference equivalence mappings (in "refs_equiv"), and the reference subsumption mappings (in "refs_subs").

The special sub-track is available at "bio-llm", where each pair is associated with a task folder, containing the source and target ontologies, and the test candidate mappings.
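As a rough illustration of working with a task folder, the sketch below reads a mapping file into a list of (source, target) pairs. It assumes the mappings are stored as tab-separated text with a header row; the column names `SrcEntity` and `TgtEntity` and the example path are assumptions made for illustration, not details stated on this page, so check them against the detailed documentation before relying on them.

```python
import csv
from pathlib import Path


def load_mappings(path, src_col="SrcEntity", tgt_col="TgtEntity"):
    """Read (source, target) pairs from a tab-separated file with a header row.

    The default column names are assumptions; adjust them to the actual header.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [(row[src_col], row[tgt_col]) for row in reader]


if __name__ == "__main__":
    # Hypothetical location inside an unzipped task folder.
    example = Path("omim-ordo/refs_equiv/full.tsv")
    if example.exists():
        pairs = load_mappings(example)
        print(f"{len(pairs)} mappings loaded from {example}")
    else:
        print(f"{example} not found; point the path at a real mapping file")
```

Passing the column names as parameters keeps the helper usable if the files in "refs_equiv", "refs_subs", or the "bio-llm" candidate mappings use different headers.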

Citation

Bio-ML (Main Track)

```
@inproceedings{he2022machine,
  title={Machine learning-friendly biomedical datasets for equivalence and subsumption ontology matching},
  author={He, Yuan and Chen, Jiaoyan and Dong, Hang and Jim{\'e}nez-Ruiz, Ernesto and Hadian, Ali and Horrocks, Ian},
  booktitle={International Semantic Web Conference},
  pages={575--591},
  year={2022},
  organization={Springer}
}
```

Bio-LLM (Sub-track)

```
@article{he2023exploring,
  title={Exploring large language models for ontology alignment},
  author={He, Yuan and Chen, Jiaoyan and Dong, Hang and Horrocks, Ian},
  journal={arXiv preprint arXiv:2309.07172},
  year={2023}
}
```

Important Links

See detailed documentation at: https://krr-oxford.github.io/DeepOnto/bio-ml.
See the OAEI Bio-ML track at: https://www.cs.ox.ac.uk/isg/projects/ConCur/oaei/
See our resource paper for the original Bio-ML at arxiv or springer (accepted at ISWC-2022 and nominated as a best resource paper candidate).
See our poster paper for the Bio-LLM sub-track at arxiv (accepted at ISWC-2023 Posters & Demos).

Changelog

The only change in this version compared to the OAEI 2023 version is the deletion of certain training subsumption mappings that can be directly exploited through deductive reasoning.

Files

Total size: 41.7 MB

Name	MD5	Size
ncit-doid.zip	d5d82bb5c17c89a2b5f9aa9a5b7ad15a	5.9 MB
omim-ordo.zip	7adf13e1726071f8922fdae9d653aeed	4.5 MB
snomed-fma.body.zip	94c9ec76584a204aa7dc6bf5985ba66d	13.2 MB
snomed-ncit.neoplas.zip	960c58698e4cc1ebbd2bde33c49bf1bd	7.2 MB
snomed-ncit.pharm.zip	c4ed2701ff569b2cb92e37714636a112	10.9 MB
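
The MD5 digests listed above can be used to check that a download is intact. The following is a minimal sketch using only the Python standard library; it assumes the archives were saved under their original names in a single directory, and the helper names `md5_of` and `verify` are illustrative.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing above.
EXPECTED_MD5 = {
    "ncit-doid.zip": "d5d82bb5c17c89a2b5f9aa9a5b7ad15a",
    "omim-ordo.zip": "7adf13e1726071f8922fdae9d653aeed",
    "snomed-fma.body.zip": "94c9ec76584a204aa7dc6bf5985ba66d",
    "snomed-ncit.neoplas.zip": "960c58698e4cc1ebbd2bde33c49bf1bd",
    "snomed-ncit.pharm.zip": "c4ed2701ff569b2cb92e37714636a112",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each archive name to True (match), False (mismatch), or None (absent)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.exists() else None
    return results


if __name__ == "__main__":
    labels = {True: "ok", False: "MISMATCH", None: "not found"}
    for name, status in verify().items():
        print(f"{name}: {labels[status]}")
```

MD5 is adequate here for detecting corrupted or truncated downloads; it is not a defence against deliberate tampering.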
