Ontology Enrichment from Texts (OET): A Biomedical Dataset for Concept Discovery and Placement
Creators
- 1. University of Oxford
- 2. University of Manchester
Description
A biomedical dataset supporting ontology enrichment from texts through concept discovery and placement. It adapts the MedMentions dataset (PubMed abstracts) with the 2014 and 2017 versions of SNOMED CT, under the Diseases (disorder) sub-category and the broader categories of Clinical finding, Procedure, and Pharmaceutical / biologic product (CPP).
The dataset is documented in the work, Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement, on arXiv: https://arxiv.org/abs/2306.14704 (CIKM 2023). The companion code is available at https://github.com/KRR-Oxford/OET.
Out-of-KB mention discovery (including the settings of the mention-level data) is also partly documented in the work, Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking, on arXiv: https://arxiv.org/abs/2302.07189 (CIKM 2023).
ver3: we revised and updated the mention-level data (syn_full, the synonym augmentation setting) and the folder structure, and updated the edge catalogues with complex edges.
ver2: we revised the mention-level data to keep only out-of-KB mentions (or "NIL" mentions) associated with one-hop edges (including leaf nodes, represented as <leaf node, NULL>) and two-hop edges in the ontology (SNOMED CT 20140901).
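To illustrate the edge types mentioned in the ver2 note, the following is a minimal sketch that enumerates one-hop edges (including leaf nodes as <leaf node, NULL>) and two-hop edges over a toy hierarchy. The function name `enumerate_edges`, the (parent, child) orientation, the use of `None` for NULL, and the toy concept labels are all assumptions for illustration; they are not taken from the dataset files or the companion code.

```python
from typing import Dict, List, Optional, Tuple

# An edge is (parent, child); child is None for a leaf node (the NULL case).
Edge = Tuple[str, Optional[str]]


def enumerate_edges(children: Dict[str, List[str]]) -> Tuple[List[Edge], List[Edge]]:
    """Return (one_hop, two_hop) edges of a hierarchy given as parent -> children."""
    one_hop: List[Edge] = []
    two_hop: List[Edge] = []
    # Collect every node, including those that only ever appear as a child.
    nodes = set(children) | {c for kids in children.values() for c in kids}
    for parent in sorted(nodes):
        kids = children.get(parent, [])
        if not kids:
            # Leaf node: represented as <leaf node, NULL>.
            one_hop.append((parent, None))
            continue
        for child in kids:
            one_hop.append((parent, child))
            # Two-hop edge: parent to grandchild, skipping the node in between.
            for grandchild in children.get(child, []):
                two_hop.append((parent, grandchild))
    return one_hop, two_hop


# Hypothetical toy hierarchy (labels are illustrative, not SNOMED CT content).
toy = {"Disease": ["Infection", "Neoplasm"], "Infection": ["Viral infection"]}
one_hop, two_hop = enumerate_edges(toy)
print(one_hop)
print(two_hop)
```

In a hierarchy with multiple inheritance, the same two-hop pair can be reached through different intermediate nodes, so a real catalogue would likely need de-duplication.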
Acknowledgement of data sources and tools:
* SNOMED CT https://www.nlm.nih.gov/healthit/snomedct/archive.html (converted to .owl files with snomed-owl-toolkit)
* UMLS https://www.nlm.nih.gov/research/umls/licensedcontent/umlsarchives04.html (mainly MRCONSO, used for mapping UMLS to SNOMED CT)
* MedMentions https://github.com/chanzuckerberg/MedMentions (source of entity linking)
* Protege http://protegeproject.github.io/protege/
* snomed-owl-toolkit https://github.com/IHTSDO/snomed-owl-toolkit
* DeepOnto https://github.com/KRR-Oxford/DeepOnto (based on OWLAPI https://owlapi.sourceforge.net/) for ontology processing and complex concept verbalisation
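As a rough illustration of how MRCONSO can be used to map UMLS to SNOMED CT, the sketch below reads pipe-delimited MRCONSO rows and collects SNOMED CT codes per UMLS CUI. It assumes the standard MRCONSO.RRF column order (CUI at index 0, SAB at index 11, CODE at index 13) and the source abbreviation `SNOMEDCT_US`. The function name `cui_to_snomed` and the sample rows are hypothetical placeholders, not real UMLS content, and this is not necessarily how the dataset authors performed the mapping.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Column positions in MRCONSO.RRF (assumed standard UMLS layout).
CUI_COL, SAB_COL, CODE_COL = 0, 11, 13


def cui_to_snomed(rows: Iterable[str], sab: str = "SNOMEDCT_US") -> Dict[str, Set[str]]:
    """Map each UMLS CUI to the set of SNOMED CT codes found in MRCONSO rows."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        fields = row.rstrip("\n").split("|")
        if len(fields) <= CODE_COL:
            continue  # skip malformed or truncated rows
        if fields[SAB_COL] == sab:
            mapping[fields[CUI_COL]].add(fields[CODE_COL])
    return dict(mapping)


# Placeholder rows with made-up identifiers, shaped like MRCONSO lines.
sample_rows = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||111111||SNOMEDCT_US|PT|111111|Example disorder|9|N||",
    "C0000001|ENG|S|L0000002|PF|S0000002|N|A0000002||D000001||MSH|MH|D000001|Example disorder|0|N||",
]
mapping = cui_to_snomed(sample_rows)
print(mapping)
```

With a licensed UMLS release, the same function could be applied to an open file handle, for example `cui_to_snomed(open("MRCONSO.RRF", encoding="utf-8"))`, where the path is whatever location the release was extracted to.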
Files
OET-data-ver3.zip (526.4 MB)
md5:7dee113d009878a5a7249ee071e85d57
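After downloading, the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib`; the function name `md5_of_file` and the assumption that the archive sits in the current working directory are illustrative only.

```python
import hashlib
from pathlib import Path

# Checksum as listed on this record for OET-data-ver3.zip.
EXPECTED_MD5 = "7dee113d009878a5a7249ee071e85d57"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the archive is never fully loaded in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumed local path: the archive in the current working directory.
archive = Path("OET-data-ver3.zip")
if archive.exists():
    matches = md5_of_file(archive) == EXPECTED_MD5
    print("checksum OK" if matches else "checksum mismatch")
else:
    print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be downloaded again.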