Published December 26, 2023 | Version v4
Dataset Open

Ontology Enrichment from Texts (OET): A Biomedical Dataset for Concept Discovery and Placement

  • 1. University of Oxford
  • 2. University of Manchester

Description

A biomedical dataset supporting ontology enrichment from texts through concept discovery and placement. It adapts the MedMentions dataset (PubMed abstracts) using the 2014 and 2017 versions of SNOMED CT, under the Diseases (disorder) sub-category and the broader categories of Clinical finding, Procedure, and Pharmaceutical / biologic product (together, CPP).

The dataset is documented in the paper "Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement" (CIKM 2023), available on arXiv: https://arxiv.org/abs/2306.14704. The companion code is available at https://github.com/KRR-Oxford/OET.

Out-of-KB mention discovery (including the settings of the mention-level data) is also partly documented in the paper "Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking" (CIKM 2023), available on arXiv: https://arxiv.org/abs/2302.07189.

ver4: we now provide separate versions of the mention-level data for out-of-KB discovery and for concept placement. The out-of-KB discovery version includes out-of-KB mentions in the training data, whereas the concept placement version includes out-of-KB mentions only in the evaluation data (validation and test) and not in the training data. We also split the original "test-NIL.jsonl" (now "test-NIL-all.jsonl") into "valid-NIL.jsonl" and "test-NIL.jsonl" for better evaluation.

ver3: we revised and updated the mention-level data (syn_full, the synonym augmentation setting) and the folder structure, and updated the edge catalogues with complex edges.

ver2: we revised the mention-level data to keep only the out-of-KB mentions (or "NIL" mentions) associated with one-hop edges (including leaf nodes, represented as <leaf node, NULL>) and two-hop edges in the ontology (SNOMED CT 20140901).
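
As a quick way to inspect the ver4 split files named above (for example "valid-NIL.jsonl" and "test-NIL.jsonl"), the following is a minimal sketch that reads a JSON Lines file and reports the record count and the top-level field names. It assumes only that each non-empty line is one JSON object; the record schema and the location of the files inside the archive are not stated here, and the helper names read_jsonl and summarise are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def summarise(path: Path) -> tuple[int, list[str]]:
    """Return the record count and the sorted union of top-level keys.

    Assumption: every record is a JSON object; field names are discovered
    from the data rather than hard-coded, since the schema is not given here.
    """
    count = 0
    keys: set[str] = set()
    for record in read_jsonl(path):
        count += 1
        keys.update(record)
    return count, sorted(keys)
```

For example, summarise(Path("valid-NIL.jsonl")) would return the number of mentions in the validation split together with the field names found, assuming the file has been extracted to the working directory.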

Acknowledgement of data sources and tools below:

* SNOMED CT https://www.nlm.nih.gov/healthit/snomedct/archive.html (converted to .owl files with snomed-owl-toolkit)
* UMLS https://www.nlm.nih.gov/research/umls/licensedcontent/umlsarchives04.html (mainly MRCONSO, used for mapping UMLS to SNOMED CT)
* MedMentions https://github.com/chanzuckerberg/MedMentions (source of entity linking)

* Protégé http://protegeproject.github.io/protege/
* snomed-owl-toolkit https://github.com/IHTSDO/snomed-owl-toolkit
* DeepOnto https://github.com/KRR-Oxford/DeepOnto (based on OWLAPI https://owlapi.sourceforge.net/) for ontology processing and complex concept verbalisation
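
The UMLS-to-SNOMED CT mapping via MRCONSO mentioned in the list above can be sketched as follows. This is an illustration only, not the authors' actual procedure: it assumes the standard pipe-delimited MRCONSO.RRF layout (CUI in the first field, source abbreviation SAB in the twelfth, source code CODE in the fourteenth) and assumes that SNOMED CT rows carry the source abbreviation "SNOMEDCT_US" in the UMLS release used. The function name cui_to_snomed is hypothetical.

```python
from collections import defaultdict
from typing import Iterable

# Zero-based column positions in the pipe-delimited MRCONSO.RRF file.
CUI, SAB, CODE = 0, 11, 13


def cui_to_snomed(
    lines: Iterable[str], source: str = "SNOMEDCT_US"
) -> dict[str, set[str]]:
    """Map each UMLS CUI to the set of SNOMED CT codes listed for it.

    Assumption: `source` is the SAB value used for SNOMED CT in the
    UMLS release at hand; adjust it if the release uses another label.
    """
    mapping: dict[str, set[str]] = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        # Skip malformed rows and rows from other source vocabularies.
        if len(fields) > CODE and fields[SAB] == source:
            mapping[fields[CUI]].add(fields[CODE])
    return dict(mapping)
```

With a local copy of the file, the function could be called as cui_to_snomed(open("MRCONSO.RRF", encoding="utf-8")); one CUI may map to several SNOMED CT codes, which is why the values are sets.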

Files

OET-data-ver4.zip (595.4 MB)
md5:d3a4312627b6d8b86c1391281b49e3b6
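
After downloading, the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard hashlib module; the function name md5_of and the assumption that the archive sits in the working directory as OET-data-ver4.zip are illustrative.

```python
import hashlib
from pathlib import Path

# Checksum as listed for OET-data-ver4.zip on this record.
EXPECTED_MD5 = "d3a4312627b6d8b86c1391281b49e3b6"


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download is intact if md5_of(Path("OET-data-ver4.zip")) == EXPECTED_MD5 evaluates to True. Reading in chunks keeps memory use low for an archive of this size.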

Additional details

Dates

Accepted
2023-12-26