Published February 10, 2025 | Version v1.0.0
Dataset Open

Datasets for OntoClue Project

  • 1. ZB MED - Information Centre for Life Sciences

Contributors

Project member:

Description

This release contains the datasets and files associated with the OntoClue project, which investigates various text embedding techniques for assessing document-to-document similarity in biomedical literature. The project primarily utilizes the RELISH Corpus [1], a comprehensive dataset curated by experts that includes relevance annotations for document pairs based on their similarity. This release includes datasets for establishing ground truth, as well as retrieved titles and abstracts for all PMIDs in the RELISH database. The files also contain preprocessed tokens for use in text embedding neural network models, as well as annotated tokens based on the MeSH (Medical Subject Headings) [2] vocabulary.  

Data Structure and Files

  1. missing_pmids.tsv: List of PMIDs for which titles and abstracts could not be retrieved
  2. relevance_matrix.tsv: Ground truth dataset file derived from the RELISH JSON file, containing 189,634 document pairs in three columns: PMID1 (reference article), PMID2 (assessed article), and relevance (relevance score between the two documents). It consists of 68,479 completely relevant pairs, 65,406 partially relevant pairs and 55,749 irrelevant pairs.
  3. relish_documents.tsv: Contains retrieved RELISH documents, including PMID, title and abstract (163,189 articles)
  4. relish_bert_input_text.zip: Preprocessed titles and abstracts for use with BERT-based models
  5. relish_preprocessed_normal_tokens.zip: Document text preprocessed for use with all embedding approaches
  6. relish_normal_split_datasets.zip: Preprocessed document text split into training, validation and test datasets
  7. relish_xml_files.zip: RELISH articles retrieved as XML files
  8. relish_annotated_xml_files.zip: Annotated XML files of RELISH articles (163,189 articles)
  9. relish_preprocessed_annotated_tokens.zip: Document text preprocessed for use with all embedding approaches, with annotations
  10. relish_annotated_split_datasets.zip: Preprocessed and annotated document text split into training, validation and test datasets
  11. relish_ground_truth_split_datasets.zip: Ground truth dataset split into training, validation and test datasets
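
As a rough illustration of how the three-column relevance matrix described in the list above could be inspected, the following minimal sketch counts document pairs per relevance label. It assumes the TSV has a header row with the column names PMID1, PMID2 and relevance; the helper name count_relevance, the sample PMIDs and the label values are made up for the example, and the actual encoding of the relevance column is not stated here.

```python
import csv
import io
from collections import Counter


def count_relevance(tsv_file, column="relevance"):
    """Count how many document pairs carry each relevance label.

    `tsv_file` is any open text file (or file-like object) holding
    tab-separated rows with a header line. The column name is an
    assumption based on the description of relevance_matrix.tsv.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row[column] for row in reader)


# Tiny in-memory stand-in for relevance_matrix.tsv.
# PMIDs and labels below are invented placeholders, not real data.
sample = io.StringIO(
    "PMID1\tPMID2\trelevance\n"
    "111\t222\t2\n"
    "111\t333\t0\n"
    "444\t555\t2\n"
)

counts = count_relevance(sample)
for label, n_pairs in sorted(counts.items()):
    print(label, n_pairs)
```

For the real file, the same function could be called on `open("relevance_matrix.tsv", newline="", encoding="utf-8")`; if the header differs from the assumed column names, the `column` argument would need adjusting.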

Data Collection

The RELISH dataset v1 was downloaded from the corresponding FigShare record [3] on January 24th, 2022. The dataset, in JSON format, contains PubMed IDs (PMIDs) along with relevance assessments for document pairs. Using the BioC API, we retrieved XML files containing the PMID, title, and abstract for each unique entry in the RELISH JSON file. Any PMIDs that could not be retrieved, or that lacked titles and abstracts, were recorded as missing. In total, 163,189 XML files were successfully retrieved. These XML files were also converted into a TSV file with three columns: PMID, title, and abstract. The text from the titles and abstracts was further preprocessed for use in various approaches.
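
The XML-to-TSV conversion step described above might look roughly like the sketch below. It is not the project's actual code: the function name bioc_to_row is hypothetical, and the XML layout (a collection containing a document with an id element and passage elements whose infon with key "type" is "title" or "abstract") is an assumption about the retrieved BioC files. The sample document is invented.

```python
import csv
import io
import xml.etree.ElementTree as ET


def bioc_to_row(xml_text):
    """Extract (PMID, title, abstract) from one BioC-style XML string.

    Assumes passages are tagged via <infon key="type">title</infon> or
    <infon key="type">abstract</infon>; other passage types are ignored.
    """
    root = ET.fromstring(xml_text)
    doc = root.find(".//document")
    pmid = doc.findtext("id", default="")
    fields = {"title": [], "abstract": []}
    for passage in doc.iter("passage"):
        kind = passage.findtext("infon[@key='type']")
        if kind in fields:
            fields[kind].append(passage.findtext("text", default=""))
    # Join multiple passages of the same type with a single space.
    return pmid, " ".join(fields["title"]), " ".join(fields["abstract"])


# Invented sample document standing in for one retrieved XML file.
SAMPLE = """<collection><document><id>12345</id>
<passage><infon key="type">title</infon><text>A made-up title</text></passage>
<passage><infon key="type">abstract</infon><text>A made-up abstract.</text></passage>
</document></collection>"""

row = bioc_to_row(SAMPLE)

# Write a header plus the extracted row as tab-separated text.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["PMID", "title", "abstract"])
writer.writerow(row)
tsv_text = buffer.getvalue()
print(tsv_text)
```

A row whose title and abstract both come back empty would correspond to the "missing" case mentioned above; how the project actually detects and records such PMIDs is not shown here.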

References

[1] Peter Brown, RELISH Consortium, Yaoqi Zhou, Large expert-curated database for benchmarking document similarity detection in biomedical literature search, Database, Volume 2019, 2019, baz085, https://doi.org/10.1093/database/baz085

[2] Lipscomb C. E. (2000). Medical Subject Headings (MeSH). Bulletin of the Medical Library Association, 88(3), 265–266.

[3] Brown, Peter (2019). RELISH_v1. figshare. Dataset. https://doi.org/10.6084/m9.figshare.7722905.v1

Files

Files (2.4 GB)

Checksum (MD5), Size
md5:d1ee317a2932688bb5e89c4d22c4fede, 12 Bytes
md5:130cd40461d41870d9a4328dfd295b86, 4.7 MB
md5:bb7597568e139a536c7086892d300b4e, 118.9 MB
md5:97a32c895c6bbc31d90b6e2768908e53, 317.6 MB
md5:608e6d28c09b1fc9129952e38846feee, 76.3 MB
md5:2a0c9c97b51ec328bd03445af36688e0, 277.5 MB
md5:fedbe99d69e4244451ffcb5501fd1020, 699.7 kB
md5:0804aa493b84b2dd5fb6a492f16a9aa1, 541.3 MB
md5:9e5dd47bf88d35582f290025361a0292, 152.3 MB
md5:3bc3179c5972ad62c4e3c1657497f6ca, 692.2 MB
md5:dae8d6048af1ca67f6a655f92ebbf89b, 186.2 MB
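
The MD5 checksums listed above can be used to check that a downloaded file arrived intact. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and matches_checksum are made up for this example, and the demonstration runs on a small temporary file rather than on any file from this record.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Demonstration on a throwaway file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
    demo_path = tmp.name

demo_digest = md5_of_file(demo_path)
demo_ok = matches_checksum(demo_path, "md5:" + demo_digest)
os.remove(demo_path)

print(demo_digest, demo_ok)
```

For a real download, `matches_checksum` would be called with the local file path and the corresponding md5 string from the listing; a result of False suggests a corrupted or incomplete download.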

Additional details

Related works

Is required by
Model: 10.5281/zenodo.14826813 (DOI)

Funding

Deutsche Forschungsgemeinschaft
STELLA Project 407518790
Deutsche Forschungsgemeinschaft
NFDI4DataScience 460234259

Software

Repository URL
https://github.com/zbmed-semtec/relish-preprocessing
Programming language
Python
Development Status
Active