Datasets for OntoClue Project

Ravinder, Rohitha; Geist, Lukas; Rebholz-Schuhmann, Dietrich; Castro, Leyla Jael

doi:10.5281/zenodo.14801641

Published February 10, 2025 | Version v1.0.0

Dataset Open

1. ZB MED - Information Centre for Life Sciences

Contributors

Project member:

Fellerhoff, Tim

Description

This release contains the datasets and files associated with the OntoClue project, which investigates various text embedding techniques for assessing document-to-document similarity in biomedical literature. The project primarily utilizes the RELISH Corpus [1], a comprehensive dataset curated by experts that includes relevance annotations for document pairs based on their similarity. This release includes datasets for establishing ground truth, as well as retrieved titles and abstracts for all PMIDs in the RELISH database. The files also contain preprocessed tokens for use in text embedding neural network models, as well as annotated tokens based on the MeSH (Medical Subject Headings) [2] vocabulary.

Data Structure and Files

missing_pmids.tsv: List of PMIDs for which titles and abstracts could not be retrieved
relevance_matrix.tsv: Ground truth dataset file derived from the RELISH JSON file, containing 189,634 document pairs in three columns: PMID1 (reference article), PMID2 (assessed article), and relevance (relevance score between the two documents). It consists of 68,479 completely relevant pairs, 65,406 partially relevant pairs and 55,749 irrelevant pairs.
relish_documents.tsv: Contains retrieved RELISH documents, including PMID, title and abstract (163,189 articles)
relish_bert_input_text.zip: Preprocessed titles and abstracts for use with BERT-based models
relish_preprocessed_normal_tokens.zip: Document text preprocessed for use with all embedding approaches
relish_normal_split_datasets.zip: Preprocessed document text split into training, validation and test datasets
relish_xml_files.zip: RELISH articles retrieved as XML files
relish_annotated_xml_files.zip: Annotated XML files of RELISH articles (163,189 articles)
relish_preprocessed_annotated_tokens.zip: Document text preprocessed for use with all embedding approaches, with annotations
relish_annotated_split_datasets.zip: Preprocessed and annotated document text split into training, validation and test datasets
relish_ground_truth_split_datasets.zip: Ground truth dataset split into training, validation and test datasets
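
The three-column layout of relevance_matrix.tsv described above lends itself to a quick sanity check. The Python sketch below tallies document pairs per relevance value. It assumes the file is tab-separated with a header row naming the columns PMID1, PMID2 and relevance, which the description suggests but does not state explicitly, and it makes no assumption about how the three relevance levels are encoded. The function names are illustrative and are not taken from the project repository.

```python
import csv
from collections import Counter
from typing import Iterable, TextIO


def read_pairs(handle: TextIO) -> list[dict]:
    """Read tab-separated document pairs from an open text handle.

    Assumption: the first line is a header containing a 'relevance' column.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


def count_relevance(rows: Iterable[dict]) -> Counter:
    """Tally document pairs per relevance value (values kept as strings)."""
    return Counter(row["relevance"] for row in rows)


# Possible usage, once the file has been downloaded:
#
#     with open("relevance_matrix.tsv", newline="", encoding="utf-8") as handle:
#         pairs = read_pairs(handle)
#     print(len(pairs), dict(count_relevance(pairs)))
```

If these assumptions hold, the total should come to 189,634 pairs, split across three relevance values with 68,479, 65,406 and 55,749 pairs.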

Data Collection

The RELISH dataset v1 was downloaded from the corresponding FigShare record [3] on January 24th, 2022. The dataset, in JSON format, contains PubMed IDs (PMIDs) along with relevance assessments for document pairs. Using the BioC API, we retrieved XML files containing the PMID, title, and abstract for each unique entry in the RELISH JSON file. Any PMIDs that could not be retrieved, or that lacked titles and abstracts, were recorded as missing. In total, 163,189 XML files were successfully retrieved. These XML files were also converted into a TSV file with three columns: PMID, title, and abstract. The text from the titles and abstracts was further preprocessed for use in the various embedding approaches.
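
The XML-to-TSV conversion step can be sketched as follows. This is a minimal illustration and not the project's own code, which lives in the repository listed under Software. It assumes the common BioC XML layout: a collection root containing a document element with an id child, and passage elements whose type is given by an infon element with key="type" (values "title" and "abstract"). The function name bioc_to_row is hypothetical.

```python
import xml.etree.ElementTree as ET


def bioc_to_row(xml_text: str) -> tuple[str, str, str]:
    """Extract (PMID, title, abstract) from one BioC XML string.

    Assumed layout: <collection><document><id>...</id>
    <passage><infon key="type">title</infon><text>...</text></passage>...
    """
    root = ET.fromstring(xml_text)
    document = root.find(".//document")
    if document is None:
        raise ValueError("no <document> element found")

    pmid = (document.findtext("id") or "").strip()

    # Collect passage texts by type; a document may have several passages
    # of the same type, so they are joined with a space.
    parts = {"title": [], "abstract": []}
    for passage in document.iter("passage"):
        kind = passage.findtext("infon[@key='type']")
        text = (passage.findtext("text") or "").strip()
        if kind in parts and text:
            parts[kind].append(text)

    return pmid, " ".join(parts["title"]), " ".join(parts["abstract"])
```

Each returned tuple corresponds to one row of a three-column PMID, title, abstract table. An empty title or abstract in the result would mark a candidate for the list of missing PMIDs.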

References

[1] Peter Brown, RELISH Consortium, Yaoqi Zhou, Large expert-curated database for benchmarking document similarity detection in biomedical literature search, Database, Volume 2019, 2019, baz085, https://doi.org/10.1093/database/baz085

[2] Lipscomb C. E. (2000). Medical Subject Headings (MeSH). Bulletin of the Medical Library Association, 88(3), 265–266.

[3] Brown, Peter (2019). RELISH_v1. figshare. Dataset. https://doi.org/10.6084/m9.figshare.7722905.v1

Files (2.4 GB)

Name	MD5	Size
missing_pmids.tsv	d1ee317a2932688bb5e89c4d22c4fede	12 Bytes
relevance_matrix.tsv	130cd40461d41870d9a4328dfd295b86	4.7 MB
relish_annotated_split_datasets.zip	bb7597568e139a536c7086892d300b4e	118.9 MB
relish_annotated_xml_files.zip	97a32c895c6bbc31d90b6e2768908e53	317.6 MB
relish_bert_input_text.zip	608e6d28c09b1fc9129952e38846feee	76.3 MB
relish_documents.tsv	2a0c9c97b51ec328bd03445af36688e0	277.5 MB
relish_ground_truth_split_datasets.zip	fedbe99d69e4244451ffcb5501fd1020	699.7 kB
relish_normal_split_datasets.zip	0804aa493b84b2dd5fb6a492f16a9aa1	541.3 MB
relish_preprocessed_annotated_tokens.zip	9e5dd47bf88d35582f290025361a0292	152.3 MB
relish_preprocessed_normal_tokens.zip	3bc3179c5972ad62c4e3c1657497f6ca	692.2 MB
relish_xml_files.zip	dae8d6048af1ca67f6a655f92ebbf89b	186.2 MB
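
The MD5 checksums listed for each file can be used to confirm that a download completed without corruption. The Python sketch below uses only the standard library; the helper names md5_of_file and matches_checksum are illustrative, and the file is read in chunks so that the larger archives do not need to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the published value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Possible usage with a value from the file list:
#
#     matches_checksum("relevance_matrix.tsv",
#                      "130cd40461d41870d9a4328dfd295b86")
```

A False result would suggest an incomplete or altered download, in which case fetching the file again is the simplest remedy.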

Additional details

Related works

Is required by: Model: 10.5281/zenodo.14826813 (DOI)

Funding

Deutsche Forschungsgemeinschaft: STELLA Project 407518790
Deutsche Forschungsgemeinschaft: NFDI4DataScience 460234259

Software

Repository URL: https://github.com/zbmed-semtec/relish-preprocessing
Programming language: Python
Development Status: Active

	All versions	This version
Views	226	226
Downloads	748	748
Data volume	177.8 GB	177.8 GB
