Published January 17, 2022 | Version v1
Dataset Open

Identification of focus versus background entities in scientific literature

  • 1. RMIT University

Description

This dataset supports training and evaluating methods for identifying focus versus background entities in scientific literature. A focus entity is an entity being actively researched in a publication, while a background entity is discussed in the publication but is not its main focus.

The dataset was generated automatically using the MeSH indexing of MEDLINE as reference. The entities of interest in this dataset are microbial pathogens. Entities were annotated using a dictionary approach, and the MeSH indexing of the MEDLINE citation linked to each publication was then used to label each entity as a focus or background entity.
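The labelling idea described above can be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the dictionary entries, the function names, and in particular the decision rule (an entity is "focus" when its MeSH heading appears in the citation's indexing, otherwise "background") are assumptions made for the example.

```python
import re

# Hypothetical dictionary mapping surface forms to MeSH headings.
PATHOGEN_DICTIONARY = {
    "escherichia coli": "Escherichia coli",
    "e. coli": "Escherichia coli",
    "staphylococcus aureus": "Staphylococcus aureus",
}


def annotate(text, dictionary):
    """Return the MeSH headings whose surface forms occur in the text."""
    lowered = text.lower()
    found = set()
    for surface, heading in dictionary.items():
        # Match the surface form only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(heading)
    return found


def label_entities(text, mesh_headings, dictionary):
    """Label each annotated entity as focus or background.

    Assumed rule: an entity is 'focus' if its heading is part of the
    citation's MeSH indexing, and 'background' otherwise.
    """
    indexed = set(mesh_headings)
    return {
        heading: ("focus" if heading in indexed else "background")
        for heading in annotate(text, dictionary)
    }


example_text = (
    "We sequenced Escherichia coli isolates; "
    "Staphylococcus aureus was used as a control."
)
example_mesh = ["Escherichia coli", "Genome, Bacterial"]
labels = label_entities(example_text, example_mesh, PATHOGEN_DICTIONARY)
```

In this toy example, the pathogen that appears in the citation's indexing is labelled as focus and the other as background.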

There are two main types of datasets: one generated from MEDLINE citations (files medline.*) and one generated from full-text articles in PubMed Central (PMC) (files pmc.*). Each dataset is split into the training and test sets that we used in our research. All fields within the files are separated by the pipe character "|". The MEDLINE citation dataset contains data from over 1M citations, while the PMC dataset contains data from over 100k publications (a subset of the MEDLINE dataset). In each row of the dataset files, the pathogen of interest has been replaced by the text @PATHOGEN$, and a single row may contain several mentions of the pathogen.
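As a minimal sketch of reading this format, the snippet below splits a row on the pipe character and counts occurrences of the @PATHOGEN$ placeholder. The number and order of columns are not specified here, so the sample row is invented for illustration, and the helper names and the UTF-8 encoding are assumptions.

```python
PLACEHOLDER = "@PATHOGEN$"


def parse_row(line):
    """Split one pipe-delimited row into its fields."""
    return line.rstrip("\r\n").split("|")


def count_mentions(fields):
    """Count how often the pathogen placeholder occurs across all fields."""
    return sum(field.count(PLACEHOLDER) for field in fields)


def iter_rows(path, encoding="utf-8"):
    """Yield the parsed fields of every non-empty row in a dataset file."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if line.strip():
                yield parse_row(line)


# Invented sample row; the real column layout may differ.
sample = "12345|focus|@PATHOGEN$ infection in mice. We study @PATHOGEN$ virulence.\n"
fields = parse_row(sample)
mentions = count_mentions(fields)
```

Reading line by line keeps memory use low, which matters for files of this size. Plain string splitting is used here on the assumption that fields themselves contain no pipe characters.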

The full-text article datasets are further split into two variants: one with explicit separation between sections, and one in which the whole article appears as a single text string with each section name placed at the beginning of its section.

Files

Files (6.3 GB)

Checksum                               Size
md5:89094727ef5182509285aa3c9a5476c2   147.1 MB
md5:f73bdc1c7715dc67daf6f68a04a38e9e   285.9 MB
md5:ba0ffc488ebe1659f0d41a7aa5df9583   974.7 MB
md5:acc52e4ce4acd934fcbd78c01e930a7b   973.1 MB
md5:e3ffaeb422f2e4c221df9bb70727b029   2.0 GB
md5:e5c4147e56424bbea0d631c7290de5b3   2.0 GB
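
The MD5 checksums listed above can be used to check that a download completed correctly. The following is a minimal sketch using Python's standard hashlib module. The helper names are hypothetical, and because the file names are not shown in the listing, the path passed to the function is left to the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_checksum(local_path, "md5:89094727ef5182509285aa3c9a5476c2")` should return True for an intact copy of the first file in the list, where `local_path` is wherever that file was saved.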