Identification of focus versus background entities in scientific literature
Description
This dataset supports training and evaluating methods for identifying focus versus background entities in scientific literature. A focus entity is an entity that is being actively researched in a publication, while a background entity is one that is discussed in the publication but is not its main focus.
The dataset has been generated automatically using the MeSH indexing of MEDLINE as reference. The entities of interest in this dataset are microbial pathogens. Entities were annotated using a dictionary approach and then the MeSH indexing of the MEDLINE citation linked to the publication was used to determine the relevance of the entity as focus or background entity.
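The labelling idea described above can be illustrated with a minimal sketch. The exact rule used to build the dataset is not stated here, so the code assumes the simplest possible one: a dictionary-matched pathogen is labelled focus when its MeSH descriptor appears in the citation's indexing, and background otherwise. The dictionary entries, the function names `find_pathogens` and `label_entities`, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical toy dictionary: lowercase surface form -> MeSH descriptor name.
PATHOGEN_DICTIONARY = {
    "escherichia coli": "Escherichia coli",
    "e. coli": "Escherichia coli",
    "staphylococcus aureus": "Staphylococcus aureus",
}


def find_pathogens(text):
    """Return the MeSH descriptors whose surface forms occur in the text."""
    lowered = text.lower()
    found = set()
    for surface, descriptor in PATHOGEN_DICTIONARY.items():
        # Require non-word characters (or string edges) around the match.
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(descriptor)
    return found


def label_entities(text, mesh_headings):
    """Assumed rule: focus if the descriptor is in the MeSH indexing."""
    indexed = set(mesh_headings)
    return {
        descriptor: "focus" if descriptor in indexed else "background"
        for descriptor in find_pathogens(text)
    }


labels = label_entities(
    "We sequenced E. coli isolates; Staphylococcus aureus served as a control.",
    ["Escherichia coli", "Genome, Bacterial"],
)
print(labels)
```

In this toy case the first pathogen is labelled focus because its descriptor is among the MeSH headings, and the second is labelled background because it is only mentioned in the text.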
There are two main types of datasets: one generated from MEDLINE citations (files medline.*) and another generated from full-text articles in PubMed Central (PMC) (files pmc.*). Each dataset is split into the training and test sets used in our research. All fields within the files are separated by the pipe character "|". The MEDLINE citation dataset contains data from over 1M citations, while the PMC dataset contains data from over 100k publications (a subset of the MEDLINE dataset). In each row of the dataset files, the pathogen of interest has been replaced by the text @PATHOGEN$, and the same row may contain several mentions of the pathogen.
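As a rough illustration of the file format, the sketch below splits rows on the pipe character and counts how many times the @PATHOGEN$ placeholder occurs in each row. The column layout is not documented here, so the three-column sample (identifier, label, text) is purely hypothetical, as are the helper names `read_rows` and `count_mentions`. The sketch also assumes that the pipe character never occurs inside a field.

```python
import io

PLACEHOLDER = "@PATHOGEN$"


def read_rows(handle):
    """Yield each non-empty line as a list of pipe-separated fields."""
    for line in handle:
        line = line.rstrip("\n")
        if line:
            yield line.split("|")


def count_mentions(fields):
    """Count placeholder occurrences across all fields of one row."""
    return sum(field.count(PLACEHOLDER) for field in fields)


# Made-up sample rows; the real column order may differ.
sample = io.StringIO(
    "12345|focus|@PATHOGEN$ was isolated and @PATHOGEN$ strains were typed.\n"
    "67890|background|Unlike @PATHOGEN$, the studied virus is airborne.\n"
)

rows = list(read_rows(sample))
mention_counts = [count_mentions(fields) for fields in rows]
print(mention_counts)
```

For a real file, `sample` could be replaced by `open(path, encoding="utf-8")`. Plain string splitting is used rather than the `csv` module because full-text fields can be very long and may contain unbalanced quote characters.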
The full-text article datasets have been further split into two variants: one with explicit separation between sections, and another in which the whole article appears as a single text string with section names at the beginning of each section.
Files
(6.3 GB)
Checksum | Size |
---|---|
md5:89094727ef5182509285aa3c9a5476c2 | 147.1 MB |
md5:f73bdc1c7715dc67daf6f68a04a38e9e | 285.9 MB |
md5:ba0ffc488ebe1659f0d41a7aa5df9583 | 974.7 MB |
md5:acc52e4ce4acd934fcbd78c01e930a7b | 973.1 MB |
md5:e3ffaeb422f2e4c221df9bb70727b029 | 2.0 GB |
md5:e5c4147e56424bbea0d631c7290de5b3 | 2.0 GB |
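
The MD5 checksums listed above can be used to check that a download completed correctly. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the path in the usage comment is a placeholder rather than an actual file name from this record.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage (placeholder path, checksum copied from the listing):
# matches_checksum("path/to/downloaded_file",
#                  "md5:89094727ef5182509285aa3c9a5476c2")
```

Reading in chunks keeps memory use constant, which matters for the 2.0 GB files.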