Published March 19, 2018 | Version 1.0.0
Dataset Open

InTeReC: In-text Reference Corpus - Single References Dataset

  • 1. ELICO, Université Claude Bernard Lyon 1
  • 2. Centre Tesnière - CRIT, Université de Bourgogne Franche-Comté

Description

This dataset contains sentences extracted from articles published by the Public Library of Science (PLOS) up to September 2013. For each sentence, it records the position of the sentence within the article and within the section in which it appears, the type of that section with respect to the four main types of the IMRaD structure, and the verb phrases that occur in the sentence. Each sentence contains exactly one in-text reference.

The dataset is in CSV format and contains 314,023 sentences.

Column list:

  • journal: journal title
  • doi: DOI of the article from which the sentence was extracted
  • article-length: size of the article, as number of sentences
  • article-pos: position of the sentence in the article, as number of sentences from the beginning of the article
  • section-length: size of the section, as number of sentences
  • section-pos: position of the sentence in the section, as number of sentences from the beginning of the section
  • section-type: section type (see below)
  • sentence-text: full text of the sentence
  • verb-phrases: a list of verb phrases that occur in the sentence, comma separated
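
The column list above can be read with Python's standard csv module. The following is a minimal sketch, not the authors' tooling. It assumes the file starts with a header row that uses exactly the column names listed above, and that fields are comma-delimited with standard quoting. The helper names (parse_row, read_rows) and the derived fields article-rel-pos and section-rel-pos are hypothetical, and the sample row is invented for illustration only. The description does not say whether positions are counted from 0 or from 1, so the relative positions are approximate.

```python
import csv
import io

# Numeric columns, named as in the dataset description. That the CSV header
# uses exactly these names is an assumption.
NUMERIC_COLUMNS = ("article-length", "article-pos", "section-length", "section-pos")


def parse_row(row):
    """Convert one raw CSV row (a dict of strings) into typed values."""
    parsed = dict(row)
    for name in NUMERIC_COLUMNS:
        parsed[name] = int(row[name])
    # verb-phrases is described as a comma-separated list.
    parsed["verb-phrases"] = [
        vp.strip() for vp in row["verb-phrases"].split(",") if vp.strip()
    ]
    # Hypothetical derived fields: position as a fraction of the total length.
    parsed["article-rel-pos"] = parsed["article-pos"] / parsed["article-length"]
    parsed["section-rel-pos"] = parsed["section-pos"] / parsed["section-length"]
    return parsed


def read_rows(handle):
    """Yield parsed rows from an open CSV file handle."""
    for row in csv.DictReader(handle):
        yield parse_row(row)


# Invented sample standing in for the real file (values are NOT from the dataset).
SAMPLE = io.StringIO(
    "journal,doi,article-length,article-pos,section-length,section-pos,"
    "section-type,sentence-text,verb-phrases\n"
    'Example Journal,10.0000/example,200,50,40,10,I,'
    '"Earlier work has shown this effect [1].","has shown"\n'
)
rows = list(read_rows(SAMPLE))
```

To process the actual file, the same read_rows function could be given a handle opened with open("interec-singleref-v1.csv", newline="", encoding="utf-8"). The UTF-8 encoding is an assumption.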

Possible section types are:

  • I: Introduction
  • M: Methods
  • R: Results
  • D: Discussion
  • MR: Methods and Results
  • RD: Results and Discussion
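
The section type codes above can be expanded into their IMRaD labels and tallied as follows. This is a minimal sketch: the names SECTION_LABELS, VALID_CODES, expand_section_type and count_by_section_type are hypothetical, and treating the combined codes MR and RD letter by letter is an assumption based on the list above.

```python
from collections import Counter

# Single-letter codes and their labels, as listed in the dataset description.
SECTION_LABELS = {
    "I": "Introduction",
    "M": "Methods",
    "R": "Results",
    "D": "Discussion",
}

# The six section-type values the description says can occur.
VALID_CODES = {"I", "M", "R", "D", "MR", "RD"}


def expand_section_type(code):
    """Return the IMRaD labels covered by a section-type code."""
    if code not in VALID_CODES:
        raise ValueError(f"unexpected section type: {code!r}")
    return [SECTION_LABELS[letter] for letter in code]


def count_by_section_type(codes):
    """Count how many sentences fall under each section-type code."""
    return Counter(codes)
```

For example, expand_section_type("RD") returns ["Results", "Discussion"], and count_by_section_type can be fed the section-type column to see how in-text references are distributed across section types.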

A full description of how the dataset was constructed is published in:

Marc Bertin and Iana Atanassova (2018) InTeReC: an In-text Reference corpus for applying Natural Language Processing to Bibliometrics. Bibliometric-enhanced Information Retrieval: 7th International BIR workshop (7th BIR workshop) at the 40th European Conference on Information Retrieval (ECIR).

Files

interec-singleref-v1.csv (84.2 MB)
md5:a2fbfc1f042e346dc7b52d094c02d278
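
After downloading, the file can be checked against the MD5 checksum listed above. This is a minimal sketch using Python's hashlib. The function names md5_of_file and is_intact are hypothetical, and the local path passed to them is whatever location the file was saved to.

```python
import hashlib

# Checksum published with the file listing.
EXPECTED_MD5 = "a2fbfc1f042e346dc7b52d094c02d278"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected
```

Calling is_intact("interec-singleref-v1.csv") returns True only if the local copy matches the published checksum. Reading in chunks keeps memory use low for the 84.2 MB file.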
