InTeReC: In-text Reference Corpus - Single References Dataset

Bertin, Marc; Atanassova, Iana

doi:10.5281/zenodo.1203737

Published March 19, 2018 | Version 1.0.0

Dataset Open

1. ELICO, Université Claude Bernard Lyon 1
2. Centre Tesnière - CRIT, Université de Bourgogne Franche-Comté

This dataset contains sentences extracted from articles published by the Public Library of Science (PLOS) up to September 2013. For each sentence, it gives the position of the sentence within the article and within the section in which it appears, the section type with respect to the four main section types of the IMRaD structure, and the verb phrases that occur in the sentence. Each sentence contains exactly one in-text reference.

The dataset is distributed as a CSV file containing 314,023 sentences.

Column list:

journal: journal title
doi: DOI of the article from which the sentence was extracted
article-length: size of the article, as number of sentences
article-pos: position of the sentence in the article, as number of sentences from the beginning of the article
section-length: size of the section, as number of sentences
section-pos: position of the sentence in the section, as number of sentences from the beginning of the section
section-type: section type (see below)
sentence-text: full text of the sentence
verb-phrases: a list of verb phrases that occur in the sentence, comma separated
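
The column list above can be read with Python's standard `csv` module. The following is a minimal sketch, not a description of the actual file layout: it assumes a comma-delimited file with a header row whose names match the column list, and the two sample rows (journal, DOIs, sentences) are invented for illustration. The relative-position fields are derived here and are not columns of the dataset.

```python
import csv
import io

# Hypothetical two-row sample in the ASSUMED layout (comma-delimited, header
# row using the column names listed above). All values are invented.
SAMPLE = '''journal,doi,article-length,article-pos,section-length,section-pos,section-type,sentence-text,verb-phrases
Example Journal,10.0000/example.0001,200,50,40,10,I,"This effect was first reported in [1].","was reported"
Example Journal,10.0000/example.0002,120,90,30,15,D,"Our results agree with and extend earlier work [7].","agree with, extend"
'''

# Columns described as sentence counts or sentence offsets.
INT_COLUMNS = ("article-length", "article-pos", "section-length", "section-pos")


def parse_rows(handle):
    """Yield one dict per sentence, with typed and derived fields."""
    for row in csv.DictReader(handle):
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        # verb-phrases is described as a comma-separated list.
        row["verb-phrases"] = [
            vp.strip() for vp in row["verb-phrases"].split(",") if vp.strip()
        ]
        # Relative positions (derived; not part of the dataset itself).
        row["article-rel-pos"] = row["article-pos"] / row["article-length"]
        row["section-rel-pos"] = row["section-pos"] / row["section-length"]
        yield row


rows = list(parse_rows(io.StringIO(SAMPLE)))
for row in rows:
    print(row["doi"], row["section-type"], row["article-rel-pos"], row["verb-phrases"])
```

To apply the same function to the downloaded file, the `io.StringIO(SAMPLE)` handle could be replaced with `open("interec-singleref-v1.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is also an assumption.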

Possible section types are:

I: Introduction
M: Methods
R: Results
D: Discussion
MR: Methods and Results
RD: Results and Discussion
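
As an illustration of how the section-type codes might be used, the sketch below maps each code to its label and counts sentences per section type. The function name `count_by_section` and the example codes are hypothetical; only the code-to-label mapping comes from the list above.

```python
from collections import Counter

# Section-type codes and labels, as listed above.
SECTION_TYPES = {
    "I": "Introduction",
    "M": "Methods",
    "R": "Results",
    "D": "Discussion",
    "MR": "Methods and Results",
    "RD": "Results and Discussion",
}


def count_by_section(codes):
    """Count sentences per section label; unknown codes are kept verbatim."""
    return Counter(SECTION_TYPES.get(code, code) for code in codes)


# Hypothetical codes, as they might be read from the section-type column.
example_codes = ["I", "I", "M", "RD", "D", "I"]
counts = count_by_section(example_codes)
print(counts.most_common())
```

Keeping unknown codes verbatim, rather than raising an error, makes any value outside the six listed types visible in the counts.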

A full description of how the dataset was constructed is published in:

Marc Bertin and Iana Atanassova (2018). InTeReC: an In-text Reference Corpus for Applying Natural Language Processing to Bibliometrics. Bibliometric-enhanced Information Retrieval: 7th International BIR workshop (7th BIR workshop) at the 40th European Conference on Information Retrieval (ECIR).

Files (84.2 MB)

interec-singleref-v1.csv (84.2 MB), md5:a2fbfc1f042e346dc7b52d094c02d278
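
The MD5 checksum listed for the file can be used to check that a download is complete. A minimal sketch using Python's standard `hashlib` module follows; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local path is whatever name the file was saved under.

```python
import hashlib

# Checksum listed for interec-singleref-v1.csv.
EXPECTED_MD5 = "a2fbfc1f042e346dc7b52d094c02d278"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `verify_download("interec-singleref-v1.csv")` should return `True` for an intact copy saved under that name in the current directory.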
