Published March 31, 2021 | Version v1
Dataset Open

On the Use of Context for Predicting Citation Worthiness of Sentences in Scholarly Articles

Description

The ACL-cite dataset was created for the paper “On the Use of Context for Predicting Citation Worthiness of Sentences in Scholarly Articles”, published at NAACL 2021. It contains over 2.7 million sentences extracted from scholarly articles in the ACL Anthology [Bird et al.], each paired with a citation worthiness label. The goal of the citation worthiness task is to determine whether a given sentence requires a citation.

There are three CSV files in the dataset:

  • train.csv: 1,625,268 rows
  • dev.csv: 539,085 rows
  • test.csv: 542,081 rows
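As a quick sanity check on the listed sizes, the row counts can be summed and turned into split proportions. This is a minimal sketch that uses only the numbers given above; the names SPLIT_ROWS, total_rows, and split_shares are illustrative and not part of the dataset. The result is roughly a 60/20/20 split.

```python
# Row counts as listed for each CSV file in the dataset description.
SPLIT_ROWS = {"train": 1_625_268, "dev": 539_085, "test": 542_081}

# Total number of sentences across the three splits.
total_rows = sum(SPLIT_ROWS.values())

# Fraction of all sentences that falls in each split.
split_shares = {name: rows / total_rows for name, rows in SPLIT_ROWS.items()}

for name, rows in SPLIT_ROWS.items():
    print(f"{name}: {rows:,} rows ({split_shares[name]:.1%})")
print(f"total: {total_rows:,} rows")
```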

Each CSV file contains the following columns:

  • document_id: identifier of the paper the sentence was extracted from
  • section: name of the section the sentence was extracted from (e.g. Abstract, Introduction)
  • section_id: sequential identifier of the section in the paper 
  • paragraph_id: sequential identifier of the paragraph the sentence was extracted from
  • sentence: the sentence with the citations removed
  • raw_sentence: the raw sentence including the citations
  • sentence_id: sequential identifier of the sentence in the paper
  • label: citation worthiness label
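The column list above can be checked when reading a file. The following is a minimal sketch using Python's standard csv module. It assumes the files are UTF-8 encoded and carry a header row with these column names, neither of which is stated in the description. The helper name read_split is hypothetical. Values are returned as strings, because the encoding of the label column is not documented here.

```python
import csv

# Column names taken from the dataset description.
EXPECTED_COLUMNS = [
    "document_id", "section", "section_id", "paragraph_id",
    "sentence", "raw_sentence", "sentence_id", "label",
]

def read_split(path):
    """Yield one dict per sentence row after checking the header.

    Assumes UTF-8 text and a header row (not confirmed by the description).
    The header check runs when the first row is requested.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"{path} is missing columns: {sorted(missing)}")
        # Rows are streamed, so the 1.6M-row train file need not fit in memory.
        yield from reader
```

For example, `for row in read_split("train.csv"): ...` would iterate over the training sentences, assuming the archive has been extracted into the working directory.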

Note: The train/dev/test splits are made at the document_id level, so all sentences from a given paper fall in the same split.
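A document-level split implies that no document_id appears in more than one file. The sketch below shows one way to check this, given rows represented as dicts with a document_id key. The helper name shared_document_ids and the toy rows are illustrative only and are not taken from the dataset.

```python
def shared_document_ids(rows_a, rows_b):
    """Return the document_id values that occur in both collections of rows."""
    ids_a = {row["document_id"] for row in rows_a}
    ids_b = {row["document_id"] for row in rows_b}
    return ids_a & ids_b

# Toy rows standing in for two splits (made-up identifiers).
toy_train = [{"document_id": "doc-a"}, {"document_id": "doc-a"}, {"document_id": "doc-b"}]
toy_dev = [{"document_id": "doc-c"}]

# An empty set means the two splits share no documents.
overlap = shared_document_ids(toy_train, toy_dev)
print(overlap)
```

Applied pairwise to train, dev, and test, an empty result for every pair would be consistent with the note above.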

Files

ACL-cite.zip (162.0 MB)
md5:4d9ca5797615967575cb5c4b056f1976

Additional details

Related works

Is derived from
Dataset: https://www.aclweb.org/anthology/ (URL)