On the Use of Context for Predicting Citation Worthiness of Sentences in Scholarly Articles

Gosangi, Rakesh; Arora, Ravneet; Gheisarieha, Mohsen; Mahata, Debanjan; Zhang, Haimin

doi:10.5281/zenodo.4651554

Published March 31, 2021 | Version v1

Dataset Open

1. Bloomberg

The ACL-cite dataset was created for the paper “On the Use of Context for Predicting Citation Worthiness of Sentences in Scholarly Articles”, published at NAACL 2021. It contains over 2.7 million sentences extracted from scholarly articles in the ACL Anthology [Bird et al.], each with a citation worthiness label. The goal of the citation worthiness task is to determine whether a given sentence requires a citation.

There are three CSV files in the dataset:

train.csv: 1,625,268 rows
dev.csv: 539,085 rows
test.csv: 542,081 rows
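
After extracting the archive, the row counts listed above can be used as a quick integrity check. The following is a minimal sketch, assuming the three files sit in one directory, are UTF-8 encoded, and start with a header row; none of this is stated on the page. The function names are illustrative. It is also not stated whether the listed counts include the header line, so a difference of exactly one is not necessarily a problem.

```python
import csv
from pathlib import Path

# Row counts as listed on the dataset page.
EXPECTED_ROWS = {"train.csv": 1_625_268, "dev.csv": 539_085, "test.csv": 542_081}


def count_rows(path):
    """Count data rows (header excluded); csv.reader copes with quoted newlines."""
    with open(path, newline="", encoding="utf-8") as handle:  # encoding is an assumption
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return sum(1 for _ in reader)


def check_row_counts(data_dir, expected=EXPECTED_ROWS):
    """Return {filename: (found, expected)} for every file whose count differs."""
    mismatches = {}
    for name, want in expected.items():
        found = count_rows(Path(data_dir) / name)
        if found != want:
            mismatches[name] = (found, want)
    return mismatches
```

Calling `check_row_counts("path/to/extracted/files")` returns an empty dictionary when every file matches its listed count.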

Each CSV file contains the following columns:

document_id: identifier of the paper the sentence was extracted from
section: name of the section the sentence was extracted from (e.g. Abstract, Introduction)
section_id: sequential identifier of the section in the paper
paragraph_id: sequential identifier of the paragraph the sentence was extracted from
sentence: the sentence with the citations removed
raw_sentence: the raw sentence including the citations
sentence_id: sequential identifier of the sentence in the paper
label: citation worthiness label
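
The identifier columns make it possible to rebuild the paragraph around each sentence, which is useful when experimenting with context. The sketch below groups rows by document, section, and paragraph, then orders them by sentence_id. It assumes UTF-8 files with a header row and that sentence_id parses as an integer; the page states neither. The helper names are hypothetical, and this is not necessarily how the paper constructs context.

```python
import csv
from collections import defaultdict


def paragraphs(path):
    """Map (document_id, section_id, paragraph_id) to its sentences in order."""
    groups = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as handle:  # encoding is an assumption
        for row in csv.DictReader(handle):
            key = (row["document_id"], row["section_id"], row["paragraph_id"])
            # Assumption: sentence_id is an integer-valued string.
            groups[key].append((int(row["sentence_id"]), row["sentence"], row["label"]))
    return {key: sorted(rows, key=lambda r: r[0]) for key, rows in groups.items()}


def with_neighbours(rows, index):
    """Return (previous, target, next) sentence text; None at paragraph edges."""
    previous = rows[index - 1][1] if index > 0 else None
    following = rows[index + 1][1] if index + 1 < len(rows) else None
    return previous, rows[index][1], following
```

The label is kept as the raw string from the file, since the page does not say how labels are encoded.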

Note: The train/dev/test splits are done at the document_id level.
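
Because the splits are made at the document_id level, no document should appear in more than one file. A minimal sketch of that check follows, assuming UTF-8 CSV files with a header row; the function names are illustrative and not part of the dataset.

```python
import csv


def document_ids(path):
    """Collect the distinct document_id values in one split file."""
    with open(path, newline="", encoding="utf-8") as handle:  # encoding is an assumption
        return {row["document_id"] for row in csv.DictReader(handle)}


def shared_documents(paths):
    """Return {(split_a, split_b): shared ids} for every overlapping pair.

    `paths` maps a split name to its CSV path, for example
    {"train": "train.csv", "dev": "dev.csv", "test": "test.csv"}.
    """
    ids = {name: document_ids(path) for name, path in paths.items()}
    names = sorted(ids)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            common = ids[first] & ids[second]
            if common:
                overlaps[(first, second)] = common
    return overlaps
```

An empty result is what a document-level split should produce.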

Files

ACL-cite.zip (162.0 MB), md5:4d9ca5797615967575cb5c4b056f1976
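
The MD5 checksum listed for ACL-cite.zip can be used to confirm that a download is complete. The sketch below uses Python's standard hashlib module; the local path and the function name are assumptions for illustration.

```python
import hashlib

# Checksum as listed on the dataset page for ACL-cite.zip.
EXPECTED_MD5 = "4d9ca5797615967575cb5c4b056f1976"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True when the file's digest matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `is_intact("ACL-cite.zip")` should return True for an uncorrupted copy of the archive.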

Additional details

Is derived from: Dataset: https://www.aclweb.org/anthology/ (URL)

Resource type: Dataset
Publisher: Zenodo
Conference: 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021), Mexico City, Mexico, 6-11 June, 2021
Languages: English

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: April 8, 2021
Modified: April 9, 2021
