Santa Cruz Ellipsis Consortium Sluicing Dataset
Description
This is release 1.0 of the Santa Cruz Ellipsis Consortium Sluicing Dataset, made possible by funding from the UC Santa Cruz
Institute for Humanities Research and Committee on Research, as well as NSF funding for "The Implicit Content of Sluicing".
The data comprise roughly 5000 instances of sluicing (and some related constructions) extracted from the New York Times subset of the Gigaword dataset, covering the years 1994 to 2000. The sluices were located by a combination of parse-tree patterns and regular expressions, and we believe the set to be comprehensive for those years.
The sluices were then annotated over two years by teams of 5-6 annotators, and a final adjudication pass during 2018 produced the present dataset.
The sluices are annotated for antecedent, paraphrased elided content, and potential mismatches between the sluice paraphrase and antecedent.
In this release, there are three directories:
Doc: More information about the extraction of the data, the annotation process, and the tagset used.
Data: The annotated data, presented as a series of JSON documents.
Explorer: A lightweight script for traversing the JSON data.
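
For readers who want a first look at the Data directory before turning to the Explorer script, the following Python sketch may help. It is not the Explorer script and makes no claim about the release's actual schema: it assumes only that the data files end in ".json" and that each file parses as a single JSON value (one record or a list of records). The function name summarize_json_dir is hypothetical. It reports which top-level keys occur and how often, so the field names can be checked against the tagset documentation in Doc.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_json_dir(data_dir):
    """Count records and top-level keys across every *.json file under data_dir.

    Assumption (not confirmed by the release notes): each file holds either
    a single JSON object or a JSON list of objects.
    """
    key_counts = Counter()
    n_records = 0
    for path in sorted(Path(data_dir).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            loaded = json.load(f)
        # Normalize: treat a lone object as a one-element list of records.
        records = loaded if isinstance(loaded, list) else [loaded]
        for record in records:
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
    return n_records, key_counts


# Example usage, assuming the release was unpacked in the current directory:
#   n, keys = summarize_json_dir("Data")
#   print(n, "records")
#   for key, count in keys.most_common():
#       print(key, count)
```

If the files turn out to hold one JSON object per line rather than a single JSON value, json.load will raise an error and the loop would need to parse each line separately.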