Published November 30, 2018 | Version 1.0
Dataset Restricted

Santa Cruz Ellipsis Consortium Sluicing Dataset

  • 1. UCSC

Description

This is release 1.0 of the Santa Cruz Ellipsis Consortium Sluicing Dataset, made possible by funding from the UC Santa Cruz Institute for Humanities Research, Committee on Research, as well as NSF funding for "The Implicit Content of Sluicing".

The data comprises roughly 5,000 instances of sluicing (and some related constructions) extracted from the New York Times subset of the Gigaword dataset, covering the years 1994 to 2000. The sluices were located by a combination of parse-tree patterns and regular expressions, and we believe the set to be comprehensive for those years.
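As a rough illustration of the regular-expression side of such a search, the sketch below flags wh-words that are immediately followed by punctuation, a common surface cue for a sluice. It is not the consortium's actual pattern set: the word list, the `CANDIDATE` pattern, and the `find_candidates` helper are assumptions made for this example, and a pattern this simple both over- and under-generates compared with a search that also uses parse trees.

```python
import re

# Hypothetical word list and pattern, chosen for illustration only.
WH_WORDS = r"(?:who|whom|whose|what|when|where|why|how|which)"

# A wh-word followed (after optional whitespace) by clause-final punctuation.
CANDIDATE = re.compile(rf"\b{WH_WORDS}\b(?=\s*[.?!,;:])", re.IGNORECASE)


def find_candidates(sentence: str) -> list:
    """Return the lowercased wh-words that look like sluice remnants."""
    return [match.group(0).lower() for match in CANDIDATE.finditer(sentence)]
```

For instance, `find_candidates("Someone called, but I don't know who.")` flags `who`, while a sentence such as "She knows where he lives." yields no candidates because the wh-word is followed by a full clause.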

The sluices were then annotated over two years by teams of 5-6 annotators, and a final adjudication pass in 2018 produced the dataset in this release.

The sluices are annotated for antecedent, paraphrased elided content, and potential mismatches between the sluice paraphrase and antecedent.

This release contains three directories:
    Doc: More information about the extraction of the data, the annotation process, and the tagset used
    Data: The annotated data, presented as a series of JSON files
    Explorer: A lightweight script for traversing the JSON data
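
For readers who want a first look at the Data directory before reading the documentation, the following is a minimal sketch of walking a directory of JSON files. It is not the Explorer script shipped with the release, and it deliberately assumes nothing about the annotation schema: the function names are hypothetical, the `.json` file extension is an assumption, and the code only reports the top-level JSON type of each file.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Tuple


def iter_json_files(data_dir: str) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, parsed content) for every .json file under data_dir."""
    for path in sorted(Path(data_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield path, json.load(handle)


def summarize(data_dir: str) -> Dict[str, str]:
    """Map each file's relative path to its top-level JSON type name."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): type(content).__name__
        for path, content in iter_json_files(data_dir)
    }
```

Calling `summarize` on the local path of the Data directory gives a quick inventory of the files; the field names inside each record should be taken from the tagset documentation in the release rather than guessed.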

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:

Please provide contact information and information about how you plan to use the dataset.
