Dataset Open Access

The SICK (Sentences Involving Compositional Knowledge) dataset for relatedness and entailment

Marco Marelli; Stefano Menini; Marco Baroni; Luisa Bentivogli; Raffaella Bernardi; Roberto Zamparelli


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.2787612", 
  "language": "eng", 
  "title": "The SICK (Sentences Involving Compositional Knowledge) dataset for relatedness and entailment", 
  "issued": {
    "date-parts": [
      [
        2014, 
        5, 
        26
      ]
    ]
  }, 
  "abstract": "<p>The SICK data set consists of about 10,000 English sentence pairs, generated starting from two existing sets: the&nbsp;<a href=\"http://nlp.cs.illinois.edu/HockenmaierGroup/data.html\">8K ImageFlickr data set</a>&nbsp;and the&nbsp;<a href=\"http://www.cs.york.ac.uk/semeval-2012/task6/index.php?id=data\">SemEval 2012 STS MSR-Video Description data set</a>. We randomly selected a subset of sentence pairs from each of these sources and we applied a 3-step generation process: first, the original sentences were normalized to remove unwanted linguistic phenomena; the normalized sentences were then expanded to obtain up to three new sentences with specific characteristics suitable to CDSM evaluation; as a last step, all the sentences generated in the expansion phase were paired with the normalized sentences in order to obtain the final data set.</p>\n\n<p>Each sentence pair was annotated for relatedness and entailment by means of crowdsourcing techniques. The&nbsp;<strong>sentence relatedness score</strong>&nbsp;(on a 5-point rating scale) provides a direct way to evaluate CDSMs, insofar as their outputs are meant to quantify the degree of semantic relatedness between sentences; the categorization in terms of the&nbsp;<strong>entailment relation between the two sentences</strong>&nbsp;(with&nbsp;<em>entailment, contradiction</em>, and&nbsp;<em>neutral</em>&nbsp;as gold labels) is also a crucial aspect to consider, since detecting the presence of entailment is one of the traditional benchmarks of a successful semantic system.</p>\n\n<p>In the final set, gold scores for relatedness and entailment were distributed as follows: the relatedness scoring resulted in 923 pairs within the [1,2) range, 1373 pairs within the [2,3) range, 3872 pairs within the [3,4) range, and 3672 pairs within the [4,5] range; the entailment annotation led to 5595&nbsp;<em>neutral</em>&nbsp;pairs, 1424&nbsp;<em>contradiction</em>&nbsp;pairs, and 2821&nbsp;<em>entailment</em>&nbsp;pairs.</p>\n\n<p><strong>Files</strong></p>\n\n<ul>\n\t<li>SICK.zip (main file)</li>\n\t<li>SICK_Annotated.zip (a&nbsp;version of the data set annotated for the expansion rule which was used in each case)</li>\n\t<li>SICK_subsets.zip (indexes specifying further classifications, used in the JLRE 2016 publication)</li>\n</ul>\n\n<p>&nbsp;</p>",
  "author": [
    {
      "family": "Marelli",
      "given": "Marco"
    },
    {
      "family": "Menini",
      "given": "Stefano"
    },
    {
      "family": "Baroni",
      "given": "Marco"
    },
    {
      "family": "Bentivogli",
      "given": "Luisa"
    },
    {
      "family": "Bernardi",
      "given": "Raffaella"
    },
    {
      "family": "Zamparelli",
      "given": "Roberto"
    }
  ],
  "type": "dataset", 
  "id": "2787612"
}
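
The gold-score distribution quoted in the abstract can be sanity-checked with a few lines of Python. The sketch below only uses the counts stated in the record; the names relatedness_bin and shares are hypothetical helpers introduced for illustration, and nothing here reads the actual SICK files.

```python
# Counts copied from the abstract of this record.
relatedness_bins = {"[1,2)": 923, "[2,3)": 1373, "[3,4)": 3872, "[4,5]": 3672}
entailment_labels = {"neutral": 5595, "contradiction": 1424, "entailment": 2821}


def relatedness_bin(score: float) -> str:
    """Map a 1-5 relatedness score to the bin labels used in the abstract.

    Hypothetical helper: the last bin is closed on both ends, the others
    are half-open, matching the ranges quoted in the record.
    """
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    if score < 2.0:
        return "[1,2)"
    if score < 3.0:
        return "[2,3)"
    if score < 4.0:
        return "[3,4)"
    return "[4,5]"


def shares(counts: dict) -> dict:
    """Return each category's fraction of the total, rounded to 3 places."""
    total = sum(counts.values())
    return {name: round(n / total, 3) for name, n in counts.items()}


total_by_relatedness = sum(relatedness_bins.values())
total_by_entailment = sum(entailment_labels.values())

# Both annotation layers should cover the same set of pairs.
print(total_by_relatedness, total_by_entailment)
print(shares(entailment_labels))
```

Both totals come to 9840 pairs, which is consistent with the "about 10,000" figure given in the abstract.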