Dataset Open Access

The SICK (Sentences Involving Compositional Knowledge) dataset for relatedness and entailment

Marco Marelli; Stefano Menini; Marco Baroni; Luisa Bentivogli; Raffaella Bernardi; Roberto Zamparelli


JSON-LD (schema.org) Export

{
  "inLanguage": {
    "alternateName": "eng", 
    "@type": "Language", 
    "name": "English"
  }, 
  "description": "<p>The SICK data set consists of about 10,000 English sentence pairs, generated starting from two existing sets: the&nbsp;<a href=\"http://nlp.cs.illinois.edu/HockenmaierGroup/data.html\">8K ImageFlickr data set</a>&nbsp;and the&nbsp;<a href=\"http://www.cs.york.ac.uk/semeval-2012/task6/index.php?id=data\">SemEval 2012 STS MSR-Video Description data set</a>. We randomly selected a subset of sentence pairs from each of these sources and applied a 3-step generation process: first, the original sentences were normalized to remove unwanted linguistic phenomena; the normalized sentences were then expanded to obtain up to three new sentences with specific characteristics suitable for evaluating compositional distributional semantic models (CDSMs); as a last step, all the sentences generated in the expansion phase were paired with the normalized sentences in order to obtain the final data set.</p>\n\n<p>Each sentence pair was annotated for relatedness and entailment by means of crowdsourcing techniques. The&nbsp;<strong>sentence relatedness score</strong>&nbsp;(on a 5-point rating scale) provides a direct way to evaluate CDSMs, insofar as their outputs are meant to quantify the degree of semantic relatedness between sentences; the categorization in terms of the&nbsp;<strong>entailment relation between the two sentences</strong>&nbsp;(with&nbsp;<em>entailment, contradiction</em>, and&nbsp;<em>neutral</em>&nbsp;as gold labels) is also a crucial aspect to consider, since detecting the presence of entailment is one of the traditional benchmarks of a successful semantic system.</p>\n\n<p>In the final set, gold scores for relatedness and entailment were distributed as follows: the relatedness scoring resulted in 923 pairs within the [1,2) range, 1373 pairs within the [2,3) range, 3872 pairs within the [3,4) range, and 3672 pairs within the [4,5] range; the entailment annotation led to 5595&nbsp;<em>neutral</em>&nbsp;pairs, 1424&nbsp;<em>contradiction</em>&nbsp;pairs, and 2821&nbsp;<em>entailment</em>&nbsp;pairs.</p>\n\n<p><strong>Files</strong></p>\n\n<ul>\n\t<li>SICK.zip (main file)</li>\n\t<li>SICK_Annotated.zip (a&nbsp;version of the data set annotated for the expansion rule which was used in each case)</li>\n\t<li>SICK_subsets.zip (indexes specifying further classifications, used in the JLRE 2016 publication)</li>\n</ul>",
  "license": "http://creativecommons.org/licenses/by-nc-sa/3.0/legalcode", 
  "creator": [
    {
      "affiliation": "Universit\u00e0 di Milano Bicocca", 
      "@type": "Person", 
      "name": "Marco Marelli"
    }, 
    {
      "affiliation": "FBK", 
      "@type": "Person", 
      "name": "Stefano Menini"
    }, 
    {
      "affiliation": "ICREA", 
      "@type": "Person", 
      "name": "Marco Baroni"
    }, 
    {
      "affiliation": "FBK", 
      "@type": "Person", 
      "name": "Luisa Bentivogli"
    }, 
    {
      "affiliation": "Universit\u00e0 di Trento", 
      "@type": "Person", 
      "name": "Raffaella Bernardi"
    }, 
    {
      "affiliation": "Universit\u00e0 di Trento", 
      "@type": "Person", 
      "name": "Roberto Zamparelli"
    }
  ], 
  "url": "https://zenodo.org/record/2787612", 
  "datePublished": "2014-05-26", 
  "keywords": [
    "computational linguistics", 
    "entailment", 
    "sentence similarity", 
    "sentence relatedness", 
    "compositional semantics", 
    "distributional semantics"
  ], 
  "@context": "https://schema.org/", 
  "distribution": [
    {
      "contentUrl": "https://zenodo.org/api/files/e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4/SICK_Annotated.zip", 
      "@type": "DataDownload", 
      "fileFormat": "zip"
    }, 
    {
      "contentUrl": "https://zenodo.org/api/files/e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4/SICK_subsets.zip", 
      "@type": "DataDownload", 
      "fileFormat": "zip"
    }, 
    {
      "contentUrl": "https://zenodo.org/api/files/e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4/SICK.zip", 
      "@type": "DataDownload", 
      "fileFormat": "zip"
    }
  ], 
  "identifier": "https://doi.org/10.5281/zenodo.2787612", 
  "@id": "https://doi.org/10.5281/zenodo.2787612", 
  "@type": "Dataset", 
  "name": "The SICK (Sentences Involving Compositional Knowledge) dataset for relatedness and entailment"
}
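
The record above can be read programmatically. The following is a minimal Python sketch, not part of the original record: it parses an abridged copy of the JSON-LD export (only the `identifier` and `distribution` fields are reproduced), lists the download URLs, and checks that the relatedness bins and entailment labels quoted in the description add up to the same number of pairs. The names `download_urls`, `RELATEDNESS_BINS`, and `ENTAILMENT_LABELS` are illustrative assumptions, not identifiers defined by Zenodo or by the dataset.

```python
import json

# Abridged copy of the JSON-LD export above (assumption: only the fields
# needed for this sketch are reproduced here).
RECORD = json.loads("""
{
  "@type": "Dataset",
  "identifier": "https://doi.org/10.5281/zenodo.2787612",
  "distribution": [
    {"contentUrl": "https://zenodo.org/api/files/e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4/SICK_Annotated.zip",
     "@type": "DataDownload", "fileFormat": "zip"},
    {"contentUrl": "https://zenodo.org/api/files/e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4/SICK_subsets.zip",
     "@type": "DataDownload", "fileFormat": "zip"},
    {"contentUrl": "https://zenodo.org/api/files/e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4/SICK.zip",
     "@type": "DataDownload", "fileFormat": "zip"}
  ]
}
""")


def download_urls(record):
    """Return the contentUrl of every DataDownload entry in the record."""
    return [
        entry["contentUrl"]
        for entry in record.get("distribution", [])
        if entry.get("@type") == "DataDownload"
    ]


# Pair counts copied from the "description" field of the record.
RELATEDNESS_BINS = {"[1,2)": 923, "[2,3)": 1373, "[3,4)": 3872, "[4,5]": 3672}
ENTAILMENT_LABELS = {"neutral": 5595, "contradiction": 1424, "entailment": 2821}


def shares(counts):
    """Convert raw counts into percentages of the total, rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


if __name__ == "__main__":
    for url in download_urls(RECORD):
        print(url)
    # Both annotations cover the same set of pairs, so the totals must match.
    print(sum(RELATEDNESS_BINS.values()), sum(ENTAILMENT_LABELS.values()))
    print(shares(ENTAILMENT_LABELS))
```

Both totals come to 9,840 pairs, which is consistent with the "about 10,000" figure given in the description.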
