Dataset Open Access

The SICK (Sentences Involving Compositional Knowledge) dataset for relatedness and entailment

Marco Marelli; Stefano Menini; Marco Baroni; Luisa Bentivogli; Raffaella Bernardi; Roberto Zamparelli


JSON Export

{
  "files": [
    {
      "links": {
        "self": "https://zenodo.org/api/files/e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4/SICK_Annotated.zip"
      }, 
      "checksum": "md5:1c0e709f59f92e4cfa4b0a613027ac1b", 
      "bucket": "e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4", 
      "key": "SICK_Annotated.zip", 
      "type": "zip", 
      "size": 246005
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4/SICK_subsets.zip"
      }, 
      "checksum": "md5:eeef24604e01a05e76406b290e241359", 
      "bucket": "e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4", 
      "key": "SICK_subsets.zip", 
      "type": "zip", 
      "size": 16083
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4/SICK.zip"
      }, 
      "checksum": "md5:b03210036cc64a4cf1bc7a0525357001", 
      "bucket": "e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4", 
      "key": "SICK.zip", 
      "type": "zip", 
      "size": 217584
    }
  ], 
  "owners": [
    67146
  ], 
  "doi": "10.5281/zenodo.2787612", 
  "stats": {
    "version_unique_downloads": 73.0, 
    "unique_views": 303.0, 
    "views": 330.0, 
    "downloads": 119.0, 
    "unique_downloads": 73.0, 
    "version_unique_views": 302.0, 
    "volume": 22914053.0, 
    "version_downloads": 119.0, 
    "version_views": 329.0, 
    "version_volume": 22914053.0
  }, 
  "links": {
    "doi": "https://doi.org/10.5281/zenodo.2787612", 
    "conceptdoi": "https://doi.org/10.5281/zenodo.2787611", 
    "bucket": "https://zenodo.org/api/files/e7ef41dd-f3b1-4bf6-9fcb-ceb3980069a4", 
    "conceptbadge": "https://zenodo.org/badge/doi/10.5281/zenodo.2787611.svg", 
    "html": "https://zenodo.org/record/2787612", 
    "latest_html": "https://zenodo.org/record/2787612", 
    "badge": "https://zenodo.org/badge/doi/10.5281/zenodo.2787612.svg", 
    "latest": "https://zenodo.org/api/records/2787612"
  }, 
  "conceptdoi": "10.5281/zenodo.2787611", 
  "created": "2019-05-13T12:47:32.305048+00:00", 
  "updated": "2019-05-13T13:33:16.762103+00:00", 
  "conceptrecid": "2787611", 
  "revision": 2, 
  "id": 2787612, 
  "metadata": {
    "access_right_category": "success", 
    "doi": "10.5281/zenodo.2787612", 
    "description": "<p>The SICK data set consists of about 10,000 English sentence pairs, generated starting from two existing sets: the&nbsp;<a href=\"http://nlp.cs.illinois.edu/HockenmaierGroup/data.html\">8K ImageFlickr data set</a>&nbsp;and the&nbsp;<a href=\"http://www.cs.york.ac.uk/semeval-2012/task6/index.php?id=data\">SemEval 2012 STS MSR-Video Description data set</a>. We randomly selected a subset of sentence pairs from each of these sources and applied a 3-step generation process: first, the original sentences were normalized to remove unwanted linguistic phenomena; the normalized sentences were then expanded to obtain up to three new sentences with specific characteristics suitable for the evaluation of compositional distributional semantic models (CDSMs); as a last step, all the sentences generated in the expansion phase were paired with the normalized sentences in order to obtain the final data set.</p>\n\n<p>Each sentence pair was annotated for relatedness and entailment by means of crowdsourcing techniques. The&nbsp;<strong>sentence relatedness score</strong>&nbsp;(on a 5-point rating scale) provides a direct way to evaluate CDSMs, insofar as their outputs are meant to quantify the degree of semantic relatedness between sentences; the categorization in terms of the&nbsp;<strong>entailment relation between the two sentences</strong>&nbsp;(with&nbsp;<em>entailment, contradiction</em>, and&nbsp;<em>neutral</em>&nbsp;as gold labels) is also a crucial aspect to consider, since detecting the presence of entailment is one of the traditional benchmarks of a successful semantic system.</p>\n\n<p>In the final set, gold scores for relatedness and entailment were distributed as follows: the relatedness scoring resulted in 923 pairs within the [1,2) range, 1373 pairs within the [2,3) range, 3872 pairs within the [3,4) range, and 3672 pairs within the [4,5] range; the entailment annotation led to 5595&nbsp;<em>neutral</em>&nbsp;pairs, 1424&nbsp;<em>contradiction</em>&nbsp;pairs, and 2821&nbsp;<em>entailment</em>&nbsp;pairs.</p>\n\n<p><strong>Files</strong></p>\n\n<ul>\n\t<li>SICK.zip (main file)</li>\n\t<li>SICK_Annotated.zip (a&nbsp;version of the data set annotated for the expansion rule which was used in each case)</li>\n\t<li>SICK_subsets.zip (indexes specifying further classifications, used in the JLRE 2016 publication)</li>\n</ul>", 
    "license": {
      "id": "CC-BY-NC-SA-3.0"
    }, 
    "title": "The SICK (Sentences Involving Compositional Knowledge) dataset for relatedness and entailment", 
    "relations": {
      "version": [
        {
          "count": 1, 
          "index": 0, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "2787611"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "2787612"
          }
        }
      ]
    }, 
    "language": "eng", 
    "grants": [
      {
        "code": "283554", 
        "links": {
          "self": "https://zenodo.org/api/grants/10.13039/501100000780::283554"
        }, 
        "title": "Compositional Operations in Semantic Space", 
        "acronym": "COMPOSES", 
        "program": "FP7", 
        "funder": {
          "doi": "10.13039/501100000780", 
          "acronyms": [
            "EC"
          ], 
          "name": "European Commission", 
          "links": {
            "self": "https://zenodo.org/api/funders/10.13039/501100000780"
          }
        }
      }
    ], 
    "references": [
      "L. Bentivogli, R. Bernardi, M. Marelli, S. Menini, M. Baroni and R. Zamparelli (2016). SICK Through the SemEval Glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. Journal of Language Resources and Evaluation, 50(1), 95-124", 
      "M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi and R. Zamparelli (2014). A SICK cure for the evaluation of compositional distributional semantic models. Proceedings of LREC 2014, Reykjavik (Iceland): ELRA, 216-223."
    ], 
    "keywords": [
      "computational linguistics, entailment, sentence similarity, sentence relatedness, compositional semantics, distributional semantics"
    ], 
    "publication_date": "2014-05-26", 
    "creators": [
      {
        "affiliation": "Universit\u00e0 di Milano Bicocca", 
        "name": "Marco Marelli"
      }, 
      {
        "affiliation": "FBK", 
        "name": "Stefano Menini"
      }, 
      {
        "affiliation": "ICREA", 
        "name": "Marco Baroni"
      }, 
      {
        "affiliation": "FBK", 
        "name": "Luisa Bentivogli"
      }, 
      {
        "affiliation": "Universit\u00e0 di Trento", 
        "name": "Raffaella Bernardi"
      }, 
      {
        "affiliation": "Universit\u00e0 di Trento", 
        "name": "Roberto Zamparelli"
      }
    ], 
    "access_right": "open", 
    "resource_type": {
      "type": "dataset", 
      "title": "Dataset"
    }, 
    "related_identifiers": [
      {
        "scheme": "doi", 
        "relation": "isVersionOf", 
        "identifier": "10.5281/zenodo.2787611"
      }
    ]
  }
}
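
Each entry under "files" in the record above carries a "key", a "size" in bytes, and a "checksum" of the form "md5:<hexdigest>", which should be enough to check a downloaded archive for corruption. The following is a minimal sketch of such a check, not part of the record or of any Zenodo tooling: the function names `verify_checksum` and `verify_record_files` are hypothetical, and it assumes the JSON export has already been parsed into a Python dict and the archives were saved locally under their "key" names.

```python
import hashlib
from pathlib import Path


def verify_checksum(path, checksum, chunk_size=1 << 20):
    """Return True if the file at `path` matches an '<algo>:<hexdigest>' string.

    The algorithm name is taken from the prefix (e.g. 'md5'), as in the
    record's "checksum" fields.
    """
    algo, _, expected = checksum.partition(":")
    digest = hashlib.new(algo)
    with open(path, "rb") as fh:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()


def verify_record_files(record, directory):
    """Check every entry of record["files"] against files in `directory`.

    Returns a dict mapping each file key to True (size and checksum match),
    False (mismatch), or None (file not present locally).
    """
    results = {}
    for entry in record["files"]:
        local = Path(directory) / entry["key"]
        if not local.is_file():
            results[entry["key"]] = None
            continue
        size_ok = local.stat().st_size == entry["size"]
        results[entry["key"]] = size_ok and verify_checksum(local, entry["checksum"])
    return results
```

With the export loaded via `json.load`, calling `verify_record_files(record, "downloads")` would report, per archive, whether the local copy agrees with the size and MD5 digest listed in the record; the directory name here is only an example.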