# The SICK (Sentences Involving Compositional Knowledge) dataset for relatedness and entailment

Marco Marelli; Stefano Menini; Marco Baroni; Luisa Bentivogli; Raffaella Bernardi; Roberto Zamparelli

<p>The SICK data set consists of about 10,000 English sentence pairs derived from two existing sets: the 8K ImageFlickr data set and the SemEval 2012 STS MSR-Video Description data set. We randomly selected a subset of sentence pairs from each of these sources and applied a 3-step generation process. First, the original sentences were normalized to remove unwanted linguistic phenomena. Second, each normalized sentence was expanded to obtain up to three new sentences with specific characteristics suitable for evaluating compositional distributional semantic models (CDSMs). Finally, all the sentences generated in the expansion phase were paired with the normalized sentences to obtain the final data set.</p>

<p>Each sentence pair was annotated for relatedness and entailment through crowdsourcing. The sentence relatedness score (on a 5-point rating scale) provides a direct way to evaluate CDSMs, insofar as their outputs are meant to quantify the degree of semantic relatedness between sentences. The categorization of the entailment relation between the two sentences (with entailment, contradiction, and neutral as gold labels) is also a crucial aspect to consider, since detecting the presence of entailment is one of the traditional benchmarks of a successful semantic system.</p>

<p>In the final set, gold scores for relatedness and entailment were distributed as follows: the relatedness scoring resulted in 923 pairs within the [1,2) range, 1373 pairs within the [2,3) range, 3872 pairs within the [3,4) range, and 3672 pairs within the [4,5] range; the entailment annotation led to 5595 neutral pairs, 1424 contradiction pairs, and 2821 entailment pairs.</p>
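
<p>As a quick consistency check, the reported counts can be tallied and a relatedness score can be mapped to the ranges used above. The following is a minimal sketch, not part of the distributed files: the function name, the dictionary names, and the label spellings are illustrative assumptions, and only the counts themselves come from the description above.</p>

```python
# Counts copied from the description; names and label spellings are illustrative.
REPORTED_RELATEDNESS = {"[1,2)": 923, "[2,3)": 1373, "[3,4)": 3872, "[4,5]": 3672}
REPORTED_ENTAILMENT = {"neutral": 5595, "contradiction": 1424, "entailment": 2821}


def relatedness_bin(score: float) -> str:
    """Map a gold relatedness score in [1, 5] to one of the four reported ranges."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"relatedness score out of range: {score}")
    # The last range is closed on the right, so a score of exactly 5 belongs to [4,5].
    lower = min(int(score), 4)
    closing = "]" if lower == 4 else ")"
    return f"[{lower},{lower + 1}{closing}"


if __name__ == "__main__":
    # Both annotations cover the same set of pairs, so the totals should agree.
    print(sum(REPORTED_RELATEDNESS.values()))
    print(sum(REPORTED_ENTAILMENT.values()))
    print(relatedness_bin(3.6))
```

<p>Both totals come to 9,840 pairs, which matches the "about 10,000" figure given for the data set.</p>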

<p><strong>Files</strong></p>

<ul>
	<li>SICK.zip (main file)</li>
	<li>SICK_Annotated.zip (a version of the data set annotated with the expansion rule used in each case)</li>
	<li>SICK_subsets.zip (indexes specifying further classifications, used in the JLRE 2016 publication)</li>
</ul>
