Dataset (Open Access)

SemEval-2021 Task 12: Learning with Disagreements

Uma, Alexandra Nnemamaka; Fornaciari, Tommaso; Dumitrache, Anca; Miller, Tristan; Chamberlain, Jon; Plank, Barbara; Simpson, Edwin; Poesio, Massimo

This repository contains the Post-Evaluation data for SemEval-2021 Task 12: Learning with Disagreements, a shared task on learning to classify from datasets containing disagreements.

The aim of this shared task is to provide a unified testing framework for learning from disagreements, using the best-known datasets containing information about disagreements in interpreting language and classifying images:

    1. LabelMe-IC: Image classification using a subset of LabelMe (Russell et al., 2008), a widely used, community-created image classification dataset in which each image is assigned to one of 8 categories: highway, inside city, tall building, street, forest, coast, mountain, open country. Rodrigues and Pereira (2017) collected crowd labels for these images using Amazon Mechanical Turk (AMT).

    2. CIFAR10-IC: Image classification using a subset of the CIFAR-10 dataset (Krizhevsky, 2009). The full dataset consists of colour images in 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Crowdsourced labels for this dataset were collected by Peterson et al. (2019).

    3. PDIS: Information status classification using the Phrase Detectives dataset (Poesio et al., 2019). Information status (IS) classification involves identifying whether a noun phrase refers to new information or to old information.

    4. Gimpel-POS: Part-of-speech tagging of Twitter posts using the Gimpel dataset (Gimpel et al., 2011). Plank et al. (2014b) mapped the Gimpel tags to the universal tag set (Petrov et al., 2011), used these tags as gold labels, and collected crowdsourced labels.

    5. Humour: Ranking one-line texts using pairwise funniness judgements (Simpson et al., 2019). Crowdworkers annotated pairs of puns to indicate which is funnier. A gold-standard ranking was produced using a large number of redundant annotations; the goal is to infer this ranking from a reduced number of crowdsourced judgements.
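A common way to exploit the disagreements these datasets preserve is to train on soft label distributions rather than a single majority-vote label. A minimal sketch of that idea, assuming hypothetical annotations for one LabelMe-IC image (the function name, vote values, and input format are illustrative, not the dataset's actual schema):

```python
from collections import Counter

def soft_labels(crowd_labels, categories):
    """Turn one item's crowd annotations into a probability distribution.

    crowd_labels: the category names chosen by individual annotators.
    categories:   the fixed label set for the task.
    """
    counts = Counter(crowd_labels)
    total = sum(counts.values())
    return {c: counts[c] / total for c in categories}

# Hypothetical votes from 5 annotators who disagree about one image.
votes = ["coast", "coast", "open country", "coast", "open country"]
cats = ["highway", "inside city", "tall building", "street",
        "forest", "coast", "mountain", "open country"]

dist = soft_labels(votes, cats)
# dist["coast"] == 0.6, dist["open country"] == 0.4
```

Training against such a distribution (e.g. with a cross-entropy loss over soft targets) keeps the information about annotator disagreement that a hard majority-vote label would discard.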

The files contained in this data collection are as follows:

    - Base models provided for the shared task.
    - The training and development data used during the Practice Phase of the competition.
    - The test data used during the Evaluation Phase of the competition.

Details of the format of each dataset for each task can be found on CodaLab.

This research is supported in part by Independent Research Fund Denmark (DFF) grants 9131-00019B and 9063-00077B.
Files (330.2 MB)
  • Gimpel et al. (2011). Part-of-speech tagging for Twitter: Annotation, features, and experiments.

  • Russell et al. (2008). LabelMe: A database and Web-based tool for image annotation.

  • Rodrigues and Pereira (2018). Deep learning from crowds.

  • Krizhevsky (2009). Learning multiple layers of features from tiny images.

  • Peterson et al. (2019). Human uncertainty makes classification more robust.

                   All versions   This version
Views                       180            180
Downloads                    25             25
Data volume              8.3 GB         8.3 GB
Unique views                166            166
Unique downloads             23             23

