Published July 23, 2021 | Version v1
Dataset Open

SemEval-2021 Task 12: Learning with Disagreements

  • 1. Queen Mary University of London
  • 2. Bocconi University
  • 3. Albert Heijn
  • 4. Austrian Research Institute for Artificial Intelligence
  • 5. University of Essex
  • 6. IT University of Copenhagen
  • 7. University of Bristol

Description

This repository contains the Post-Evaluation data for SemEval-2021 Task 12: Learning with Disagreements, a shared task on learning to classify with datasets containing disagreements.

The aim of this shared task is to provide a unified testing framework for learning from disagreements, using some of the best-known datasets containing information about disagreements in interpreting language and classifying images (a brief illustrative sketch of working with such crowd labels follows the list):

    1. LabelMe-IC: Image Classification using a subset of LabelMe (Russell et al., 2008), a widely used, community-created image classification dataset in which each image is assigned to one of eight categories: highway, inside city, tall building, street, forest, coast, mountain, open country. Rodrigues and Pereira (2018) collected crowd labels for these images using Amazon Mechanical Turk (AMT).

    2. CIFAR10-IC: Image Classification using a subset of the CIFAR-10 dataset (Krizhevsky, 2009), https://www.cs.toronto.edu/~kriz/cifar.html. The full dataset consists of colour images in 10 categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). Crowdsourced labels for this dataset were collected by Peterson et al. (2019).

    3. PDIS: Information Status Classification using the Phrase Detectives dataset (Poesio et al., 2019). Information Status (IS) classification involves identifying the information status of a noun phrase: whether that noun phrase refers to new information or to old information.

    4. Gimpel-POS: Part-of-Speech tagging of Twitter posts using the Gimpel dataset (Gimpel et al., 2011). Plank et al. (2014b) mapped the Gimpel tags to the universal tag set (Petrov et al., 2011), used these mapped tags as gold labels, and collected crowdsourced labels.

    5. Humour: ranking one-liner texts using pairwise funniness judgements (Simpson et al., 2019). Crowd workers annotated pairs of puns to indicate which of the two is funnier. A gold-standard ranking was produced using a large number of redundant annotations; the goal is to infer this ranking from a reduced number of crowdsourced judgements.
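
The common thread across these tasks is that each item comes with multiple, possibly conflicting crowd annotations rather than a single gold answer. As a minimal, purely illustrative sketch (not the official data format of this task; the label set is taken from the LabelMe-IC description above, while the item IDs, annotations, and field layout are hypothetical), the snippet below aggregates per-item crowd labels into a soft label distribution alongside a majority-vote hard label:

    from collections import Counter

    # Label set from the LabelMe-IC description above; the annotations
    # themselves are invented for illustration only.
    LABELS = ["highway", "inside city", "tall building", "street",
              "forest", "coast", "mountain", "open country"]

    crowd_annotations = {
        "img_0001": ["coast", "coast", "open country"],
        "img_0002": ["street", "inside city", "street", "street"],
    }

    def soft_label(votes, labels):
        """Turn a list of crowd labels into a probability distribution."""
        counts = Counter(votes)
        total = sum(counts.values())
        return [counts.get(label, 0) / total for label in labels]

    for item_id, votes in crowd_annotations.items():
        dist = soft_label(votes, LABELS)
        hard = LABELS[dist.index(max(dist))]  # majority vote as a hard label
        print(item_id, hard, [round(p, 2) for p in dist])

Keeping the full distribution, rather than only the majority label, is what allows models to learn from (and be evaluated against) the disagreement itself; the actual formats and evaluation metrics used in the shared task are documented on Codalab.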


The files contained in this data collection are as follows:
starting_kit.zip - Baseline models provided for the shared task.
practice_phase_data.zip - The training and development data used during the Practice Phase of the competition. 
test_phase_data.zip - The test data, used during the Evaluation Phase of the competition.
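
As a quick sanity check after downloading, something like the following lists the contents of one of the archives (the archive name is taken from the listing above; its internal layout is not described here and may differ):

    import zipfile

    # List the contents of one of the archives from this record.
    with zipfile.ZipFile("practice_phase_data.zip") as archive:
        for name in archive.namelist():
            print(name)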

Details of the format of each dataset for each task can be found on Codalab.

Notes

This research is supported in part by the Independent Research Fund Denmark (DFF) grants 9131-00019B and 9063-00077B.

Files

SemEval-2021-Task12-Dataset.zip (330.2 MB)
md5:1b496891e2231189774aae2a799f0275

Additional details

Funding

  • Computational Pun-derstanding (M 2625), FWF Austrian Science Fund
  • DALI – Disagreements and Language Interpretation (695662), European Commission

References

  • Gimpel et al. (2011). Part-of-speech tagging for Twitter: Annotation, features, and experiments.
  • Russell et al. (2008). LabelMe: A database and web-based tool for image annotation.
  • Rodrigues and Pereira (2018). Deep learning from crowds.
  • Krizhevsky (2009). Learning multiple layers of features from tiny images.
  • Peterson et al. (2019). Human uncertainty makes classification more robust.