Published January 30, 2022 | Version v1
Dataset Open

Data of the Shared Task on the Disambiguation of German Verbal Idioms at KONVENS 2021

  • 1. Heinrich Heine University
  • 2. University of Tübingen

Description

This dataset was used in the Shared Task on the Disambiguation of German Verbal Idioms (VID) at KONVENS 2021. For further details, please refer to the description paper of the shared task:

Ehren, Rafael, Timm Lichte, Jakub Waszczuk & Laura Kallmeyer. 2021. Shared Task on the Disambiguation of German Verbal Idioms at KONVENS 2021. In Proceedings of the Shared Task on the Disambiguation of German Verbal Idioms at KONVENS 2021. https://doi.org/10.5281/zenodo.5730322. https://konvens.org/proceedings/2021/index.html.

Please cite this paper when using the dataset.

The content of the zip file is identical to that of the data directory in the GitHub repository of the shared task.

The dataset consists of 9901 instances of a German VID type or its literal counterpart in context. Because the set of VID types was pre-selected, it constitutes a lexical sample dataset. It is a merger of two datasets:

The data comes in TSV files, and every line has the following format:

Instance_ID \t VID_type \t label \t text

Consider this example:

T890202.28.4077	in wasser fallen	figuratively	Der Streit ums Hormonfleisch zwischen USA und EG provozierte den Polizeieinsatz . Aber nicht nur der Steakverkauf , auch die Aktionen gegen den Hormonstand , auf die sich Gruppen der Bauernopposition schon vorbereitet hatten , <b>fielen</b> <b>ins</b> <b>Wasser</b> . Die Fleischexporteure der USA wollten ihrerseits die " Grüne Woche " zur " Aufklärung " nutzen .

The first column contains the ID (T890202.28.4077 in the example), the second the VID type (in wasser fallen), the third the label (figuratively), and the fourth the sentence containing either the instance of the VID type or its literal counterpart, together with two additional context sentences. The parts of the target expression are marked with <b> tags (<b>fielen</b> <b>ins</b> <b>Wasser</b>). There are four possible labels:

  • figuratively
  • literally
  • undecidable
  • both

The first two should be self-explanatory. The label undecidable was used by the annotators if it was not possible to disambiguate an instance given the context. The label both was applied when both the literal and the idiomatic readings were active.
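The line format and label set described above can be read with a few lines of Python. The following is only a minimal sketch, not part of the shared task's own tooling: the names Instance, parse_line and load_tsv are hypothetical, and it assumes UTF-8 encoded files with no header row and exactly four tab-separated columns. The example text is shortened from the instance shown above.

```python
import re
from typing import Iterator, List, NamedTuple

# The four labels listed in the dataset description.
LABELS = {"figuratively", "literally", "undecidable", "both"}

# Matches the content of each <b>...</b> span marking the target expression.
B_TAG = re.compile(r"<b>(.*?)</b>")


class Instance(NamedTuple):  # hypothetical container, not from the dataset
    instance_id: str
    vid_type: str
    label: str
    text: str
    target_tokens: List[str]


def parse_line(line: str) -> Instance:
    """Split one line into its four columns and collect the tagged tokens."""
    # maxsplit=3 keeps the text column in one piece.
    instance_id, vid_type, label, text = line.rstrip("\n").split("\t", 3)
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return Instance(instance_id, vid_type, label, text, B_TAG.findall(text))


def load_tsv(path: str) -> Iterator[Instance]:
    """Yield one Instance per non-empty line (assumes UTF-8, no header)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


# Shortened version of the example instance from the description.
example = (
    "T890202.28.4077\tin wasser fallen\tfiguratively\t"
    "Aber nicht nur der Steakverkauf <b>fielen</b> <b>ins</b> <b>Wasser</b> ."
)
inst = parse_line(example)
print(inst.vid_type, inst.label, inst.target_tokens)
```

Raising an error on an unknown label is a design choice for this sketch; a more lenient reader could skip or log such lines instead.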

Files

st-vid-disambiguation-konvens-2021.zip (2.2 MB)
md5:a3209bc3904ad8f345d98dbcd543128a