Lexical Relations from the Wisdom of the Crowd 1.0

doi:10.5281/zenodo.291991

Published February 15, 2017 | Version v1.0

Dataset Open

Creators

Ustalov, Dmitry (NLPub)

Contributors

Other: Loukachevitch, Natalia (Moscow State University)

Description

The 300 most frequent nouns were extracted from the Russian National Corpus. For each noun, every method or resource under comparison, including RuThes, produced at most five hypernyms; when a method or resource could not provide any, the missing answers were treated as empty results. This yielded 9 322 unique non-empty subsumption pairs, which were submitted for crowdsourced annotation on the Yandex.Toloka microtask platform. Each pair was annotated by seven different annotators who were native speakers of Russian and at least 20 years old as of February 1, 2017.

The human intelligence task (HIT) is designed to elicit a direct answer to a simple question: does the given pair of words represent a meaningful is-a relation? Because the crowd workers are not expert lexicographers and may find this question difficult, it was rephrased as, for example, “Is it correct that a kitten is a kind of mammal?” (in Russian).

The answers were aggregated using the proprietary Yandex.Toloka answer aggregation mechanism. As a result, 3 940 of the 9 322 pairs were annotated as positive and the remaining 5 382 as negative.
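The aggregation mechanism used for this dataset is proprietary and is not described here. Purely as an illustration of how seven binary judgments per pair can be reduced to one label, the sketch below uses simple majority voting, which is an assumption and not the Yandex.Toloka method. The word pairs and votes are invented for the example; only the three totals come from the description above.

```python
from collections import Counter


def majority_vote(judgments):
    """Return the most frequent label among an odd number of binary judgments."""
    counts = Counter(judgments)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical votes from seven annotators per pair (not taken from the dataset).
votes = {
    ("kitten", "mammal"): [True, True, True, False, True, True, False],
    ("kitten", "furniture"): [False] * 7,
}
aggregated = {pair: majority_vote(labels) for pair, labels in votes.items()}

# Totals reported in the description: positive and negative pairs cover all pairs.
positive, negative, total = 3940, 5382, 9322
assert positive + negative == total
positive_share = positive / total  # fraction of pairs judged to be is-a relations
```

With an odd number of annotators and two possible answers, a tie cannot occur, so plain majority voting always yields a label. Weighted schemes that account for annotator reliability are a common alternative.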

Interestingly, the workers were more confident in their negative answers than in their positive ones. These negative answers are extremely useful for both training and testing relation extraction methods. To the best of our knowledge, this is the first dataset of this kind created for the Russian language using microtask-based crowdsourcing.

Notes

Ustalov, D.: Expanding Hierarchical Contexts for Constructing a Semantic Word Network. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual conference "Dialogue". Volume 1 of 2. Computational Linguistics: Practical Applications. pp. 369–381. RSUH, Moscow, Russia (2017)

Files

Files (16.6 MB)

Name	MD5	Size
lrwc-1.0-aggregated.tsv	83f09c5bb3f4730e9878f2182eb86088	670.0 kB
lrwc-1.0-assignments.tsv	42dddb6bff4c709440a55b1e6e042cd9	15.9 MB
toloka-isa-50-skip-300-train-hit.tsv	831878245bb9e5b41c08f4a5f58cc3de	6.1 kB
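The MD5 checksums listed above can be used to confirm that downloaded copies are intact. The following is a minimal sketch, assuming the files have been saved under their original names in a local directory; the helper names `md5_of` and `verify` are hypothetical and introduced only for this example.

```python
import hashlib
from pathlib import Path

# File names and checksums as listed in the Files table of this record.
EXPECTED_MD5 = {
    "lrwc-1.0-aggregated.tsv": "83f09c5bb3f4730e9878f2182eb86088",
    "lrwc-1.0-assignments.tsv": "42dddb6bff4c709440a55b1e6e042cd9",
    "toloka-isa-50-skip-300-train-hit.tsv": "831878245bb9e5b41c08f4a5f58cc3de",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each file name to True/False (checksum match) or None (file missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.exists() else None
    return results
```

Calling `verify("path/to/downloads")` returns a dictionary keyed by file name, so a mismatch or a missing file can be spotted before the data are used.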

Additional details

Cites: Book: 978-5-211-05926-9 (ISBN)
Is previous version of: Dataset: 10.5281/zenodo.546302 (DOI)
Is supplement to: Software: https://github.com/dustalov/watlink (URL); Conference paper: http://www.dialog-21.ru/media/3959/ustalovda.pdf (URL)

Ustalov, Dmitry (TBD) Expanding Hierarchical Contexts for Constructing a Semantic Word Network
Loukachevitch, Natalia (2011) Thesauri in Information Retrieval Tasks
Lyashevskaya, Olga et al. (2009) Frequency Dictionary of the Russian Language (on Russian National Corpus)