There is a newer version of the record available.

Published December 3, 2020 | Version skoltech2020
Presentation Open

Graphs, Computation, and Language

  • 1. Yandex

Description

Graphs and networks offer a convenient way to study the systems around us, including ones as complex as human language. Graph-based representations have proven effective for a wide variety of Natural Language Processing (NLP) tasks. In this course we will seek answers to three questions: (1) how to express linguistic phenomena as graphs, (2) how to gain knowledge from them, and (3) how to assess the quality of this knowledge. Since most methods described in this course are unsupervised, special attention is paid to their thorough assessment using both automatic metrics and human judgements, including crowdsourcing. The target audience of this course is advanced graduate students, data analysts, and researchers in NLP and IR, but it is not limited to them.

Graph clustering makes it possible to extract useful knowledge by exploiting the implicit structure of the data. In this lecture we will introduce the problems of non-overlapping and overlapping graph clustering. We will demonstrate and describe in detail several efficient clustering algorithms widely used in NLP for both problems, including Chinese Whispers and Markov Clustering for non-overlapping clustering (aka partitioning), and MaxMax and Watset for overlapping (aka fuzzy) clustering. We will show their strengths and weaknesses as well as their implementations and successful applications in word sense and frame induction. Then, we will focus on the evaluation of unsupervised NLP methods using pairwise precision, recall, F-score, and (inverse) purity and its modifications. Finally, we will discuss randomization-based statistical tests for these measures, algorithm choice, and useful language resources for further study.
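As an illustration of the non-overlapping case and of pairwise evaluation, here is a minimal sketch and not the lecture's own implementation. It assumes a toy undirected weighted graph given as `(u, v, weight)` triples; the function names `chinese_whispers`, `to_clusters`, and `pairwise_prf`, the iteration cap, and the random tie-breaking are all illustrative choices. Chinese Whispers is randomized, so the resulting partition may differ between seeds.

```python
import random
from collections import defaultdict
from itertools import combinations


def chinese_whispers(edges, iterations=20, seed=0):
    """Label-propagation sketch of Chinese Whispers (hypothetical helper)."""
    rng = random.Random(seed)
    neighbors = defaultdict(dict)
    for u, v, w in edges:
        neighbors[u][v] = w
        neighbors[v][u] = w
    labels = {node: node for node in neighbors}  # each node starts alone
    nodes = list(neighbors)
    for _ in range(iterations):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            # Sum edge weights per label among the node's neighbors.
            scores = defaultdict(float)
            for other, w in neighbors[node].items():
                scores[labels[other]] += w
            best = max(scores.values())
            # Ties are broken at random (an assumption of this sketch).
            choice = rng.choice(sorted(l for l, s in scores.items() if s == best))
            if choice != labels[node]:
                labels[node] = choice
                changed = True
        if not changed:
            break
    return labels


def to_clusters(labels):
    """Group nodes that share a label into sets."""
    clusters = defaultdict(set)
    for node, label in labels.items():
        clusters[label].add(node)
    return list(clusters.values())


def pairwise_prf(predicted, gold):
    """Pairwise precision, recall, and F-score over same-cluster item pairs."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

    pred_pairs, gold_pairs = pairs(predicted), pairs(gold)
    tp = len(pred_pairs & gold_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy synonym graph with two disconnected groups of words (made-up data).
edges = [
    ("shore", "coast", 1.0), ("coast", "seaside", 1.0), ("shore", "seaside", 1.0),
    ("money", "cash", 1.0), ("cash", "funds", 1.0), ("money", "funds", 1.0),
]
gold = [{"shore", "coast", "seaside"}, {"money", "cash", "funds"}]
predicted = to_clusters(chinese_whispers(edges))
print(predicted, pairwise_prf(predicted, gold))
```

Because labels only travel along edges, words from the two disconnected groups can never end up in the same cluster here; whether each group collapses into a single cluster depends on the random order and tie-breaking.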

Crowdsourcing is an efficient approach to knowledge acquisition and data annotation that enables gathering and evaluating large-scale linguistic datasets. In this lecture we will focus on the practical use of human-assisted computation for language resource construction and evaluation. We will analyze three established approaches to crowdsourcing in NLP. First, we will consider the case study of Wikipedia and Wiktionary, which facilitate community effort using automatic quality control via content assessment and edit patrolling. Second, we will dive deep into microtask-based crowdsourcing, using reCAPTCHA and Mechanical Turk as examples. We will discuss task design and decomposition issues and then describe in detail the standard approaches to inter-annotator agreement evaluation (Krippendorff's α) and answer aggregation (Majority Vote and Dawid-Skene). Third, we will study various games with a purpose, including ESP Game, Infection Game for BabelNet, and OpenCorpora gamification. Finally, we will provide recommendations for ensuring high quality of crowdsourced annotation and show useful datasets for further study.
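To make the agreement and aggregation steps concrete, the following is a minimal sketch under stated assumptions, not the lecture's code. It assumes nominal labels stored as `{task: [label, ...]}`; the function names `majority_vote` and `krippendorff_alpha_nominal` are illustrative, ties in the vote go to the alphabetically smallest label, and α is computed from the coincidence matrix as 1 − D_o/D_e. Dawid-Skene, which additionally models per-annotator error rates, is not shown.

```python
from collections import Counter
from itertools import permutations


def majority_vote(answers):
    """Pick the most frequent label per task; ties go to the smallest label."""
    result = {}
    for task, labels in answers.items():
        counts = Counter(labels)
        top = max(counts.values())
        result[task] = min(l for l, c in counts.items() if c == top)
    return result


def krippendorff_alpha_nominal(answers):
    """Krippendorff's alpha for nominal labels via the coincidence matrix."""
    coincidences = Counter()
    for labels in answers.values():
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        # Every ordered pair of labels from different annotators counts 1/(m-1).
        for i, j in permutations(range(m), 2):
            coincidences[labels[i], labels[j]] += 1 / (m - 1)
    totals = Counter()
    for (c, _), value in coincidences.items():
        totals[c] += value
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one task with two or more labels")
    observed = sum(v for (c, k), v in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        # Only one label was ever used; alpha is formally undefined.
        # Returning 1.0 is an assumption of this sketch.
        return 1.0
    return 1.0 - observed / expected


# Made-up annotations: three tasks, two annotators each.
answers = {"t1": ["noun", "noun"], "t2": ["noun", "verb"], "t3": ["verb", "verb"]}
print(majority_vote(answers))
print(krippendorff_alpha_nominal(answers))
```

Majority Vote treats all annotators as equally reliable, which is why it is usually reported together with an agreement coefficient: a low α signals that the aggregated labels deserve less trust.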

Files

Crowdsourcing.pdf

Files (15.0 MB)

md5:e645b540f709450437d42413281125e2 (9.5 MB)
md5:cb799ee7dd02e99e8071b40cce615047 (5.5 MB)

Additional details

Related works

Is new version of
Presentation: 10.5281/zenodo.3960805 (DOI)
Presentation: 10.5281/zenodo.1161505 (DOI)