LatinISE test data for SemEval 2020 task 1 with additional token versions of the corpora

McGillivray, Barbara; Schlechtweg, Dominik; Dubossarsky, Haim; Tahmasebi, Nina; Hengchen, Simon

doi:10.5281/zenodo.3992738

Published August 20, 2020 | Version 3

Dataset Open

1. University of Cambridge
2. IMS, University of Stuttgart
3. University of Gothenburg
4. University of Helsinki

This data collection contains the Latin test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection:

a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`)
40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)
the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`)

The corpus data have been automatically lemmatized and part-of-speech tagged, and partially corrected by hand. For homonyms, the lemma is followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin, but only when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second.
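The homonym convention can be illustrated with a minimal Python sketch. The function name `split_homonym` is hypothetical and not part of the dataset; the sketch only assumes the 'lemma#N' notation described above, with a missing suffix meaning homonym 1.

```python
from typing import Tuple


def split_homonym(lemma: str) -> Tuple[str, int]:
    """Split a corpus lemma into (base form, homonym number).

    A lemma without a '#' suffix is treated as homonym 1,
    following the convention described in the text.
    """
    base, separator, number = lemma.partition("#")
    if not separator:
        return lemma, 1
    return base, int(number)


print(split_homonym("dico"))    # ('dico', 1)
print(split_homonym("dico#2"))  # ('dico', 2)
```

Keeping the homonym number separate from the base form makes it possible to group or distinguish homonyms as an analysis requires.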

__Corpus 1__

based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
language: Latin
time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC
size: ~1.7 million tokens
format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
encoding: UTF-8

__Corpus 2__

based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
language: Latin
time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD
size: ~9.4 million tokens
format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
encoding: UTF-8

Find more information on the data in the papers referenced below.

Besides the official lemma version of the corpora for SemEval-2020 Task 1, we also provide the raw token version (`corpus1/token/`, `corpus2/token/`). It contains the raw sentences in the same order as in the lemma version. More information on SemEval-2020 Task 1 can be found in the task paper referenced below.
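Because the token version keeps the sentences in the same order as the lemma version, the two can be paired sentence by sentence. The sketch below is only an illustration: it assumes one sentence per line with whitespace-separated words, which this record does not state, and the helper name `pair_sentences` and the example sentences are invented. It pairs whole sentences only, since nothing here guarantees that a sentence has the same number of words in both versions.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, List, Tuple


def pair_sentences(
    lemma_lines: Iterable[str], token_lines: Iterable[str]
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (lemma sentence, token sentence) pairs, line by line.

    Assumption: one sentence per line, words separated by whitespace.
    Raises ValueError if the two inputs have different numbers of lines.
    """
    missing = object()  # sentinel marking the shorter input
    paired = zip_longest(lemma_lines, token_lines, fillvalue=missing)
    for line_number, (lemma_line, token_line) in enumerate(paired, start=1):
        if lemma_line is missing or token_line is missing:
            raise ValueError(f"inputs differ in length at line {line_number}")
        yield lemma_line.split(), token_line.split()


# Invented example lines, not taken from the corpora.
lemmas = ["dico verbum", "video urbs"]
tokens = ["dixit verba", "vidit urbem"]
for lemma_sentence, token_sentence in pair_sentences(lemmas, tokens):
    print(lemma_sentence, token_sentence)
```

The function accepts any iterable of lines, so open file handles can be passed in directly once the archive has been unpacked.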

The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF).

References

Schlechtweg, D., McGillivray, B., Hengchen, S., Dubossarsky, H. and Tahmasebi, N. (2020). SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020.

McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.

Files (62.2 MB)

Name	Size
semeval2020_ulscd_lat.zip md5:83e8a6e940c79307d1767ad22dfd7e71	62.2 MB
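The MD5 checksum listed with the file can be used to check that a download is complete. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the local file path are assumptions for illustration, while the expected digest is the one given in the file listing.

```python
import hashlib

# Digest copied from the file listing of this record.
EXPECTED_MD5 = "83e8a6e940c79307d1767ad22dfd7e71"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Check whether the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()


# Usage (assumes the archive was saved under this name in the working directory):
# verify_download("semeval2020_ulscd_lat.zip")
```

Reading in chunks keeps memory use low for an archive of this size.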

Additional details

Funding

UK Research and Innovation
The Alan Turing Institute EP/N510129/1
