10.5281/zenodo.3734089
https://zenodo.org/records/3734089
oai:zenodo.org:3734089
McGillivray, Barbara
Barbara
McGillivray
University of Cambridge
Schlechtweg, Dominik
Dominik
Schlechtweg
IMS, University of Stuttgart
Dubossarsky, Haim
Haim
Dubossarsky
University of Cambridge
Tahmasebi, Nina
Nina
Tahmasebi
University of Gothenburg
Hengchen, Simon
Simon
Hengchen
University of Helsinki
LatinISE test data for SemEval 2020 task 1
Zenodo
2020
Latin, corpus
2020-03-31
lat
10.5281/zenodo.3674098
2
Creative Commons Attribution 4.0 International
This data collection contains the Latin test data for SemEval 2020 Task 1 (Unsupervised Lexical Semantic Change Detection):
a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`)
40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)
the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`)
The corpus data have been automatically lemmatized and part-of-speech tagged, and partially corrected by hand. For homonyms, the lemma is followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin, but only when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym.
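The homonym convention above can be unpacked with a few lines of Python. This is only a minimal sketch: the function name `split_homonym` is hypothetical and not part of the data release, and it assumes that a lemma without a '#' suffix always denotes homonym 1, as the description states.

```python
from typing import Tuple


def split_homonym(lemma: str) -> Tuple[str, int]:
    """Split a lemma such as 'dico#2' into its base form and homonym number.

    Hypothetical helper: lemmas without a '#<number>' suffix are treated
    as the first homonym, following the convention described above.
    """
    base, sep, number = lemma.partition("#")
    if sep and number.isdigit():
        return base, int(number)
    # No usable suffix: return the lemma unchanged as homonym 1.
    return lemma, 1


print(split_homonym("dico"))    # ('dico', 1)
print(split_homonym("dico#2"))  # ('dico', 2)
```

Keeping the suffix attached is usually the safer default when matching against `targets.txt`, since the two homonyms are distinct lemmas in the corpora; splitting is mainly useful for dictionary lookups.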
__Corpus 1__
based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
language: Latin
time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC
size: ~1.7 million tokens
format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
encoding: UTF-8
__Corpus 2__
based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
language: Latin
time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD
size: ~9.4 million tokens
format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
encoding: UTF-8
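As a rough illustration of working with this format, the sketch below counts how often target lemmas occur in a corpus. It rests on assumptions not stated in this record: that each line holds one sentence and that lemmas are separated by whitespace. The function name `count_targets` and the toy sentences are made up for the example and are not taken from the data.

```python
from collections import Counter
from typing import Iterable, Set


def count_targets(sentences: Iterable[str], targets: Set[str]) -> Counter:
    """Count occurrences of target lemmas.

    Assumes one sentence per line and whitespace-separated lemmas
    (an assumption about the file layout, not a documented guarantee).
    """
    counts: Counter = Counter()
    for line in sentences:
        for token in line.split():
            if token in targets:
                counts[token] += 1
    return counts


# Toy input, invented for illustration only.
toy_corpus = ["dico#2 res publica", "res dico"]
toy_targets = {"dico#2", "res"}
print(count_targets(toy_corpus, toy_targets))
```

Because the function accepts any iterable of lines, an open file handle for a corpus file (opened with `encoding="utf-8"`, or through `gzip.open` in text mode if the file is compressed) can be passed in directly, so a corpus of several million tokens never has to be loaded into memory at once.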
Find more information on the data in the papers referenced below.
References
Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020.
McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
UK Research and Innovation
10.13039/100014013
EP/N510129/1
The Alan Turing Institute