McGillivray, Barbara
Schlechtweg, Dominik
Dubossarsky, Haim
Tahmasebi, Nina
Hengchen, Simon
2020-03-31
<p>This data collection contains the Latin test data for <a href="https://competitions.codalab.org/competitions/20948">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</a>:</p>
<ul>
<li>a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`)</li>
<li>40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)</li>
<li>the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`)</li>
</ul>
<p>The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym, cf. Lewis-Short dictionary.</p>
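<p>The homonym convention described above can be handled with a small helper. The following is a minimal sketch, not part of the dataset: the function name `split_homonym` is hypothetical, and it assumes the files use a plain '#' separator followed by a number.</p>

```python
from typing import Tuple


def split_homonym(lemma: str) -> Tuple[str, int]:
    """Split a lemma such as 'dico#2' into its base form and homonym number.

    A lemma without a '#' suffix is treated as the first homonym,
    following the convention stated in the description.
    """
    base, sep, number = lemma.partition("#")
    if sep and number.isdigit():
        return base, int(number)
    # No suffix (or an unexpected one): keep the lemma whole, homonym 1.
    return lemma, 1


print(split_homonym("dico"))
print(split_homonym("dico#2"))
```

<p>Keeping the suffix attached when matching targets, and splitting only for dictionary lookups, avoids merging distinct homonyms by accident.</p>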
<p><strong>Corpus 1</strong></p>
<ul>
<li>based on: <a href="http://hdl.handle.net/11372/LRT-3170">LatinISE</a> (McGillivray and Kilgarriff 2013), <a href="https://app.sketchengine.eu/#dashboard?corpname=preloaded/latinise_4">version on Sketch Engine</a></li>
<li>language: Latin</li>
<li>time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC</li>
<li>size: ~1.7 million tokens</li>
<li>format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled</li>
<li>encoding: UTF-8</li>
</ul>
<p><strong>Corpus 2</strong></p>
<ul>
<li>based on: <a href="http://hdl.handle.net/11372/LRT-3170">LatinISE</a> (McGillivray and Kilgarriff 2013), <a href="https://app.sketchengine.eu/#dashboard?corpname=preloaded/latinise_4">version on Sketch Engine</a></li>
<li>language: Latin</li>
<li>time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD</li>
<li>size: ~9.4 million tokens</li>
<li>format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled</li>
<li>encoding: UTF-8</li>
</ul>
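<p>Given the format listed for both corpora (lemmatized, no punctuation), target frequencies can be counted per corpus as a first sanity check. This is a minimal sketch under assumptions not stated in the description: one sentence per line with whitespace-separated lemmas. The function name `count_targets` and the sample lines are made up for illustration.</p>

```python
from collections import Counter
from typing import Iterable, Set


def count_targets(lines: Iterable[str], targets: Set[str]) -> Counter:
    """Count how often each target lemma occurs in an iterable of sentences.

    Assumes each line is one sentence of whitespace-separated lemmas;
    homonym suffixes such as '#2' are kept as part of the lemma.
    """
    counts: Counter = Counter()
    for line in lines:
        counts.update(token for token in line.split() if token in targets)
    return counts


# Made-up sample lines, not taken from the corpora.
sample = ["dico#2 verbum", "dico lex dico"]
print(count_targets(sample, {"dico", "dico#2"}))
```

<p>Because the function accepts any iterable of strings, a file object opened with UTF-8 encoding can be passed directly, which streams the corpus without loading it into memory.</p>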
<p>Find more information on the data in the papers referenced below.</p>
<p><strong>References</strong></p>
<p>Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi (2020). <a href="https://competitions.codalab.org/competitions/20948">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</a>. To appear in SemEval@COLING2020.</p>
<p>McGillivray, B. and Kilgarriff, A. (2013). <a href="https://www.sketchengine.co.uk/wp-content/uploads/2015/05/Latin_historical_corpus_2013.pdf">Tools for historical corpus research, and a corpus of Latin</a>. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.</p>
https://doi.org/10.5281/zenodo.3734089
oai:zenodo.org:3734089
lat
Zenodo
https://doi.org/10.5281/zenodo.3674098
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Latin, corpus
LatinISE test data for SemEval 2020 task 1
info:eu-repo/semantics/other