3672950
doi
10.5281/zenodo.3672950
oai:zenodo.org:3672950
user-natural-language-processing
Hengchen, Simon
University of Helsinki
Schlechtweg, Dominik
IMS, University of Stuttgart
McGillivray, Barbara
The Alan Turing Institute
Dubossarsky, Haim
University of Cambridge
Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection
Tahmasebi, Nina
Språkbanken, University of Gothenburg
info:eu-repo/semantics/openAccess
Creative Commons Attribution 2.0 Generic
https://creativecommons.org/licenses/by/2.0/legalcode
unsupervised lexical semantic change detection, semantic change, SemEval2020, Kubhist2
<p>This data collection contains the Swedish test data for <a href="https://languagechange.org/semeval">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</a>:</p>
<p>- a Swedish text corpus pair (`corpus1/`, `corpus2/`)<br>
- 31 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)</p>
<p>We sample from the KubHist2 corpus, digitized by the National Library of Sweden and available through the Språkbanken corpus infrastructure Korp (<a href="https://www.researchgate.net/profile/Markus_Forsberg/publication/266352576_Korp_-_the_corpus_infrastructure_of_Sprakbanken/links/55bf1ee008aed621de121ba3/Korp-the-corpus-infrastructure-of-Sprakbanken.pdf">Borin et al., 2012</a>). The full corpus is available under a CC BY (attribution) license. Each word for which the lemmatizer in the Korp pipeline has found a lemma is replaced with that lemma. Where the lemmatizer cannot find a lemma, we leave the word as is (i.e., unlemmatized and not lower-cased). KubHist contains very frequent OCR errors, especially in the older data. More detail about the properties and quality of the KubHist corpus can be found in <a href="https://www.diva-portal.org/smash/get/diva2:1358014/FULLTEXT01.pdf#page=28">Adesam et al. (2019)</a>.</p>
<p>Borin, Lars, Markus Forsberg, and Johan Roxendal. "Korp - the corpus infrastructure of Språkbanken." <em>LREC</em>. 2012.</p>
<p>Adesam, Yvonne, Dana Dannélls, and Nina Tahmasebi. "Exploring the Quality of the Digital Historical Newspaper Archive KubHist." <em>DHN</em>. 2019.</p>
<p>Corpus 1</p>
<p>- based on: <a href="https://spraakbanken.gu.se/korp/?mode=kubhist">Kubhist2</a><br>
- language: Swedish<br>
- time covered: 1790-1830<br>
- size: ~71 million tokens<br>
- format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled<br>
- encoding: UTF-8<br>
- note: contains frequent OCR errors</p>
<p>Corpus 2</p>
<p>- based on: <a href="https://spraakbanken.gu.se/korp/?mode=kubhist">Kubhist2</a><br>
- language: Swedish<br>
- time covered: 1895-1903<br>
- size: ~111 million tokens<br>
- format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled<br>
- encoding: UTF-8<br>
- note: contains OCR errors</p>
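<p>A minimal sketch of how the target lemmas could be counted in one of the corpora. It assumes that each corpus file holds one sentence per line with whitespace-separated tokens and that `targets.txt` lists one lemma per line; neither layout is confirmed by this record. The function name `count_targets` and the sample sentences are hypothetical. Matching is case-sensitive because unlemmatized words are not lower-cased.</p>

```python
from collections import Counter
from typing import Iterable


def count_targets(sentences: Iterable[str], targets: Iterable[str]) -> Counter:
    """Count how often each target lemma occurs in an iterable of sentences.

    Assumption: tokens are separated by whitespace and matching is
    case-sensitive (unlemmatized words in the corpus keep their casing).
    """
    wanted = {t.strip() for t in targets if t.strip()}
    # Start every target at zero so unseen lemmas still appear in the result.
    counts = Counter({t: 0 for t in wanted})
    for sentence in sentences:
        for token in sentence.split():
            if token in wanted:
                counts[token] += 1
    return counts


# Made-up sample lines standing in for sentences read from a corpus file.
sample_corpus = ["en gammal bok ligga på bord", "bok vara gammal"]
sample_targets = ["bok", "bord", "hus"]
freq = count_targets(sample_corpus, sample_targets)
```

<p>With real data, an open file handle can be passed in place of the sample list, since the function only iterates over lines.</p>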
<p>Find more information on the data and SemEval2020 Task 1 in the paper referenced below.</p>
<p>Reference:</p>
<p>Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. <a href="https://competitions.codalab.org/competitions/20948">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</a>. To appear in SemEval@COLING2020.</p>
The creation of the data was supported by the project Towards Computational Lexical Semantic Change Detection funded by a project grant from the Swedish Research Council (2019–2022; dnr 2018-01184).
It has also been created as part of the effort to construct and develop a Swedish national research infrastructure in support of research based on language data. This infrastructure, Nationella språkbanken (the Swedish National Language Bank), is jointly funded for the period 2018–2024 by the Swedish Research Council (grant number 2017-00626) and its 10 partner institutions.
Zenodo
2020-02-19
info:eu-repo/semantics/other
3672949
user-natural-language-processing
v1
1593431670.811116
442719439
md5:02ccc30b1a340d97eff255df3451efc9
https://zenodo.org/records/3672950/files/semeval2020_ulscd_swe.zip
public
10.5281/zenodo.3672949
isVersionOf
doi
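The file size and MD5 checksum listed in this record can be used to check a downloaded copy of the archive. The following is a minimal sketch, assuming the size value 442719439 is given in bytes; the helper names `md5_of_file` and `matches_record` are hypothetical.

```python
import hashlib
from pathlib import Path

# Values copied from this record; the unit of the size is assumed to be bytes.
EXPECTED_MD5 = "02ccc30b1a340d97eff255df3451efc9"
EXPECTED_SIZE = 442719439


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Check a local file against the size and checksum listed in the record."""
    p = Path(path)
    return p.stat().st_size == EXPECTED_SIZE and md5_of_file(p) == EXPECTED_MD5
```

For example, `matches_record("semeval2020_ulscd_swe.zip")` would be expected to return `True` for an intact download saved under that name.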