Dataset Open Access
Tahmasebi, Nina; Hengchen, Simon; Schlechtweg, Dominik; McGillivray, Barbara; Dubossarsky, Haim
<?xml version='1.0' encoding='UTF-8'?> <record xmlns="http://www.loc.gov/MARC21/slim"> <leader>00000nmm##2200000uu#4500</leader> <datafield tag="999" ind1="C" ind2="5"> <subfield code="x">Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020.</subfield> </datafield> <datafield tag="041" ind1=" " ind2=" "> <subfield code="a">swe</subfield> </datafield> <datafield tag="653" ind1=" " ind2=" "> <subfield code="a">unsupervised lexical semantic change detection, semantic change, SemEval2020, Kubhist2</subfield> </datafield> <controlfield tag="005">20200629115430.0</controlfield> <datafield tag="500" ind1=" " ind2=" "> <subfield code="a">The creation of the data was supported by the project Towards Computational Lexical Semantic Change Detection funded by a project grant from the Swedish Research Council (2019–2022; dnr 2018-01184). It has also been created as part of the effort to construct and develop a Swedish national research infrastructure in support of research based on language data. 
This infrastructure -- Nationella språkbanken (the Swedish National Language Bank) -- is jointly funded for the period 2018--2024 by the Swedish Research Council (grant number 2017-00626) and its 10 partner institutions.</subfield> </datafield> <controlfield tag="001">3672950</controlfield> <datafield tag="700" ind1=" " ind2=" "> <subfield code="u">University of Helsinki</subfield> <subfield code="a">Hengchen, Simon</subfield> </datafield> <datafield tag="700" ind1=" " ind2=" "> <subfield code="u">IMS, University of Stuttgart</subfield> <subfield code="a">Schlechtweg, Dominik</subfield> </datafield> <datafield tag="700" ind1=" " ind2=" "> <subfield code="u">The Alan Turing Institute</subfield> <subfield code="a">McGillivray, Barbara</subfield> </datafield> <datafield tag="700" ind1=" " ind2=" "> <subfield code="u">University of Cambridge</subfield> <subfield code="a">Dubossarsky, Haim</subfield> </datafield> <datafield tag="856" ind1="4" ind2=" "> <subfield code="s">442719439</subfield> <subfield code="z">md5:02ccc30b1a340d97eff255df3451efc9</subfield> <subfield code="u">https://zenodo.org/record/3672950/files/semeval2020_ulscd_swe.zip</subfield> </datafield> <datafield tag="542" ind1=" " ind2=" "> <subfield code="l">open</subfield> </datafield> <datafield tag="260" ind1=" " ind2=" "> <subfield code="c">2020-02-19</subfield> </datafield> <datafield tag="909" ind1="C" ind2="O"> <subfield code="p">openaire_data</subfield> <subfield code="p">user-natural-language-processing</subfield> <subfield code="o">oai:zenodo.org:3672950</subfield> </datafield> <datafield tag="100" ind1=" " ind2=" "> <subfield code="u">Språkbanken, University of Gothenburg</subfield> <subfield code="a">Tahmasebi, Nina</subfield> </datafield> <datafield tag="245" ind1=" " ind2=" "> <subfield code="a">Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</subfield> </datafield> <datafield tag="980" ind1=" " ind2=" "> <subfield 
code="a">user-natural-language-processing</subfield> </datafield> <datafield tag="540" ind1=" " ind2=" "> <subfield code="u">https://creativecommons.org/licenses/by/2.0/legalcode</subfield> <subfield code="a">Creative Commons Attribution 2.0 Generic</subfield> </datafield> <datafield tag="650" ind1="1" ind2="7"> <subfield code="a">cc-by</subfield> <subfield code="2">opendefinition.org</subfield> </datafield> <datafield tag="520" ind1=" " ind2=" "> <subfield code="a"><p>This data collection contains the Swedish test data for <a href="https://languagechange.org/semeval">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection:</a></p> <p>- a Swedish text corpus pair (`corpus1/`, `corpus2/`)<br> - 31 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)</p> <p>We sample from the KubHist2 corpus, digitized by the National Library of Sweden, and available through the Spr&aring;kbanken corpus infrastructure Korp (<a href="https://www.researchgate.net/profile/Markus_Forsberg/publication/266352576_Korp_-_the_corpus_infrastructure_of_Sprakbanken/links/55bf1ee008aed621de121ba3/Korp-the-corpus-infrastructure-of-Sprakbanken.pdf">Borin et al., 2012</a>). The full corpus is available under a CC BY (attribution) license. Each word for which the lemmatizer in the Korp pipeline has found a lemma is replaced with the lemma. In cases where the lemmatizer cannot find a lemma, we leave the word as is (i.e., unlemmatized, no lower-casing). KubHist contains very frequent OCR errors, especially for the older data. More detail about the properties and quality of the Kubhist corpus can be found in (<a href="https://www.diva-portal.org/smash/get/diva2:1358014/FULLTEXT01.pdf#page=28">Adesam et al., 2019</a>).</p> <p>Lars Borin, Markus Forsberg, and Johan Roxendal. &quot;Korp-the corpus infrastructure of Spr&aring;kbanken.&quot; <em>LREC</em>. 2012.</p> <p>Adesam, Yvonne, Dana Dann&eacute;lls, and Nina Tahmasebi. 
&quot;Exploring the Quality of the Digital Historical Newspaper Archive KubHist.&quot; <em>DHN</em>. 2019.</p> <p>Corpus 1</p> <p>- based on: <a href="https://spraakbanken.gu.se/korp/?mode=kubhist">Kubhist2</a><br> - language: Swedish<br> - time covered: 1790-1830<br> - size: ~71 million tokens<br> - format: lemmatized, sentence length &gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled<br> - encoding: UTF-8<br> - note: contains frequent OCR errors</p> <p>Corpus 2</p> <p>- based on:&nbsp;<a href="https://spraakbanken.gu.se/korp/?mode=kubhist">Kubhist2</a><br> - language: Swedish<br> - time covered: 1895-1903<br> - size: ~111 million tokens<br> - format: lemmatized, sentence length &gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled<br> - encoding: UTF-8<br> - note: contains OCR errors</p> <p>Find more information on the data and SemEval2020 Task 1 in the paper referenced below.</p> <p>Reference:</p> <p>Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. <a href="https://competitions.codalab.org/competitions/20948">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</a>. To appear in SemEval@COLING2020.</p></subfield> </datafield> <datafield tag="773" ind1=" " ind2=" "> <subfield code="n">doi</subfield> <subfield code="i">isVersionOf</subfield> <subfield code="a">10.5281/zenodo.3672949</subfield> </datafield> <datafield tag="024" ind1=" " ind2=" "> <subfield code="a">10.5281/zenodo.3672950</subfield> <subfield code="2">doi</subfield> </datafield> <datafield tag="980" ind1=" " ind2=" "> <subfield code="a">dataset</subfield> </datafield> </record>
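The record's 856 field lists an MD5 checksum and a size value for `semeval2020_ulscd_swe.zip`, so a downloaded copy can be checked against them. The following is a minimal sketch, not part of the dataset or of any Zenodo tooling: the function names `md5_of_file` and `verify_archive` are hypothetical, and it assumes that subfield `s` gives the file size in bytes.

```python
import hashlib
from pathlib import Path

# Values copied from the 856 datafield of the record above.
EXPECTED_MD5 = "02ccc30b1a340d97eff255df3451efc9"
EXPECTED_SIZE = 442719439  # assumption: subfield "s" is the size in bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file matches both the expected size and MD5 digest."""
    path = Path(path)
    # Compare the cheap size check first, then the full digest.
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5
```

Calling `verify_archive("semeval2020_ulscd_swe.zip")` on a local copy should return `True` if the download is complete and unmodified, provided the size assumption above holds.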
Metric | All versions | This version |
---|---|---|
Views | 1,407 | 855 |
Downloads | 3,741 | 959 |
Data volume | 3.2 TB | 424.6 GB |
Unique views | 1,230 | 783 |
Unique downloads | 3,233 | 550 |