Dataset Open Access

Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection

Tahmasebi, Nina; Hengchen, Simon; Schlechtweg, Dominik; McGillivray, Barbara; Dubossarsky, Haim


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="999" ind1="C" ind2="5">
    <subfield code="x">Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020.</subfield>
  </datafield>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">swe</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">unsupervised lexical semantic change detection, semantic change, SemEval2020, Kubhist2</subfield>
  </datafield>
  <controlfield tag="005">20200629115428.0</controlfield>
  <datafield tag="500" ind1=" " ind2=" ">
    <subfield code="a">The creation of the data was supported by the project Towards Computational Lexical Semantic Change Detection funded by a project grant from the Swedish Research Council (2019–2022; dnr 2018-01184).
It has also been created as part of the effort to construct and develop a Swedish national research infrastructure in support of research based on language data. This infrastructure -- Nationella språkbanken (the Swedish National Language Bank) -- is jointly funded for the period 2018--2024 by the Swedish Research Council (grant number 2017-00626) and its 10 partner institutions.</subfield>
  </datafield>
  <controlfield tag="001">3730550</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University of Helsinki</subfield>
    <subfield code="a">Hengchen, Simon</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">IMS, University of Stuttgart</subfield>
    <subfield code="a">Schlechtweg, Dominik</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">The Alan Turing Institute</subfield>
    <subfield code="a">McGillivray, Barbara</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">University of Cambridge</subfield>
    <subfield code="a">Dubossarsky, Haim</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">1002486930</subfield>
    <subfield code="z">md5:47eb5678bbd6483969ffd904cb5e9ca8</subfield>
    <subfield code="u">https://zenodo.org/record/3730550/files/semeval2020_ulscd_swe.zip</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2020-02-19</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="p">user-natural-language-processing</subfield>
    <subfield code="o">oai:zenodo.org:3730550</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Språkbanken, University of Gothenburg</subfield>
    <subfield code="a">Tahmasebi, Nina</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-natural-language-processing</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/2.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 2.0 Generic</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;This data collection contains the Swedish test data for &lt;a href="https://competitions.codalab.org/competitions/20948"&gt;SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection:&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;- a Swedish text corpus pair (`corpus1/`, `corpus2/`)&lt;br&gt;
- 31 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)&lt;br&gt;
- the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`)&lt;/p&gt;

&lt;p&gt;We sample from the KubHist2 corpus, digitized by the National Library of Sweden, and available through the Spr&amp;aring;kbanken corpus infrastructure Korp (&lt;a href="https://www.researchgate.net/profile/Markus_Forsberg/publication/266352576_Korp_-_the_corpus_infrastructure_of_Sprakbanken/links/55bf1ee008aed621de121ba3/Korp-the-corpus-infrastructure-of-Sprakbanken.pdf"&gt;Borin et al., 2012&lt;/a&gt;). The full corpus is available through a CC BY (attribution) license. Each word for which the lemmatizer in the Korp pipeline has found a lemma is replaced with the lemma. In cases where the lemmatizer cannot find a lemma, we leave the word as is (i.e., unlemmatized, no lower-casing). KubHist contains very frequent OCR errors, especially for the older data. More detail about the properties and quality of the Kubhist corpus can be found in &lt;a href="https://www.diva-portal.org/smash/get/diva2:1358014/FULLTEXT01.pdf#page=28"&gt;Adesam et al. (2019)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Lars Borin, Markus Forsberg, and Johan Roxendal. &amp;quot;Korp-the corpus infrastructure of Spr&amp;aring;kbanken.&amp;quot; &lt;em&gt;LREC&lt;/em&gt;. 2012.&lt;/p&gt;

&lt;p&gt;Adesam, Yvonne, Dana Dann&amp;eacute;lls, and Nina Tahmasebi. &amp;quot;Exploring the Quality of the Digital Historical Newspaper Archive KubHist.&amp;quot; &lt;em&gt;DHN&lt;/em&gt;. 2019.&lt;/p&gt;

&lt;p&gt;__Corpus 1__&lt;/p&gt;

&lt;p&gt;- based on: &lt;a href="https://spraakbanken.gu.se/korp/?mode=kubhist"&gt;Kubhist2&lt;/a&gt;&lt;br&gt;
- language: Swedish&lt;br&gt;
- time covered: 1790-1830&lt;br&gt;
- size: ~71 million tokens&lt;br&gt;
- format: lemmatized, sentence length &amp;gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled&lt;br&gt;
- encoding: UTF-8&lt;br&gt;
- note: contains frequent OCR errors&lt;/p&gt;

&lt;p&gt;__Corpus 2__&lt;/p&gt;

&lt;p&gt;- based on:&amp;nbsp;&lt;a href="https://spraakbanken.gu.se/korp/?mode=kubhist"&gt;Kubhist2&lt;/a&gt;&lt;br&gt;
- language: Swedish&lt;br&gt;
- time covered: 1895-1903&lt;br&gt;
- size: ~111 million tokens&lt;br&gt;
- format: lemmatized, sentence length &amp;gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled&lt;br&gt;
- encoding: UTF-8&lt;br&gt;
- note: contains OCR errors&lt;/p&gt;

&lt;p&gt;Besides the official lemma version of the corpora for SemEval-2020 Task 1, we also provide the raw token version (`corpus1/token/`, `corpus2/token/`). It contains the raw sentences in the same order as in the lemma version. More information on the data and SemEval-2020 Task 1 can be found in the paper referenced below.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;Reference:&lt;/p&gt;

&lt;p&gt;Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. &lt;a href="https://competitions.codalab.org/competitions/20948"&gt;SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection&lt;/a&gt;. To appear in SemEval@COLING2020.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.3672949</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.3730550</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
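
The record's 856 datafield carries the download URL (subfield `u`), the file size in bytes (subfield `s`), and an MD5 checksum (subfield `z`). A minimal sketch of how a downloaded archive could be checked against those values, assuming the export above has been saved as a string or file; the helper names `file_info` and `md5_of` are hypothetical and not part of any Zenodo tooling:

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace declared on the <record> element of the MARC21 slim export.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def file_info(marc_xml):
    """Return the URL, size in bytes, and MD5 hex digest from the 856 datafield.

    Hypothetical helper: assumes subfields 'u', 's', and 'z' are all present,
    as they are in the record shown above.
    """
    root = ET.fromstring(marc_xml)
    field = root.find(".//marc:datafield[@tag='856']", MARC_NS)
    if field is None:
        raise ValueError("record has no 856 datafield")
    codes = {
        sf.get("code"): (sf.text or "").strip()
        for sf in field.findall("marc:subfield", MARC_NS)
    }
    return {
        "url": codes["u"],
        "size": int(codes["s"]),
        # The checksum is stored as 'md5:<hex>'; keep only the hex part.
        "md5": codes["z"].split(":", 1)[-1],
    }


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With the archive downloaded locally, comparing `md5_of("semeval2020_ulscd_swe.zip")` with `file_info(record_xml)["md5"]` would indicate whether the file matches the checksum in the record. Reading in chunks keeps memory use small for an archive of roughly 1 GB.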