Dataset Open Access

Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection

Tahmasebi, Nina; Hengchen, Simon; Schlechtweg, Dominik; McGillivray, Barbara; Dubossarsky, Haim


DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
  <identifier identifierType="DOI">10.5281/zenodo.3730550</identifier>
  <creators>
    <creator>
      <creatorName>Tahmasebi, Nina</creatorName>
      <givenName>Nina</givenName>
      <familyName>Tahmasebi</familyName>
      <affiliation>Språkbanken, University of Gothenburg</affiliation>
    </creator>
    <creator>
      <creatorName>Hengchen, Simon</creatorName>
      <givenName>Simon</givenName>
      <familyName>Hengchen</familyName>
      <affiliation>University of Helsinki</affiliation>
    </creator>
    <creator>
      <creatorName>Schlechtweg, Dominik</creatorName>
      <givenName>Dominik</givenName>
      <familyName>Schlechtweg</familyName>
      <affiliation>IMS, University of Stuttgart</affiliation>
    </creator>
    <creator>
      <creatorName>McGillivray, Barbara</creatorName>
      <givenName>Barbara</givenName>
      <familyName>McGillivray</familyName>
      <affiliation>The Alan Turing Institute</affiliation>
    </creator>
    <creator>
      <creatorName>Dubossarsky, Haim</creatorName>
      <givenName>Haim</givenName>
      <familyName>Dubossarsky</familyName>
      <affiliation>University of Cambridge</affiliation>
    </creator>
  </creators>
  <titles>
    <title>Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</title>
  </titles>
  <publisher>Zenodo</publisher>
  <publicationYear>2020</publicationYear>
  <subjects>
    <subject>unsupervised lexical semantic change detection, semantic change, SemEval2020, Kubhist2</subject>
  </subjects>
  <dates>
    <date dateType="Issued">2020-02-19</date>
  </dates>
  <language>sv</language>
  <resourceType resourceTypeGeneral="Dataset"/>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/3730550</alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.3672949</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="URL" relationType="IsPartOf">https://zenodo.org/communities/natural-language-processing</relatedIdentifier>
  </relatedIdentifiers>
  <version>v2</version>
  <rightsList>
    <rights rightsURI="https://creativecommons.org/licenses/by/2.0/legalcode">Creative Commons Attribution 2.0 Generic</rights>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
  </rightsList>
  <descriptions>
    <description descriptionType="Abstract">&lt;p&gt;This data collection contains the Swedish test data for &lt;a href="https://competitions.codalab.org/competitions/20948"&gt;SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;- a Swedish text corpus pair (`corpus1/`, `corpus2/`)&lt;br&gt;
- 31 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)&lt;br&gt;
- the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`)&lt;/p&gt;

&lt;p&gt;We sample from the KubHist2 corpus, digitized by the National Library of Sweden and available through the Spr&amp;aring;kbanken corpus infrastructure Korp (&lt;a href="https://www.researchgate.net/profile/Markus_Forsberg/publication/266352576_Korp_-_the_corpus_infrastructure_of_Sprakbanken/links/55bf1ee008aed621de121ba3/Korp-the-corpus-infrastructure-of-Sprakbanken.pdf"&gt;Borin et al., 2012&lt;/a&gt;). The full corpus is available under a CC BY (attribution) license. Each word for which the lemmatizer in the Korp pipeline has found a lemma is replaced with that lemma. In cases where the lemmatizer cannot find a lemma, we leave the word as is (i.e., unlemmatized, no lower-casing). KubHist contains very frequent OCR errors, especially in the older data. More detail about the properties and quality of the KubHist corpus can be found in &lt;a href="https://www.diva-portal.org/smash/get/diva2:1358014/FULLTEXT01.pdf#page=28"&gt;Adesam et al. (2019)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Borin, Lars, Markus Forsberg, and Johan Roxendal. &amp;quot;Korp - the corpus infrastructure of Spr&amp;aring;kbanken.&amp;quot; &lt;em&gt;LREC&lt;/em&gt;. 2012.&lt;/p&gt;

&lt;p&gt;Adesam, Yvonne, Dana Dann&amp;eacute;lls, and Nina Tahmasebi. &amp;quot;Exploring the Quality of the Digital Historical Newspaper Archive KubHist.&amp;quot; &lt;em&gt;DHN&lt;/em&gt;. 2019.&lt;/p&gt;

&lt;p&gt;__Corpus 1__&lt;/p&gt;

&lt;p&gt;- based on: &lt;a href="https://spraakbanken.gu.se/korp/?mode=kubhist"&gt;Kubhist2&lt;/a&gt;&lt;br&gt;
- language: Swedish&lt;br&gt;
- time covered: 1790-1830&lt;br&gt;
- size: ~71 million tokens&lt;br&gt;
- format: lemmatized, sentence length &amp;gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled&lt;br&gt;
- encoding: UTF-8&lt;br&gt;
- note: contains frequent OCR errors&lt;/p&gt;

&lt;p&gt;__Corpus 2__&lt;/p&gt;

&lt;p&gt;- based on:&amp;nbsp;&lt;a href="https://spraakbanken.gu.se/korp/?mode=kubhist"&gt;Kubhist2&lt;/a&gt;&lt;br&gt;
- language: Swedish&lt;br&gt;
- time covered: 1895-1903&lt;br&gt;
- size: ~111 million tokens&lt;br&gt;
- format: lemmatized, sentence length &amp;gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled&lt;br&gt;
- encoding: UTF-8&lt;br&gt;
- note: contains OCR errors&lt;/p&gt;

&lt;p&gt;In addition to the official lemma version of the corpora for SemEval-2020 Task 1, we also provide the raw token version (`corpus1/token/`, `corpus2/token/`), which contains the raw sentences in the same order as the lemma version. More information on the data and on SemEval-2020 Task 1 can be found in the paper referenced below.&lt;/p&gt;

&lt;p&gt;Reference:&lt;/p&gt;

&lt;p&gt;Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. &lt;a href="https://competitions.codalab.org/competitions/20948"&gt;SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection&lt;/a&gt;. To appear in SemEval@COLING2020.&lt;/p&gt;</description>
    <description descriptionType="Other">The creation of the data was supported by the project Towards Computational Lexical Semantic Change Detection, funded by a project grant from the Swedish Research Council (2019–2022; dnr 2018-01184).
The data was also created as part of the effort to construct and develop a Swedish national research infrastructure in support of research based on language data. This infrastructure, Nationella språkbanken (the Swedish National Language Bank), is jointly funded for the period 2018–2024 by the Swedish Research Council (grant number 2017-00626) and its 10 partner institutions.</description>
    <description descriptionType="Other">{"references": ["Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020."]}</description>
  </descriptions>
</resource>
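
The export above can be read programmatically. The following is a minimal sketch, not part of the original record, that pulls the DOI, title, and creator names out of a DataCite kernel-4 document using Python's standard library. The function name `summarize_datacite` and the abbreviated `SAMPLE` string are illustrative assumptions; only the namespace URI and element names are taken from the export itself.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <resource> root element of the export.
NS = {"dc": "http://datacite.org/schema/kernel-4"}


def summarize_datacite(xml_text):
    """Return the DOI, title, and creator names from a DataCite kernel-4 record.

    Hypothetical helper for illustration; it reads only three fields.
    """
    root = ET.fromstring(xml_text)
    creators = root.findall("dc:creators/dc:creator/dc:creatorName", NS)
    return {
        "doi": root.findtext("dc:identifier", namespaces=NS),
        "title": root.findtext("dc:titles/dc:title", namespaces=NS),
        "creators": [c.text for c in creators],
    }


# Abbreviated copy of the record above, included only to make the sketch runnable.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5281/zenodo.3730550</identifier>
  <creators>
    <creator><creatorName>Tahmasebi, Nina</creatorName></creator>
    <creator><creatorName>Hengchen, Simon</creatorName></creator>
  </creators>
  <titles>
    <title>Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</title>
  </titles>
</resource>"""

if __name__ == "__main__":
    info = summarize_datacite(SAMPLE)
    print(info["doi"])
    print(info["title"])
    for name in info["creators"]:
        print(name)
```

Because every element in the export lives in the default DataCite namespace, each path step needs the `dc:` prefix mapped through `NS`; unprefixed paths such as `creators/creator` would match nothing.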