There is a newer version of this record available.

Dataset Open Access

Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection

Tahmasebi, Nina; Hengchen, Simon; Schlechtweg, Dominik; McGillivray, Barbara; Dubossarsky, Haim


JSON-LD (schema.org) Export

{
  "inLanguage": {
    "alternateName": "swe", 
    "@type": "Language", 
    "name": "Swedish"
  }, 
  "description": "<p>This data collection contains the Swedish test data for <a href=\"https://languagechange.org/semeval\">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</a>:</p>\n\n<p>- a Swedish text corpus pair (`corpus1/`, `corpus2/`)<br>\n- 31 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)</p>\n\n<p>We sample from the KubHist2 corpus, digitized by the National Library of Sweden, and available through the Spr&aring;kbanken corpus infrastructure Korp (<a href=\"https://www.researchgate.net/profile/Markus_Forsberg/publication/266352576_Korp_-_the_corpus_infrastructure_of_Sprakbanken/links/55bf1ee008aed621de121ba3/Korp-the-corpus-infrastructure-of-Sprakbanken.pdf\">Borin et al., 2012</a>). The full corpus is available under a CC BY (attribution) license. Each word for which the lemmatizer in the Korp pipeline has found a lemma is replaced with the lemma. In cases where the lemmatizer cannot find a lemma, we leave the word as is (i.e., unlemmatized, no lower-casing). KubHist contains very frequent OCR errors, especially for the older data. More detail about the properties and quality of the Kubhist corpus can be found in (<a href=\"https://www.diva-portal.org/smash/get/diva2:1358014/FULLTEXT01.pdf#page=28\">Adesam et al., 2019</a>).</p>\n\n<p>Lars Borin, Markus Forsberg, and Johan Roxendal. &quot;Korp-the corpus infrastructure of Spr&aring;kbanken.&quot; <em>LREC</em>. 2012.</p>\n\n<p>Adesam, Yvonne, Dana Dann&eacute;lls, and Nina Tahmasebi. &quot;Exploring the Quality of the Digital Historical Newspaper Archive KubHist.&quot; <em>DHN</em>. 2019.</p>\n\n<p>Corpus 1</p>\n\n<p>- based on: <a href=\"https://spraakbanken.gu.se/korp/?mode=kubhist\">Kubhist2</a><br>\n- language: Swedish<br>\n- time covered: 1790-1830<br>\n- size: ~71 million tokens<br>\n- format: lemmatized, sentence length &gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled<br>\n- encoding: UTF-8<br>\n- note: contains frequent OCR errors</p>\n\n<p>Corpus 2</p>\n\n<p>- based on:&nbsp;<a href=\"https://spraakbanken.gu.se/korp/?mode=kubhist\">Kubhist2</a><br>\n- language: Swedish<br>\n- time covered: 1895-1903<br>\n- size: ~111 million tokens<br>\n- format: lemmatized, sentence length &gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled<br>\n- encoding: UTF-8<br>\n- note: contains OCR errors</p>\n\n<p>Find more information on the data and SemEval 2020 Task 1 in the paper referenced below.</p>\n\n<p>Reference:</p>\n\n<p>Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. <a href=\"https://competitions.codalab.org/competitions/20948\">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</a>. To appear in SemEval@COLING2020.</p>", 
  "license": "https://creativecommons.org/licenses/by/2.0/legalcode", 
  "creator": [
    {
      "affiliation": "Spr\u00e5kbanken, University of Gothenburg", 
      "@type": "Person", 
      "name": "Tahmasebi, Nina"
    }, 
    {
      "affiliation": "University of Helsinki", 
      "@type": "Person", 
      "name": "Hengchen, Simon"
    }, 
    {
      "affiliation": "IMS, University of Stuttgart", 
      "@type": "Person", 
      "name": "Schlechtweg, Dominik"
    }, 
    {
      "affiliation": "The Alan Turing Institute", 
      "@type": "Person", 
      "name": "McGillivray, Barbara"
    }, 
    {
      "affiliation": "University of Cambridge", 
      "@type": "Person", 
      "name": "Dubossarsky, Haim"
    }
  ], 
  "url": "https://zenodo.org/record/3672950", 
  "datePublished": "2020-02-19", 
  "version": "v1", 
  "keywords": [
    "unsupervised lexical semantic change detection", 
    "semantic change", 
    "SemEval2020", 
    "Kubhist2"
  ], 
  "@context": "https://schema.org/", 
  "distribution": [
    {
      "contentUrl": "https://zenodo.org/api/files/369b9e86-5d89-4a21-a99a-53875e4b2bb7/semeval2020_ulscd_swe.zip", 
      "encodingFormat": "zip", 
      "@type": "DataDownload"
    }
  ], 
  "identifier": "https://doi.org/10.5281/zenodo.3672950", 
  "@id": "https://doi.org/10.5281/zenodo.3672950", 
  "@type": "Dataset", 
  "name": "Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection"
}
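
The export above can be read with any JSON parser. The following is a minimal Python sketch, not part of the Zenodo record, that pulls out the fields a user would typically need to locate and cite the archive. The inline `RECORD_JSON` string is a trimmed copy of the record (description, license, and affiliations omitted), and the helper name `summarize_record` is hypothetical.

```python
import json

# Trimmed copy of the JSON-LD export above; only the fields used below are kept.
RECORD_JSON = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection",
  "version": "v1",
  "datePublished": "2020-02-19",
  "identifier": "https://doi.org/10.5281/zenodo.3672950",
  "creator": [
    {"@type": "Person", "name": "Tahmasebi, Nina"},
    {"@type": "Person", "name": "Hengchen, Simon"},
    {"@type": "Person", "name": "Schlechtweg, Dominik"},
    {"@type": "Person", "name": "McGillivray, Barbara"},
    {"@type": "Person", "name": "Dubossarsky, Haim"}
  ],
  "distribution": [
    {
      "@type": "DataDownload",
      "encodingFormat": "zip",
      "contentUrl": "https://zenodo.org/api/files/369b9e86-5d89-4a21-a99a-53875e4b2bb7/semeval2020_ulscd_swe.zip"
    }
  ]
}
"""


def summarize_record(text):
    """Hypothetical helper: extract name, version, DOI, creators, and download URLs."""
    record = json.loads(text)

    # The identifier is stored as a resolver URL; strip the prefix to get the bare DOI.
    doi_url = record["identifier"]
    prefix = "https://doi.org/"
    doi = doi_url[len(prefix):] if doi_url.startswith(prefix) else doi_url

    # Keep only distribution entries that describe a downloadable file.
    download_urls = [
        entry["contentUrl"]
        for entry in record.get("distribution", [])
        if entry.get("@type") == "DataDownload"
    ]

    return {
        "name": record["name"],
        "version": record.get("version"),
        "doi": doi,
        "creators": [person["name"] for person in record.get("creator", [])],
        "download_urls": download_urls,
    }


summary = summarize_record(RECORD_JSON)
print(summary["doi"])
for url in summary["download_urls"]:
    print(url)
```

The sketch only parses metadata; it does not download or unpack the archive, and it makes no assumption about how files are laid out inside `corpus1/` and `corpus2/`.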
                   All versions   This version
Views              1,407          855
Downloads          3,741          959
Data volume        3.2 TB         424.6 GB
Unique views       1,230          783
Unique downloads   3,233          550
