Dataset Open Access

Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection

Tahmasebi, Nina; Hengchen, Simon; Schlechtweg, Dominik; McGillivray, Barbara; Dubossarsky, Haim


JSON Export

{
  "files": [
    {
      "links": {
        "self": "https://zenodo.org/api/files/98ecaeee-a5db-474b-89de-d5e2acd8ee1e/semeval2020_ulscd_swe.zip"
      }, 
      "checksum": "md5:47eb5678bbd6483969ffd904cb5e9ca8", 
      "bucket": "98ecaeee-a5db-474b-89de-d5e2acd8ee1e", 
      "key": "semeval2020_ulscd_swe.zip", 
      "type": "zip", 
      "size": 1002486930
    }
  ], 
  "owners": [
    68379
  ], 
  "doi": "10.5281/zenodo.3730550", 
  "stats": {
    "version_unique_downloads": 600.0, 
    "unique_views": 156.0, 
    "views": 188.0, 
    "version_views": 777.0, 
    "unique_downloads": 81.0, 
    "version_unique_views": 689.0, 
    "volume": 94233771420.0, 
    "version_downloads": 1016.0, 
    "downloads": 94.0, 
    "version_volume": 502421094178.0
  }, 
  "links": {
    "doi": "https://doi.org/10.5281/zenodo.3730550", 
    "conceptdoi": "https://doi.org/10.5281/zenodo.3672949", 
    "bucket": "https://zenodo.org/api/files/98ecaeee-a5db-474b-89de-d5e2acd8ee1e", 
    "conceptbadge": "https://zenodo.org/badge/doi/10.5281/zenodo.3672949.svg", 
    "html": "https://zenodo.org/record/3730550", 
    "latest_html": "https://zenodo.org/record/3730550", 
    "badge": "https://zenodo.org/badge/doi/10.5281/zenodo.3730550.svg", 
    "latest": "https://zenodo.org/api/records/3730550"
  }, 
  "conceptdoi": "10.5281/zenodo.3672949", 
  "created": "2020-03-27T09:00:49.259203+00:00", 
  "updated": "2020-06-29T11:54:28.762143+00:00", 
  "conceptrecid": "3672949", 
  "revision": 3, 
  "id": 3730550, 
  "metadata": {
    "access_right_category": "success", 
    "doi": "10.5281/zenodo.3730550", 
    "description": "<p>This data collection contains the Swedish test data for <a href=\"https://competitions.codalab.org/competitions/20948\">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection:</a></p>\n\n<p>- a Swedish text corpus pair (`corpus1/`, `corpus2/`)<br>\n- 31 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)<br>\n- the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`)</p>\n\n<p>We sample from the KubHist2 corpus, digitized by the National Library of Sweden, and available through the Spr&aring;kbanken corpus infrastructure Korp (<a href=\"https://www.researchgate.net/profile/Markus_Forsberg/publication/266352576_Korp_-_the_corpus_infrastructure_of_Sprakbanken/links/55bf1ee008aed621de121ba3/Korp-the-corpus-infrastructure-of-Sprakbanken.pdf\">Borin et al., 2012</a>). The full corpus is available through a CC BY (attribution) license. Each word for which the lemmatizer in the Korp pipelien has found a lemma is replaced with the lemma. In cases where the lemmatizer cannot find a lemma, we leave the word as is (i.e., unlemmatized, no lower-casing). KubHist contains very frequent OCR errors, especially for the older data.More detail about the properties and quality of the Kubhist corpus can be found in (<a href=\"https://www.diva-portal.org/smash/get/diva2:1358014/FULLTEXT01.pdf#page=28\">Adesam et al., 2019</a>).</p>\n\n<p>Lars Borin, Markus Forsberg, and Johan Roxendal. &quot;Korp-the corpus infrastructure of Spr&aring;kbanken.&quot; <em>LREC</em>. 2012.</p>\n\n<p>Adesam, Yvonne, Dana Dann&eacute;lls, and Nina Tahmasebi. &quot;Exploring the Quality of the Digital Historical Newspaper Archive KubHist.&quot; <em>DHN</em>. 
2019.</p>\n\n<p>__Corpus 1__</p>\n\n<p>- based on: <a href=\"https://spraakbanken.gu.se/korp/?mode=kubhist\">Kubhist2</a><br>\n- language: Swedish<br>\n- time covered: 1790-1830<br>\n- size: ~71 million tokens<br>\n- format: lemmatized, sentence length &gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled<br>\n- encoding: UTF-8<br>\n- note: contains frequent OCR errors</p>\n\n<p>__Corpus 2__</p>\n\n<p>- based on:&nbsp;<a href=\"https://spraakbanken.gu.se/korp/?mode=kubhist\">Kubhist2</a><br>\n- language: Swedish<br>\n- time covered: 1895-1903<br>\n- size: ~111 million tokens<br>\n- format: lemmatized, sentence length &gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled<br>\n- encoding: UTF-8<br>\n- note: contains OCR errors</p>\n\n<p>Besides the official lemma version of the corpora for SemEval-2020 Task 1 we also provide the raw token version (`corpus1/token/`, `corpus2/token/`). It contains the raw sentences in the same order as in the lemma version. Find more information on the data and SemEval-2020 Task 1 in the paper referenced below.</p>\n\n<p>&nbsp;</p>\n\n<p>Reference:</p>\n\n<p>Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi.<a href=\"https://competitions.codalab.org/competitions/20948\">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</a>. To appear in SemEval@COLING2020.</p>", 
    "language": "swe", 
    "title": "Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection", 
    "license": {
      "id": "CC-BY-2.0"
    }, 
    "notes": "The creation of the data was supported by the project Towards Computational Lexical Semantic Change Detection funded  by a project grant from the Swedish Research Council  (2019\u20132022;   dnr  2018-01184). \nIt has also been created as part of the effort to construct and develop a Swedish national research infrastructure in support of research based on language data. This infrastructure -- Nationella spr\u00e5kbanken (the Swedish National Language Bank) -- is jointly funded for the period 2018--2024 by the Swedish Research Council (grant number 2017-00626) and its 10 partner institutions.", 
    "relations": {
      "version": [
        {
          "count": 2, 
          "index": 1, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "3672949"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "3730550"
          }
        }
      ]
    }, 
    "communities": [
      {
        "id": "natural-language-processing"
      }
    ], 
    "version": "v2", 
    "references": [
      "Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi.SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020."
    ], 
    "keywords": [
      "unsupervised lexical semantic change detection, semantic change, SemEval2020, Kubhist2"
    ], 
    "publication_date": "2020-02-19", 
    "creators": [
      {
        "affiliation": "Spr\u00e5kbanken, University of Gothenburg", 
        "name": "Tahmasebi, Nina"
      }, 
      {
        "affiliation": "University of Helsinki", 
        "name": "Hengchen, Simon"
      }, 
      {
        "affiliation": "IMS, University of Stuttgart", 
        "name": "Schlechtweg, Dominik"
      }, 
      {
        "affiliation": "The Alan Turing Institute", 
        "name": "McGillivray, Barbara"
      }, 
      {
        "affiliation": "University of Cambridge", 
        "name": "Dubossarsky, Haim"
      }
    ], 
    "access_right": "open", 
    "resource_type": {
      "type": "dataset", 
      "title": "Dataset"
    }, 
    "related_identifiers": [
      {
        "scheme": "doi", 
        "identifier": "10.5281/zenodo.3672949", 
        "relation": "isVersionOf"
      }
    ]
  }
}
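
The `files` entry in the record above lists a `checksum` (`md5:...`) and a `size` in bytes for the archive, which suggests a simple integrity check after downloading. The following is a minimal sketch, not part of the Zenodo record itself: the function names (`parse_checksum`, `file_digest`, `verify_download`) are hypothetical, and it assumes the record JSON has already been loaded into a Python dict and the archive saved under its `key` name in a local directory.

```python
import hashlib
from pathlib import Path


def parse_checksum(field: str) -> tuple[str, str]:
    """Split a checksum string such as 'md5:47eb...' into (algorithm, hex digest)."""
    algorithm, _, digest = field.partition(":")
    return algorithm, digest.lower()


def file_digest(path: Path, algorithm: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a ~1 GB archive is never read into memory at once."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_download(record: dict, download_dir: Path) -> bool:
    """Return True only if every listed file matches its recorded size and checksum."""
    for entry in record["files"]:
        path = download_dir / entry["key"]
        algorithm, expected = parse_checksum(entry["checksum"])
        # Compare the cheap size check first, then the full digest.
        if path.stat().st_size != entry["size"]:
            return False
        if file_digest(path, algorithm) != expected:
            return False
    return True
```

With the export saved locally (for example as `record.json`, an assumed filename), `verify_download(json.load(open("record.json")), Path("."))` would check `semeval2020_ulscd_swe.zip` in the current directory against the recorded MD5 and byte size.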
                   All versions    This version
Views              777             188
Downloads          1,016           94
Data volume        502.4 GB        94.2 GB
Unique views       689             156
Unique downloads   600             81
