Dataset Open Access
Tahmasebi, Nina; Hengchen, Simon; Schlechtweg, Dominik; McGillivray, Barbara; Dubossarsky, Haim
Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection

Record details

- DOI: 10.5281/zenodo.3672950 (https://doi.org/10.5281/zenodo.3672950)
- Concept DOI (all versions): 10.5281/zenodo.3672949 (https://doi.org/10.5281/zenodo.3672949)
- Record: https://zenodo.org/record/3672950
- Version: v1, the first of 2 versions; latest version: https://zenodo.org/record/3730550
- Publication date: 2020-02-19 (record created 2020-02-18, last updated 2020-06-29)
- Resource type: Dataset
- Access: open
- License: CC-BY-2.0
- Language: swe
- Community: natural-language-processing
- Keywords: unsupervised lexical semantic change detection, semantic change, SemEval2020, Kubhist2

Creators

- Tahmasebi, Nina (Språkbanken, University of Gothenburg)
- Hengchen, Simon (University of Helsinki)
- Schlechtweg, Dominik (IMS, University of Stuttgart)
- McGillivray, Barbara (The Alan Turing Institute)
- Dubossarsky, Haim (University of Cambridge)

Files

- semeval2020_ulscd_swe.zip (zip, 442,719,439 bytes), md5:02ccc30b1a340d97eff255df3451efc9
  https://zenodo.org/api/files/369b9e86-5d89-4a21-a99a-53875e4b2bb7/semeval2020_ulscd_swe.zip

Notes

The creation of the data was supported by the project Towards Computational Lexical Semantic Change Detection, funded by a project grant from the Swedish Research Council (2019–2022; dnr 2018-01184). It has also been created as part of the effort to construct and develop a Swedish national research infrastructure in support of research based on language data. This infrastructure, Nationella språkbanken (the Swedish National Language Bank), is jointly funded for the period 2018–2024 by the Swedish Research Council (grant number 2017-00626) and its 10 partner institutions.

Description

This data collection contains the Swedish test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection (https://languagechange.org/semeval):

- a Swedish text corpus pair (`corpus1/`, `corpus2/`)
- 31 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)

We sample from the KubHist2 corpus, digitized by the National Library of Sweden and available through the Språkbanken corpus infrastructure Korp (Borin et al., 2012). The full corpus is available under a CC BY (attribution) license. Each word for which the lemmatizer in the Korp pipeline has found a lemma is replaced with that lemma. Where the lemmatizer cannot find a lemma, we leave the word as is (i.e., unlemmatized and not lower-cased). KubHist contains very frequent OCR errors, especially in the older data. More detail about the properties and quality of the KubHist corpus can be found in Adesam et al. (2019).

- Borin, Lars, Markus Forsberg, and Johan Roxendal. "Korp - the corpus infrastructure of Språkbanken." LREC. 2012. https://www.researchgate.net/profile/Markus_Forsberg/publication/266352576_Korp_-_the_corpus_infrastructure_of_Sprakbanken/links/55bf1ee008aed621de121ba3/Korp-the-corpus-infrastructure-of-Sprakbanken.pdf
- Adesam, Yvonne, Dana Dannélls, and Nina Tahmasebi. "Exploring the Quality of the Digital Historical Newspaper Archive KubHist." DHN. 2019. https://www.diva-portal.org/smash/get/diva2:1358014/FULLTEXT01.pdf#page=28

Corpus 1

- based on: Kubhist2 (https://spraakbanken.gu.se/korp/?mode=kubhist)
- language: Swedish
- time covered: 1790-1830
- size: ~71 million tokens
- format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
- encoding: UTF-8
- note: contains frequent OCR errors

Corpus 2

- based on: Kubhist2 (https://spraakbanken.gu.se/korp/?mode=kubhist)
- language: Swedish
- time covered: 1895-1903
- size: ~111 million tokens
- format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
- encoding: UTF-8
- note: contains OCR errors

Find more information on the data and SemEval 2020 Task 1 in the paper referenced below.

Reference:

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection (https://competitions.codalab.org/competitions/20948). To appear in SemEval@COLING2020.
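Because the corpora described above are lemmatized and stripped of punctuation, target lemmas can be counted with plain whitespace tokenization. The following is a minimal sketch, not the task organizers' tooling: the function name `count_targets` and the sample sentences are hypothetical, and it assumes tokens are separated by whitespace. File names inside `corpus1/` and `corpus2/` are not stated in the record, so the function takes any iterable of text lines rather than a path.

```python
from collections import Counter
from typing import Iterable


def count_targets(lines: Iterable[str], targets: Iterable[str]) -> Counter:
    """Count how often each target lemma occurs in whitespace-tokenized lines.

    Matching is exact and case-sensitive, since words without a lemma are
    left as-is (not lower-cased) in the corpora.
    """
    wanted = set(targets)
    # Start every target at zero so absent lemmas still appear in the result.
    counts = Counter({target: 0 for target in wanted})
    for line in lines:
        for token in line.split():
            if token in wanted:
                counts[token] += 1
    return counts


# Hypothetical in-memory sample; real input would be lines read from a
# UTF-8 corpus file, e.g. via open(path, encoding="utf-8").
sample_lines = [
    "han skriva med krita på tavla i skola",
    "hon köpa krita och papper i stad i går",
]
print(count_targets(sample_lines, ["krita", "tavla", "motiv"]))
```

Running the same count over both corpora gives a quick frequency check for each entry in `targets.txt` before any change-detection model is applied.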
Statistic | All versions | This version |
---|---|---|
Views | 1,409 | 857 |
Downloads | 3,982 | 959 |
Data volume | 3.5 TB | 424.6 GB |
Unique views | 1,232 | 785 |
Unique downloads | 3,472 | 550 |
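The record metadata lists an MD5 checksum (md5:02ccc30b1a340d97eff255df3451efc9) for semeval2020_ulscd_swe.zip, so a downloaded copy can be checked for corruption. The sketch below uses Python's standard `hashlib`; the helper names `md5_of_file` and `matches_expected` are illustrative, and the local path passed to them is whatever location the archive was saved to.

```python
import hashlib

# MD5 value taken from the record metadata for semeval2020_ulscd_swe.zip.
EXPECTED_MD5 = "02ccc30b1a340d97eff255df3451efc9"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the ~443 MB archive is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Check whether the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("semeval2020_ulscd_swe.zip")` should return `True` for an intact download saved under that name in the working directory. MD5 is suitable here only as an integrity check against transfer errors, not as a security guarantee.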