Dataset Open Access
Tahmasebi, Nina; Hengchen, Simon; Schlechtweg, Dominik; McGillivray, Barbara; Dubossarsky, Haim
<?xml version='1.0' encoding='utf-8'?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:adms="http://www.w3.org/ns/adms#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:dcat="http://www.w3.org/ns/dcat#" xmlns:duv="http://www.w3.org/ns/duv#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:frapo="http://purl.org/cerif/frapo/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:gsp="http://www.opengis.net/ont/geosparql#" xmlns:locn="http://www.w3.org/ns/locn#" xmlns:org="http://www.w3.org/ns/org#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:prov="http://www.w3.org/ns/prov#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:vcard="http://www.w3.org/2006/vcard/ns#" xmlns:wdrs="http://www.w3.org/2007/05/powder-s#"> <rdf:Description rdf:about="https://doi.org/10.5281/zenodo.3672950"> <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/> <dct:type rdf:resource="http://purl.org/dc/dcmitype/Dataset"/> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://doi.org/10.5281/zenodo.3672950</dct:identifier> <foaf:page rdf:resource="https://doi.org/10.5281/zenodo.3672950"/> <dct:creator> <rdf:Description> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <foaf:name>Tahmasebi, Nina</foaf:name> <foaf:givenName>Nina</foaf:givenName> <foaf:familyName>Tahmasebi</foaf:familyName> <org:memberOf> <foaf:Organization> <foaf:name>Språkbanken, University of Gothenburg</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:creator> <rdf:Description> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <foaf:name>Hengchen, Simon</foaf:name> <foaf:givenName>Simon</foaf:givenName> <foaf:familyName>Hengchen</foaf:familyName> <org:memberOf> <foaf:Organization> <foaf:name>University of Helsinki</foaf:name> 
</foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:creator> <rdf:Description> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <foaf:name>Schlechtweg, Dominik</foaf:name> <foaf:givenName>Dominik</foaf:givenName> <foaf:familyName>Schlechtweg</foaf:familyName> <org:memberOf> <foaf:Organization> <foaf:name>IMS, University of Stuttgart</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:creator> <rdf:Description> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <foaf:name>McGillivray, Barbara</foaf:name> <foaf:givenName>Barbara</foaf:givenName> <foaf:familyName>McGillivray</foaf:familyName> <org:memberOf> <foaf:Organization> <foaf:name>The Alan Turing Institute</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:creator> <rdf:Description> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <foaf:name>Dubossarsky, Haim</foaf:name> <foaf:givenName>Haim</foaf:givenName> <foaf:familyName>Dubossarsky</foaf:familyName> <org:memberOf> <foaf:Organization> <foaf:name>University of Cambridge</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:title>Swedish Test Data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</dct:title> <dct:publisher> <foaf:Agent> <foaf:name>Zenodo</foaf:name> </foaf:Agent> </dct:publisher> <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">2020</dct:issued> <dcat:keyword>unsupervised lexical semantic change detection, semantic change, SemEval2020, Kubhist2</dcat:keyword> <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2020-02-19</dct:issued> <dct:language rdf:resource="http://publications.europa.eu/resource/authority/language/SWE"/> <owl:sameAs rdf:resource="https://zenodo.org/record/3672950"/> <adms:identifier> <adms:Identifier> <skos:notation 
rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://zenodo.org/record/3672950</skos:notation> <adms:schemeAgency>url</adms:schemeAgency> </adms:Identifier> </adms:identifier> <dct:isVersionOf rdf:resource="https://doi.org/10.5281/zenodo.3672949"/> <dct:isPartOf rdf:resource="https://zenodo.org/communities/natural-language-processing"/> <owl:versionInfo>v1</owl:versionInfo> <dct:description><p>This data collection contains the Swedish test data for <a href="https://languagechange.org/semeval">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection:</a></p> <p>- a Swedish text corpus pair (`corpus1/`, `corpus2/`)<br> - 31 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)</p> <p>We sample from the KubHist2 corpus, digitized by the National Library of Sweden, and available through the Spr&aring;kbanken corpus infrastructure Korp (<a href="https://www.researchgate.net/profile/Markus_Forsberg/publication/266352576_Korp_-_the_corpus_infrastructure_of_Sprakbanken/links/55bf1ee008aed621de121ba3/Korp-the-corpus-infrastructure-of-Sprakbanken.pdf">Borin et al., 2012</a>). The full corpus is available under a CC BY (attribution) license. Each word for which the lemmatizer in the Korp pipeline has found a lemma is replaced with the lemma. In cases where the lemmatizer cannot find a lemma, we leave the word as is (i.e., unlemmatized, no lower-casing). KubHist contains very frequent OCR errors, especially for the older data. More detail about the properties and quality of the Kubhist corpus can be found in (<a href="https://www.diva-portal.org/smash/get/diva2:1358014/FULLTEXT01.pdf#page=28">Adesam et al., 2019</a>).</p> <p>Lars Borin, Markus Forsberg, and Johan Roxendal. &quot;Korp-the corpus infrastructure of Spr&aring;kbanken.&quot; <em>LREC</em>. 2012.</p> <p>Adesam, Yvonne, Dana Dann&eacute;lls, and Nina Tahmasebi. 
&quot;Exploring the Quality of the Digital Historical Newspaper Archive KubHist.&quot; <em>DHN</em>. 2019.</p> <p>Corpus 1</p> <p>- based on: <a href="https://spraakbanken.gu.se/korp/?mode=kubhist">Kubhist2</a><br> - language: Swedish<br> - time covered: 1790-1830<br> - size: ~71 million tokens<br> - format: lemmatized, sentence length &gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled<br> - encoding: UTF-8<br> - note: contains frequent OCR errors</p> <p>Corpus 2</p> <p>- based on:&nbsp;<a href="https://spraakbanken.gu.se/korp/?mode=kubhist">Kubhist2</a><br> - language: Swedish<br> - time covered: 1895-1903<br> - size: ~111 million tokens<br> - format: lemmatized, sentence length &gt; 9 (before removal of punctuation), no punctuation, sentences randomly shuffled<br> - encoding: UTF-8<br> - note: contains OCR errors</p> <p>Find more information on the data and SemEval2020 Task 1 in the paper referenced below.</p> <p>Reference:</p> <p>Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. <a href="https://competitions.codalab.org/competitions/20948">SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection</a>. To appear in SemEval@COLING2020.</p></dct:description> <dct:description>The creation of the data was supported by the project Towards Computational Lexical Semantic Change Detection funded by a project grant from the Swedish Research Council (2019–2022; dnr 2018-01184). It has also been created as part of the effort to construct and develop a Swedish national research infrastructure in support of research based on language data. 
This infrastructure -- Nationella språkbanken (the Swedish National Language Bank) -- is jointly funded for the period 2018--2024 by the Swedish Research Council (grant number 2017-00626) and its 10 partner institutions.</dct:description> <dct:description>{"references": ["Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020."]}</dct:description> <dct:accessRights rdf:resource="http://publications.europa.eu/resource/authority/access-right/PUBLIC"/> <dct:accessRights> <dct:RightsStatement rdf:about="info:eu-repo/semantics/openAccess"> <rdfs:label>Open Access</rdfs:label> </dct:RightsStatement> </dct:accessRights> <dcat:distribution> <dcat:Distribution> <dct:license rdf:resource="https://creativecommons.org/licenses/by/2.0/legalcode"/> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.3672950"/> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.3672950"/> <dcat:byteSize>442719439</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/3672950/files/semeval2020_ulscd_swe.zip"/> <dcat:mediaType>application/zip</dcat:mediaType> </dcat:Distribution> </dcat:distribution> </rdf:Description> </rdf:RDF>
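
The record above exposes the downloadable archive through a `dcat:Distribution` element. As a minimal sketch of how that part of the metadata could be read programmatically, the Python below parses a trimmed copy of the distribution block with the standard library. The fragment is cut down by hand from the record, and the names `FRAGMENT` and `list_downloads` are hypothetical, introduced only for illustration. The full record as displayed embeds HTML in its description, so this sketch assumes it may not parse as-is.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used by the DCAT export.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcat": "http://www.w3.org/ns/dcat#",
}

# Hand-trimmed copy of the record's file distribution (values taken from the record).
FRAGMENT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#">
  <rdf:Description rdf:about="https://doi.org/10.5281/zenodo.3672950">
    <dcat:distribution>
      <dcat:Distribution>
        <dcat:byteSize>442719439</dcat:byteSize>
        <dcat:downloadURL rdf:resource="https://zenodo.org/record/3672950/files/semeval2020_ulscd_swe.zip"/>
        <dcat:mediaType>application/zip</dcat:mediaType>
      </dcat:Distribution>
    </dcat:distribution>
  </rdf:Description>
</rdf:RDF>"""


def list_downloads(rdf_xml):
    """Return one dict per dcat:Distribution that carries a downloadURL."""
    root = ET.fromstring(rdf_xml)
    resource_attr = "{%s}resource" % NS["rdf"]
    downloads = []
    for dist in root.iterfind(".//dcat:Distribution", NS):
        url_el = dist.find("dcat:downloadURL", NS)
        if url_el is None:
            # Distributions without a file (e.g. license-only entries) are skipped.
            continue
        size_el = dist.find("dcat:byteSize", NS)
        media_el = dist.find("dcat:mediaType", NS)
        downloads.append({
            "url": url_el.get(resource_attr),
            "bytes": int(size_el.text) if size_el is not None else None,
            "media_type": media_el.text if media_el is not None else None,
        })
    return downloads


if __name__ == "__main__":
    for item in list_downloads(FRAGMENT):
        print(item["url"], item["bytes"], item["media_type"])
```

Skipping distributions without a `dcat:downloadURL` reflects the record's own layout, where the license is attached to a separate distribution entry that has no file.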
Metric | All versions | This version |
---|---|---|
Views | 1,408 | 856 |
Downloads | 3,841 | 959 |
Data volume | 3.3 TB | 424.6 GB |
Unique views | 1,231 | 784 |
Unique downloads | 3,331 | 550 |