A collection of Swedish diachronic word embedding models trained on historical newspaper data

Simon Hengchen, Nina Tahmasebi

Record: Zenodo 4301658 (oai:zenodo.org:4301658)
DOI: 10.5281/zenodo.4301658
Authors: Tahmasebi, Nina (ORCID 0000-0003-1688-1845), University of Gothenburg; Hengchen, Simon (ORCID 0000-0002-8453-7221), University of Gothenburg
Access: info:eu-repo/semantics/openAccess
License: Creative Commons Attribution 4.0 International (cc-by-4.0), https://creativecommons.org/licenses/by/4.0/legalcode

NOTE: this README.md is a summary. For all details, see the paper at <a href="https://doi.org/10.5334/johd.22">https://doi.org/10.5334/johd.22</a>

NOTE: this data release is available on Zenodo at <a href="https://zenodo.org/record/4301658">https://zenodo.org/record/4301658</a>

Description

This is the data release accompanying the Journal of Open Humanities Data paper "A collection of Swedish diachronic word embedding models trained on historical newspaper data." This paper describes the creation of several word embedding models based on a large collection of diachronic Swedish newspaper material available through Språkbanken Text, the Swedish language bank. This data was produced in the context of Språkbanken Text's continued mission to collaborate with humanities and natural language processing researchers and to provide freely available language resources, for the development of state-of-the-art NLP methods and tools.
Bibtex

If you use the models or the code provided in this paper, please cite the following:

<pre><code>@article{hengchen-tahmasebi-2021-collection,
    title = "A collection of {S}wedish diachronic word embedding models trained on historical newspaper data",
    author = "Hengchen, Simon and Tahmasebi, Nina",
    journal = "Journal of Open Humanities Data",
    year = "2021",
    pages = {1--7},
    volume = {7},
    number = {2},
    doi = {10.5334/johd.22}
}
</code></pre>

Overview

We release diachronic word2vec and fastText models in their skip-gram with negative sampling (SGNS) architecture. The models are trained on 20-year time bins, with two temporal alignment strategies: independently trained models for post-hoc alignment, and incremental training. The independently trained models are NOT aligned, leaving the <a href="https://github.com/Garrafao/LSCDetection/tree/master/alignment">choice of alignment</a> to the end user. We provide code to reproduce our pipeline, and code examples to load and use the models.

Data

The entirety of the Kungliga bibliotekets historiska tidningar (Kubhist 2) corpus was used. The original data was scanned and OCRed by the National Library of Sweden. The corpus consists of Swedish newspapers from all parts of Sweden, and has since been run through the Sparv annotation pipeline by Martin Hammarstedt at Språkbanken Text.

Preprocessing

The text was retrieved from the original XML. The processing steps prior to training the models are:

<ul>
<li>lowercasing</li>
<li>removal of digits</li>
<li>removal of all characters not belonging to the Swedish alphabet (a-zäåö)</li>
<li>removal of tokens that are two characters long or shorter</li>
<li>merging of all texts pertaining to the same double decade (1740-1759; 1760-1779; ...)</li>
</ul>

Quality control

All models have been queried for some control analogies by a native speaker of Swedish.
A reviewer (a non-native speaker of Swedish), whom we thank, also checked the local neighbourhoods of selected terms, performed vector arithmetic, and confirmed that the models behaved as expected.

Structure

<pre><code>ROOT/
    README.md
    code/
        *.py files
        requirements.txt
    fasttext/
        incremental/
            *.ft files
            *.npy files
        indep/
            *.ft files
            *.npy files
    word2vec/
        incremental/
            *.w2v files
            *.npy files
        indep/
            *.w2v files
            *.npy files
</code></pre>

Regarding the code:

<ul>
<li><code>kubhist_XML_to_gensim.py</code> will transform the XML into "LineSentence", "clean" corpora</li>
<li><code>train_w2v-ft.py</code> will train models</li>
<li><code>load_run_models.py</code> will print some examples of what can be done with embeddings</li>
<li><code>utils.py</code> contains the functions called by the scripts above</li>
<li><code>requirements.txt</code> contains the output of <code>pip freeze > requirements.txt</code>, i.e. the Python libraries needed to run the scripts above</li>
</ul>

Funding

This work has been funded in part by the project <a href="https://languagechange.org/">Towards Computational Lexical Semantic Change Detection</a> supported by the Swedish Research Council (2019-2022; dnr 2018-01184), and Nationella Språkbanken (the Swedish National Language Bank), jointly funded by the Swedish Research Council (2018-2024; dnr 2017-00626) and its 10 partner institutions, to NT.

Language: swe
Publisher: Zenodo
Publication date: 2020-12-01
Resource type: info:eu-repo/semantics/article
File: https://zenodo.org/records/4301658/files/HENGCHEN-TAHMASEBI_-_2020_-_Kubhist2_diachronic_embeddings.zip (16249535451 bytes, md5:26c174380d0c9c4bdf552bbe0e5ef325, open)
Is version of: doi 10.5281/zenodo.4274481
Journal: Journal of Open Humanities Data, 2020-12-01
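Preprocessing sketch

The steps listed under Preprocessing can be illustrated with a short Python sketch. This is not the code shipped in <code>kubhist_XML_to_gensim.py</code> or <code>utils.py</code>: the function names <code>clean_line</code> and <code>double_decade</code> are hypothetical, and the order of operations, the whitespace tokenisation, and the choice to delete unwanted characters inside a token (rather than split on them) are assumptions made for illustration.

```python
import re

# Assumption: "Swedish alphabet" means a-z plus å, ä, ö after lowercasing.
# Deleting everything else also removes digits and punctuation.
NOT_SWEDISH_LETTER = re.compile(r"[^a-zåäö]")


def clean_line(line: str) -> list[str]:
    """Lowercase a line, strip non-Swedish characters, drop short tokens."""
    tokens = []
    for raw in line.lower().split():
        token = NOT_SWEDISH_LETTER.sub("", raw)
        # Keep only tokens longer than two characters.
        if len(token) > 2:
            tokens.append(token)
    return tokens


def double_decade(year: int, start: int = 1740, width: int = 20) -> str:
    """Return the label of the 20-year bin containing `year` (bins start at 1740)."""
    lower = start + ((year - start) // width) * width
    return f"{lower}-{lower + width - 1}"


if __name__ == "__main__":
    print(clean_line("Stockholm den 3 Maj 1771: En ny Tidning på Åbo!"))
    print(double_decade(1771))
```

Texts whose year maps to the same <code>double_decade</code> label would then be written to the same corpus file, one cleaned sentence per line, before training.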