A collection of Swedish diachronic word embedding models trained on historical newspaper data
Simon Hengchen, Nina Tahmasebi
NOTE: this README.md is a summary. For all details, see the paper at https://doi.org/10.5334/johd.22
NOTE: this data release is available on Zenodo at https://zenodo.org/record/4301658
Description
This is the data release accompanying the Journal of Open Humanities Data paper "A collection of Swedish diachronic word embedding models trained on historical newspaper data." This paper describes the creation of several word embedding models based on a large collection of diachronic Swedish newspaper material available through Språkbanken Text, the Swedish language bank. This data was produced in the context of Språkbanken Text's continued mission to collaborate with humanities and natural language processing researchers and to provide freely available language resources, for the development of state-of-the-art NLP methods and tools.
Bibtex
If you use the models or the code provided in this paper, please cite the following:
@article{hengchen-tahmasebi-2021-collection,
title = "A collection of {S}wedish diachronic word embedding models trained on historical newspaper data",
author = "Hengchen, Simon and Tahmasebi, Nina",
journal = "Journal of Open Humanities Data",
year = "2021",
pages = {1--7},
volume = {7},
number = {2},
doi = {10.5334/johd.22}
}
Overview
We release diachronic word2vec and fastText models in their skip-gram with negative sampling (SGNS) architecture. The models are trained on 20-year time bins, with two temporal alignment strategies: independently-trained models for post-hoc alignment, and incremental training. The independently-trained models are NOT aligned, leaving the choice of alignment to the end user. We provide code to reproduce our pipeline, and code examples to load and use the models.
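Because the independently-trained models are released unaligned, vectors from different time bins are not directly comparable until the user aligns them. One common option is orthogonal Procrustes alignment over the shared vocabulary. The sketch below is a minimal, hypothetical example: the file paths are placeholders, and the attribute names assume gensim 4.x (older gensim versions use wv.vocab and wv.index2word instead; check the version pinned in code/requirements.txt).

```python
# Minimal sketch: orthogonal Procrustes alignment of two independently
# trained word2vec time bins (paths and gensim 4.x naming are assumptions).
import numpy as np
from gensim.models import Word2Vec

base = Word2Vec.load("word2vec/indep/1800-1819.w2v")    # placeholder path
other = Word2Vec.load("word2vec/indep/1820-1839.w2v")   # placeholder path

# Words present in both time bins, in a fixed order.
shared = [w for w in base.wv.index_to_key if w in other.wv.key_to_index]

A = np.vstack([base.wv[w] for w in shared])   # reference space
B = np.vstack([other.wv[w] for w in shared])  # space to be rotated

# Orthogonal Procrustes: find the rotation R minimising ||BR - A||.
u, _, vt = np.linalg.svd(B.T @ A)
rotation = u @ vt

# Rotate *all* vectors of the second model into the reference space.
other.wv.vectors = other.wv.vectors @ rotation
```

After the rotation, cosine similarities between vectors from the two bins become meaningful for cross-temporal comparisons.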
Data
The entirety of the Kungliga bibliotekets historiska tidningar (Kubhist 2) corpus was used. The original data was scanned and OCRed by the National Library of Sweden. It consists of Swedish newspapers from all parts of Sweden. It has since been run through the Sparv annotation pipeline by Martin Hammarstedt at Språkbanken Text.
Preprocessing
The text was retrieved from the original XML. The processing steps prior to training the models are:
- lowercasing
- removal of digits
- removal of all characters not belonging to the Swedish alphabet (a-zäåö)
- removal of tokens of two characters or fewer
- merging of all texts pertaining to the same double decade (1740-1759; 1760-1779; ...)
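A minimal sketch of these steps, applied to a single line of raw text, is shown below. It is illustrative only: the canonical implementation is kubhist_XML_to_gensim.py under code/, and the exact order of operations and whitespace handling here are assumptions.

```python
# Illustrative reimplementation of the preprocessing steps above for one
# line of text; the released pipeline is kubhist_XML_to_gensim.py.
import re

def preprocess(line):
    line = line.lower()                               # lowercasing
    line = re.sub(r"\d+", "", line)                   # removal of digits
    line = re.sub(r"[^a-zäåö\s]", "", line)           # keep only a-zäåö (and spaces)
    return [t for t in line.split() if len(t) > 2]    # drop tokens of <= 2 characters

print(preprocess("Äfven år 1848 köpte han 3 st. hästar!"))
# ['äfven', 'köpte', 'han', 'hästar']
```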
Quality control
All models have been queried for a number of control analogies by a native speaker of Swedish. A reviewer (a non-native speaker of Swedish), whom we thank, also checked the local neighbourhoods of selected terms, performed vector arithmetic, and confirmed that the models behave as expected.
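For readers who want to run similar sanity checks, the snippet below shows neighbourhood and analogy queries with gensim; the model path and query words are placeholders rather than the actual tests that were performed.

```python
# Illustrative neighbourhood and analogy queries (placeholder path and words).
from gensim.models import Word2Vec

model = Word2Vec.load("word2vec/indep/1880-1899.w2v")

# Local neighbourhood of a selected term.
print(model.wv.most_similar("järnväg", topn=10))

# Vector arithmetic: kung - man + kvinna should land near drottning
# if the model behaves as expected.
print(model.wv.most_similar(positive=["kung", "kvinna"], negative=["man"], topn=5))
```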
Structure
ROOT/
    README.md
    code/
        *.py files
        requirements.txt
    fasttext/
        incremental/
            *.ft files
            *.npy files
        indep/
            *.ft files
            *.npy files
    word2vec/
        incremental/
            *.w2v files
            *.npy files
        indep/
            *.w2v files
            *.npy files
Regarding the code:
- kubhist_XML_to_gensim.py will transform the XML into "LineSentence", "clean" corpora
- train_w2v-ft.py will train the models
- load_run_models.py will print some examples of what can be done with the embeddings
- utils.py contains the functions called by the scripts above
- requirements.txt contains the output of pip freeze > requirements.txt, i.e. the Python libraries needed to run the scripts above
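As a starting point, a hypothetical loading snippet is given below; load_run_models.py in the release contains the authors' own examples, and the paths here are placeholders. Dependencies can be installed with pip install -r code/requirements.txt.

```python
# Hypothetical loading sketch; see load_run_models.py for the released examples.
from gensim.models import FastText, Word2Vec

w2v = Word2Vec.load("word2vec/incremental/1860-1879.w2v")  # placeholder path
ft = FastText.load("fasttext/incremental/1860-1879.ft")    # placeholder path

# The .npy files are gensim's companion arrays for large models and must stay
# in the same directory as the .w2v / .ft files for loading to succeed.
print(w2v.wv.most_similar("ångbåt", topn=5))
print(ft.wv.most_similar("ångbåt", topn=5))  # fastText also handles unseen word forms
```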
Funding
This work has been funded in part by the project Towards Computational Lexical Semantic Change Detection supported by the Swedish Research Council (2019--2022; dnr 2018-01184), and Nationella Språkbanken (the Swedish National Language Bank) -- jointly funded by the Swedish Research Council (2018--2024; dnr 2017-00626) and its 10 partner institutions, to NT.
Files
HENGCHEN-TAHMASEBI_-_2020_-_Kubhist2_diachronic_embeddings.zip (16.2 GB; md5:26c174380d0c9c4bdf552bbe0e5ef325)