Journal article Open Access

A collection of Swedish diachronic word embedding models trained on historical newspaper data

Hengchen, Simon; Tahmasebi, Nina

NOTE: this README.md is a summary. For all details, see the paper at https://doi.org/10.5334/johd.22

NOTE: this data release is available on Zenodo at https://zenodo.org/record/4301658


This is the data release accompanying the Journal of Open Humanities Data paper "A collection of Swedish diachronic word embedding models trained on historical newspaper data." This paper describes the creation of several word embedding models based on a large collection of diachronic Swedish newspaper material available through Språkbanken Text, the Swedish language bank. This data was produced in the context of Språkbanken Text's continued mission to collaborate with humanities and natural language processing researchers and to provide freely available language resources, for the development of state-of-the-art NLP methods and tools.


If you use the models or the code provided in this paper, please cite the following:

title = "A collection of {S}wedish diachronic word embedding models trained on historical newspaper data", 
author = "Hengchen, Simon and Tahmasebi, Nina", 
journal = "Journal of Open Humanities Data", 
year = "2021",
pages = {1--7},
volume = {7},
number = {2},
doi = {10.5334/johd.22}


We release diachronic word2vec and fastText models in their skip-gram with negative sampling (SGNS) architecture. The models are trained on 20-year time bins, with two temporal alignment strategies: independently-trained models for post-hoc alignment, and incremental training. The independently-trained models are NOT aligned, leaving the choice of alignment to the end user. We provide code to reproduce our pipeline, and code examples to load and use the models.
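Since the independently-trained models are left unaligned, the sketch below shows one common post-hoc option, orthogonal Procrustes, using NumPy. It is an illustration only and not necessarily the method used in the paper. The function name `procrustes_align` and the toy matrices are assumptions; with real models, the rows of both matrices would first have to be restricted to the shared vocabulary, in the same word order.

```python
import numpy as np


def procrustes_align(base, other):
    """Rotate `other` onto `base` with an orthogonal map.

    Both arrays must have shape (n_words, dim), with row i referring to
    the same word in both (hypothetical helper, not part of the release).
    """
    # The SVD of the cross-covariance yields the orthogonal matrix R
    # that minimises the Frobenius norm of (other @ R - base).
    u, _, vt = np.linalg.svd(other.T @ base)
    rotation = u @ vt
    return other @ rotation, rotation


# Toy check: a randomly rotated copy of a matrix is mapped back onto it.
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 3))
random_rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
aligned, _ = procrustes_align(base, base @ random_rotation)
print(np.allclose(aligned, base))
```

An orthogonal map preserves cosine similarities within each model, which is why it is a frequent choice for comparing vectors across time bins.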


The entirety of the Kungliga bibliotekets historiska tidningar (Kubhist 2) corpus was used. The original data was scanned and OCRed by the National Library of Sweden, and consists of newspapers from all parts of Sweden. It has since been run through the Sparv annotation pipeline by Martin Hammarstedt at Språkbanken Text.


The text was retrieved from the original XML. The processing steps prior to training the models are:

  • lowercasing
  • removal of digits
  • removal of all characters not belonging to the Swedish alphabet (a-zäåö)
  • removal of tokens of two characters or fewer
  • merging of all texts pertaining to the same double decade (1740-1759; 1760-1779; ...)
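The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code shipped in the release: the names `clean_line` and `double_decade` are hypothetical, whitespace tokenisation is assumed, and characters outside the alphabet are assumed to be deleted rather than replaced by spaces. The bins are assumed to be anchored at 1740, following the examples in the list.

```python
import re

# Anything that is not a lowercase Swedish letter (assumed set: a-z, ä, å, ö).
NON_SWEDISH = re.compile(r"[^a-zäåö]")


def clean_line(line):
    """Lowercase, strip digits and non-alphabet characters, drop short tokens."""
    tokens = []
    for raw in line.lower().split():
        token = NON_SWEDISH.sub("", raw)  # also removes digits and punctuation
        if len(token) > 2:  # keep only tokens longer than two characters
            tokens.append(token)
    return tokens


def double_decade(year, start=1740, width=20):
    """Return the label of the 20-year bin a year falls into, e.g. '1740-1759'."""
    first = start + ((year - start) // width) * width
    return f"{first}-{first + width - 1}"


print(clean_line("Åke köpte 3 hästar i Göteborg år 1850!"))
print(double_decade(1850))
```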

Quality control

All models have been queried for some control analogies by a native speaker of Swedish. A reviewer who is not a native speaker of Swedish, whom we thank, also checked the local neighbourhoods of selected terms, performed vector arithmetic, and confirmed that the models behaved as expected.
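To make the kind of check described above concrete, here is a small sketch of a nearest-neighbour query and an analogy query using cosine similarity. The vectors are invented toy values, not taken from the released models, and the helper names `cosine`, `nearest` and `analogy` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, query, exclude=(), topn=3):
    """Rank the words in `vectors` by cosine similarity to the `query` vector."""
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def analogy(vectors, a, b, c, topn=1):
    """'a is to b as c is to ?', answered via the vector b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return nearest(vectors, target, exclude={a, b, c}, topn=topn)


# Invented three-dimensional toy vectors, for illustration only.
TOY = {
    "man": [1.0, 0.0, 0.2],
    "kvinna": [-1.0, 0.0, 0.2],
    "kung": [1.0, 1.0, 0.2],
    "drottning": [-1.0, 1.0, 0.2],
    "häst": [0.0, -1.0, 1.0],
}

print(nearest(TOY, TOY["kung"], exclude={"kung"}, topn=2))
print(analogy(TOY, "man", "kung", "kvinna"))
```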


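The release contains:

  • *.py files (the code)
  • two sets of fastText models (*.ft files), each with its accompanying *.npy files
  • two sets of word2vec models (*.w2v files), each with its accompanying *.npy files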

Regarding the code:

  • kubhist_XML_to_gensim.py will transform the XML into "LineSentence", "clean" corpora
  • train_w2v-ft.py will train models
  • load_run_models.py will print some examples of what can be done with embeddings
  • utils.py contains the functions called by the scripts above
  • requirements.txt contains the output of pip freeze > requirements.txt, i.e. the Python libraries needed to run the scripts above
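For readers unfamiliar with the "LineSentence" layout mentioned above: it is a plain-text file with one sentence per line and tokens separated by whitespace, which gensim can stream through its `gensim.models.word2vec.LineSentence` class. The sketch below writes and reads such a file using only the standard library. The function names and the sample sentences are assumptions for illustration and do not reproduce kubhist_XML_to_gensim.py.

```python
import os
import tempfile


def write_line_sentence(sentences, path):
    """Write tokenised sentences as one whitespace-separated sentence per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in sentences:
            handle.write(" ".join(tokens) + "\n")


def read_line_sentence(path):
    """Stream the sentences back as lists of tokens, skipping empty lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                yield tokens


# Round trip through a temporary file with two invented sentences.
sample = [["kungen", "reste", "till", "stockholm"], ["tidningen", "trycktes", "igår"]]
with tempfile.TemporaryDirectory() as folder:
    corpus_path = os.path.join(folder, "1840-1859.txt")  # hypothetical file name
    write_line_sentence(sample, corpus_path)
    print(list(read_line_sentence(corpus_path)))
```

Streaming the file line by line keeps memory use low, which matters for corpora of this size.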


This work has been funded in part by the project Towards Computational Lexical Semantic Change Detection, supported by the Swedish Research Council (2019-2022; dnr 2018-01184), and by Nationella Språkbanken (the Swedish National Language Bank), jointly funded by the Swedish Research Council (2018-2024; dnr 2017-00626) and its 10 partner institutions, to NT.
